Social media text normalization for Turkish

  • GÜLŞEN ERYİĞİT and DİLARA TORUNOĞLU-SELAMET
Abstract

Text normalization is an indispensable stage in processing noncanonical language from natural sources such as speech, social media or short text messages. Research in this field is recent and has focused mostly on English. As is known from other areas of natural language processing, morphologically rich languages (MRLs) pose many challenges that English does not. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage, purely statistical model. This article introduces the first work on social media text normalization for an MRL and presents the first complete social media text normalization system for Turkish. It conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides a solution for each by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case-insensitive evaluation).
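
To make the two technical ideas in the abstract concrete, the following is a minimal sketch of a cascaded candidate generator and of an exact match score with an optional case-insensitive mode. It is not the authors' system: the three modules, the toy vocabulary `VOCAB`, the abbreviation table and all function names are hypothetical and do not correspond to the seven error categories of the article. The sketch only illustrates the control flow, in which each token is passed from module to module until one of them proposes a candidate.

```python
from typing import Callable, Optional, Sequence

# Hypothetical toy vocabulary; a real system would use a lexicon or a
# morphological analyzer, which matters a great deal for an MRL like Turkish.
VOCAB = {"geliyorum", "merhaba", "çok", "güzel"}

Module = Callable[[str], Optional[str]]


def lookup_abbreviation(token: str) -> Optional[str]:
    """Hypothetical module: expand a known abbreviation from a small table."""
    table = {"mrb": "merhaba"}
    return table.get(token)


def collapse_repeats(token: str) -> Optional[str]:
    """Hypothetical module: collapse runs of a repeated character and
    accept the result only if it is a vocabulary word."""
    out = []
    for ch in token:
        if not out or out[-1] != ch:
            out.append(ch)
    candidate = "".join(out)
    return candidate if candidate in VOCAB else None


def restore_diacritics(token: str) -> Optional[str]:
    """Hypothetical module: return a vocabulary word that equals the token
    once Turkish diacritics are stripped from the word."""
    strip = str.maketrans("çğıöşü", "cgiosu")
    for word in sorted(VOCAB):
        if word.translate(strip) == token:
            return word
    return None


def normalize(token: str, modules: Sequence[Module]) -> str:
    """Cascade: keep in-vocabulary tokens, otherwise return the first
    candidate proposed by any module, in order; fall back to the input."""
    if token in VOCAB:
        return token
    for module in modules:
        candidate = module(token)
        if candidate is not None:
            return candidate
    return token


def exact_match(predicted: Sequence[str], gold: Sequence[str],
                case_sensitive: bool = True) -> float:
    """Fraction of positions where the prediction equals the gold token.
    Note: str.casefold is not Turkish-aware (dotted/dotless i), so the
    case-insensitive mode here is only an approximation."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = 0
    for p, g in zip(predicted, gold):
        if not case_sensitive:
            p, g = p.casefold(), g.casefold()
        hits += p == g
    return hits / len(gold)


CASCADE = [lookup_abbreviation, collapse_repeats, restore_diacritics]
normalized = [normalize(t, CASCADE) for t in ["mrb", "cok", "güzeeel", "geliyorum"]]
```

The order of the modules is a design choice in a cascade of this kind: cheap, high-precision steps such as table lookup run first, so that later and more permissive steps only see tokens that earlier ones could not handle. The gap between a case-sensitive and a case-insensitive score, as reported in the abstract for the second dataset, corresponds to toggling `case_sensitive` in `exact_match`.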

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering