Social media text normalization for Turkish

  • GÜLŞEN ERYİĞİT and DİLARA TORUNOĞLU-SELAMET
Abstract

Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is recent and focuses mostly on English. As is known from other areas of natural language processing, morphologically rich languages (MRLs) pose many challenges that English does not. Turkish is a strong representative of MRLs and has particular normalization problems that may not be easily solved by a single-stage, purely statistical model. This article introduces the first work on the social media text normalization of an MRL and presents the first complete social media text normalization system for Turkish. The article conducts an in-depth analysis of the error types encountered in Web 2.0 Turkish texts, categorizes them into seven groups and provides a solution for each group by dividing the candidate generation task into separate modules working in a cascaded architecture. For the first time in the literature, two manually normalized Web 2.0 datasets are introduced for Turkish normalization studies. The exact match scores of the overall system on the provided datasets are 70.40 per cent and 67.37 per cent (77.07 per cent with a case-insensitive evaluation).
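
The exact match figures above can be made concrete with a minimal sketch. It assumes a token-level score (the share of tokens whose system output is identical to the manually normalized form) and a Turkish-aware lowercasing step for the case-insensitive variant. The abstract does not spell out either detail, so the function names `turkish_lower` and `exact_match` and the toy tokens below are illustrative assumptions rather than the authors' evaluation code.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text, mapping dotted and dotless I the Turkish way.

    Assumption: plain str.lower() is not enough for Turkish, because
    'I' must become 'ı' and 'İ' must become 'i'.
    """
    return text.replace("İ", "i").replace("I", "ı").lower()


def exact_match(predicted, gold, case_insensitive=False):
    """Return the percentage of tokens whose prediction equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    if case_insensitive:
        predicted = [turkish_lower(token) for token in predicted]
        gold = [turkish_lower(token) for token in gold]
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


# Hypothetical system output and gold normalizations for four tokens.
predicted = ["istanbul", "geliyorum", "Işık", "tmm"]
gold = ["İstanbul", "geliyorum", "ışık", "tamam"]

print(exact_match(predicted, gold))                         # 25.0
print(exact_match(predicted, gold, case_insensitive=True))  # 75.0
```

In this toy example two of the three mismatches are purely differences in capitalization, which is the kind of gap that separates a case-sensitive score from a case-insensitive one.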

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering