Selection of correction candidates for the normalization of Spanish user-generated content

  • M. MELERO (a1), M.R. COSTA-JUSSÀ (a2), P. LAMBERT (a1) and M. QUIXAL (a3)
Abstract

We present research aimed at building tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires revisiting the initial steps of Natural Language Processing, since UGC (micro-blogs, blogs, and, more generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics and is often closer to oral and colloquial language than to edited text. We present a corpus of Spanish UGC drawn from three sources (Twitter, consumer reviews, and blogs) and describe its main characteristics. We motivate the need for UGC normalization by analyzing the problems that arise when this type of text is processed through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim in this paper is to harness existing spell and grammar correction engines and endow them with automatic normalization capabilities, paving the way for the application of standard Natural Language Processing tools to typical UGC text. Specifically, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker; the module selects the most plausible correction from the unranked list of candidates that the spell-checker provides. To build this selector module we train four language models, each containing a different type of linguistic information, in a trade-off with its generalization capability. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with parametrized combinations of the models, obtained both by optimizing directly on the selection task and by linear interpolation of the models. The resulting combinations obtain results close to those of the best-performing model but do not improve on them, as measured on the test set. The precision of the selector module in ranking the expected correction first on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
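
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one smoothed bigram model, the toy corpus, the equal interpolation weights, and every name below (`BigramLM`, `interpolated_logprob`, `select_candidate`) are assumptions made for illustration. Only two of the four model types (truecase and lowercase word forms) are mimicked, because the abstract does not name the others. Each candidate from the spell-checker's unranked list is placed in its sentence context, the sentence is scored with a linear interpolation of the models, and the highest-scoring candidate is kept.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram model; a toy stand-in, not the paper's models."""

    def __init__(self, sentences, transform=lambda w: w):
        # `transform` maps each word form to the representation this model
        # is trained on (identity for truecase, str.lower for lowercase).
        self.transform = transform
        self.unigrams = Counter()
        self.bigrams = Counter()
        for tokens in sentences:
            padded = self._pad(tokens)
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 reserves mass for unseen words

    def _pad(self, tokens):
        return ["<s>"] + [self.transform(t) for t in tokens] + ["</s>"]

    def bigram_probs(self, tokens):
        """Return the smoothed probability of each bigram in the padded sentence."""
        padded = self._pad(tokens)
        return [
            (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)
            for prev, word in zip(padded, padded[1:])
        ]


def interpolated_logprob(models, weights, tokens):
    """Sentence log-probability under a per-bigram linear interpolation of the models."""
    per_model = [m.bigram_probs(tokens) for m in models]
    return sum(
        math.log(sum(w * p for w, p in zip(weights, probs)))
        for probs in zip(*per_model)
    )


def select_candidate(left, candidates, right, models, weights):
    """Pick the candidate that makes the whole sentence most probable."""
    return max(
        candidates,
        key=lambda c: interpolated_logprob(models, weights, left + [c] + right),
    )


# Hypothetical three-sentence training corpus, for illustration only.
corpus = [
    ["me", "gusta", "mucho", "Madrid"],
    ["me", "gusta", "mucho", "el", "cine"],
    ["el", "macho", "alfa"],
]
truecase_lm = BigramLM(corpus)
lowercase_lm = BigramLM(corpus, transform=str.lower)
models, weights = [truecase_lm, lowercase_lm], [0.5, 0.5]

# Unranked candidates, as a spell-checker might return them for one misspelled token.
best = select_candidate(["me", "gusta"], ["macho", "mucho", "muchos"], ["Madrid"], models, weights)
print(best)
```

In this sketch the weights are fixed by hand; the abstract reports that the combination parameters were instead obtained by optimizing directly on the selection task or by linear interpolation, and that neither combination improved on the best single model.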

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering