How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts

Katja Zupan; Nikola Ljubešić; Tomaž Erjavec

doi:10.1017/S1351324919000366

Published online by Cambridge University Press: 09 September 2019

Katja Zupan*
Affiliation: Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
Nikola Ljubešić
Affiliation: Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
Tomaž Erjavec
Affiliation: Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
*Corresponding author. Email: nikola.ljubesic@ijs.si

Abstract

Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation, where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.
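
The claim that word normalisation lowers the tagger's out-of-vocabulary (OOV) rate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `oov_rate` and `normalise`, the toy vocabulary, and the lookup-table normaliser are hypothetical and are not the method or data used in the paper, which learns normalisation from manually annotated training data.

```python
def oov_rate(tokens, vocabulary):
    """Share of tokens not found in the tagger's training vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok.lower() not in vocabulary)
    return unknown / len(tokens)


def normalise(tokens, mapping):
    """Replace each token with its standard form when the mapping has one."""
    return [mapping.get(tok.lower(), tok) for tok in tokens]


# Hypothetical toy data, not taken from the paper's corpora.
standard_vocab = {"jaz", "tudi", "ne", "vem", "kaj"}
normalisation_map = {"jst": "jaz", "tud": "tudi"}  # non-standard -> standard
tokens = ["jst", "tud", "ne", "vem", "kaj"]

before = oov_rate(tokens, standard_vocab)
after = oov_rate(normalise(tokens, normalisation_map), standard_vocab)
print(f"OOV rate before: {before:.2f}, after: {after:.2f}")
```

In this toy setting two of the five tokens are unknown to the standard vocabulary before normalisation and none afterwards. A real normaliser would also make errors, which is one reason the paper compares the approaches by the resulting PoS accuracy rather than by OOV rate alone.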

Keywords

Non-standard language; Normalisation; Domain adaptation; Part-of-speech tagging

Type: Article
Information: Natural Language Engineering, Volume 25, Special Issue 5: Natural Language Processing for Similar Languages, Varieties and Dialects, September 2019, pp. 651–674

DOI: https://doi.org/10.1017/S1351324919000366
Copyright: © Cambridge University Press 2019
