
How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts

Published online by Cambridge University Press: 09 September 2019

Katja Zupan*
Affiliation:
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
Nikola Ljubešić
Affiliation:
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
Tomaž Erjavec
Affiliation:
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
*Corresponding author. Email: nikola.ljubesic@ijs.si

Abstract

Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.

Type
Article
Copyright
© Cambridge University Press 2019 

