
Neural morphosyntactic tagging for Rusyn

Published online by Cambridge University Press:  18 July 2019

Yves Scherrer*
Affiliation:
Department of Digital Humanities, University of Helsinki, Unioninkatu 40, Helsinki, 00014, Finland
Achim Rabus
Affiliation:
Slavisches Seminar, Albert-Ludwigs-Universität Freiburg, Werthmannstr. 14, 79098 Freiburg, Germany
*Corresponding author. Email: yves.scherrer@helsinki.fi

Abstract

The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It does not require any annotated Rusyn training data, nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 as compared to previous research.
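The micro-averaged F1 figure reported for full morphological tagging can be illustrated with a minimal sketch. It assumes, hypothetically, that each token's tag is a UD-style string such as `NOUN|Case=Nom|Number=Sing` and that every `|`-separated part counts as one atomic label; the paper's exact tag format and scoring script are not given here, so the function names and the tag encoding below are illustrative assumptions only.

```python
def split_tag(tag):
    """Split a tag such as 'NOUN|Case=Nom|Number=Sing' into atomic labels.

    Assumption: '|' separates the POS from the features and '_' marks an
    empty slot. This encoding is hypothetical, not taken from the paper.
    """
    return {part for part in tag.split("|") if part and part != "_"}


def micro_f1(gold_tags, pred_tags):
    """Micro-averaged F1 over atomic labels, pooled across all tokens."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences differ in length")
    tp = fp = fn = 0
    for gold, pred in zip(gold_tags, pred_tags):
        g, p = split_tag(gold), split_tag(pred)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["NOUN|Case=Nom|Number=Sing", "VERB|Tense=Past"]
pred = ["NOUN|Case=Acc|Number=Sing", "VERB|Tense=Past"]
score = micro_f1(gold, pred)
```

Because counts are pooled over all labels before precision and recall are computed, a token with one wrong feature still earns credit for its correct features, unlike exact-match tag accuracy.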

Type: Article
Copyright: © Cambridge University Press 2019

