Neural morphosyntactic tagging for Rusyn

  • Yves Scherrer (a1) and Achim Rabus (a2)

Abstract

The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It requires no annotated Rusyn training data, and no parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and more training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, although genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9 percentage points of micro-averaged F1 over previous research.
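As an illustration of the evaluation metric named above, the following is a minimal sketch of micro-averaged F1 over morphological tags. It assumes, hypothetically, that each token's tag is a `|`-separated string of a POS label and feature=value pairs, and that true positives, false positives, and false negatives are pooled over all tokens before precision and recall are computed. The tag format and the helper names `parse_feats` and `micro_f1` are assumptions made for this example and are not taken from the paper.

```python
def parse_feats(tag: str) -> set[str]:
    """Split a tag such as 'NOUN|Case=Nom|Number=Sing' into its components.

    Assumption: '|' separates the POS label and the feature=value pairs,
    and '_' marks an empty tag.
    """
    return {part for part in tag.split("|") if part and part != "_"}


def micro_f1(gold_tags: list[str], pred_tags: list[str]) -> float:
    """Micro-averaged F1: pool counts over all tokens, then compute P, R, F1."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences must have equal length")

    tp = fp = fn = 0
    for gold, pred in zip(gold_tags, pred_tags):
        gold_set, pred_set = parse_feats(gold), parse_feats(pred)
        tp += len(gold_set & pred_set)   # components predicted correctly
        fp += len(pred_set - gold_set)   # components predicted but not in gold
        fn += len(gold_set - pred_set)   # gold components that were missed

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["NOUN|Case=Nom|Number=Sing", "VERB|Tense=Past"]
    pred = ["NOUN|Case=Acc|Number=Sing", "VERB|Tense=Past"]
    print(f"micro-F1 = {micro_f1(gold, pred):.2f}")
```

Under this scheme a tag with one wrong feature still earns partial credit for the components it gets right, which is why the micro-averaged score can differ from exact-match tag accuracy.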

*Corresponding author. Email: yves.scherrer@helsinki.fi
