Neural morphosyntactic tagging for Rusyn

Yves Scherrer; Achim Rabus

doi:10.1017/S1351324919000287

Published online by Cambridge University Press: 18 July 2019

Yves Scherrer*: Department of Digital Humanities, University of Helsinki, Unioninkatu 40, Helsinki, 00014, Finland
Achim Rabus: Slavisches Seminar, Albert-Ludwigs-Universität Freiburg, Werthmannstr. 14, 79098 Freiburg, Germany
*Corresponding author. Email: yves.scherrer@helsinki.fi

Abstract

The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It does not require any annotated Rusyn training data, nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9 percentage points (absolute) in micro-averaged F1 compared to previous research.
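The reported metric, micro-averaged F1 for full morphological tagging, can be illustrated with a minimal sketch. It assumes that each token is annotated with a set of feature labels and that true positives are pooled over all tokens before precision and recall are computed. The abstract does not spell out the exact evaluation protocol, so the function name `micro_f1`, the UD-style label format, and the toy data are all hypothetical.

```python
from typing import List, Set, Tuple


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-token feature sets.

    Counts are pooled across all tokens before the ratios are computed,
    which is what distinguishes micro- from macro-averaging.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    # A true positive is a feature label present in both gold and prediction.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical three-token example (labels are illustrative, not from the paper).
gold = [
    {"POS=NOUN", "Case=Nom", "Number=Sing"},
    {"POS=VERB", "Tense=Past", "Number=Sing"},
    {"POS=ADP"},
]
pred = [
    {"POS=NOUN", "Case=Acc", "Number=Sing"},  # wrong case value
    {"POS=VERB", "Tense=Past"},               # missing number feature
    {"POS=ADP"},
]

precision, recall, f1 = micro_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this scheme a tagger gets partial credit for a token whose part of speech is right but whose case is wrong, which is why feature-level F1 is usually more forgiving than exact-match accuracy on the full tag.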

Keywords

Morphology; Tagging; Part-of-speech tagging; Multilinguality; Transfer learning; Multitask learning

Information

Type: Article
Information: Natural Language Engineering, Volume 25, Special Issue 5: Natural Language Processing for Similar Languages, Varieties and Dialects, September 2019, pp. 633–650

DOI: https://doi.org/10.1017/S1351324919000287
Copyright: © Cambridge University Press 2019
