
New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies

  • MARCOS GARCIA (a1), CARLOS GÓMEZ-RODRÍGUEZ (a2) and MIGUEL A. ALONSO (a2)
Abstract

This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, comparing its performance with that of parsers trained on manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks, and the adaptation of resources to the target language. The results show that directly applying a parser from one Romance language to another reaches labeled attachment score (LAS) values similar to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can increase noticeably with a focused selection of the source treebanks. Furthermore, removing the words from the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned from these experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations on this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than a parser trained on more than 16,000 (for LAS) or more than 20,000 (for UAS) manually labeled tokens of Galician.
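For readers unfamiliar with the two metrics and with delexicalization, the following is a minimal illustrative sketch in Python, not the authors' evaluation code. It assumes each token is represented as a (head index, dependency label) pair and that gold and predicted tokens are already aligned; the function names `attachment_scores` and `delexicalize` are hypothetical. It also assumes every token is scored, whereas some evaluation setups exclude punctuation, and it assumes delexicalization is done by replacing the FORM and LEMMA columns of a CoNLL-U token line with an underscore.

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); a head of 0 denotes the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold and predicted tokens."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS counts tokens whose predicted head matches the gold head.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


def delexicalize(conllu_line: str) -> str:
    """Replace the FORM and LEMMA columns of a CoNLL-U token line with '_'.

    Blank lines and comment lines (starting with '#') are returned unchanged.
    """
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line
    cols = conllu_line.split("\t")
    cols[1] = "_"  # FORM
    cols[2] = "_"  # LEMMA
    return "\t".join(cols)


# Toy four-token sentence (invented for illustration only).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

In the toy sentence, three of four heads are correct (UAS 75) but only two tokens have both the correct head and the correct label (LAS 50), which is why LAS is never higher than UAS for the same output.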

Footnotes

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MICINN) through a Juan de la Cierva formación grant (FJCI-2014-22853), by the projects with references FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R (MINECO), and by the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement no. 714150 – FASTPARSE).

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering