New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies†

MARCOS GARCIA; CARLOS GÓMEZ-RODRÍGUEZ; MIGUEL A. ALONSO

doi:10.1017/S1351324917000377

Published online by Cambridge University Press: 06 October 2017

MARCOS GARCIA: Affiliation:
LyS Group, Departamento de Letras, Facultade de Filoloxía, Universidade da Coruña, Campus de A Coruña, 15071 A Coruña, Galicia, Spain. E-mail: marcos.garcia.gonzalez@udc.gal
CARLOS GÓMEZ-RODRÍGUEZ: Affiliation:
LyS Group, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de A Coruña, 15071 A Coruña, Galicia, Spain. E-mail: carlos.gomez@udc.es
MIGUEL A. ALONSO: Affiliation:
LyS Group, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de A Coruña, 15071 A Coruña, Galicia, Spain. E-mail: miguel.alonso@udc.es

Abstract

This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, comparing its performance with that obtained from manually annotated resources in the target languages. The experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks, and the adaptation of resources to the target language. The results show that directly applying a parser trained on one Romance language to another reaches labeled attachment score (LAS) values similar to those obtained with about 3,000 manually annotated tokens in the target language, and unlabeled attachment score (UAS) values equivalent to those obtained with around 7,000 tokens, depending on the case. These figures can increase noticeably with a focused selection of the source treebanks. Furthermore, removing the words from the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned from these experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations on this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than parsers trained on more than 16,000 (for LAS) and more than 20,000 (for UAS) manually labeled tokens of Galician.
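
To make the two metrics and the delexicalization step named in the abstract concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes each token is represented as a (head index, dependency label) pair and that delexicalization means blanking the FORM and LEMMA columns of a CoNLL-U token line while keeping the part-of-speech tags; the abstract itself only describes delexicalization as removing the words. The function names `attachment_scores` and `delexicalize` are hypothetical, and details such as the treatment of punctuation in scoring are not covered here.

```python
from typing import List, Tuple

# A token analysis is (head_index, dependency_label); head 0 marks the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions of tokens.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses differ in length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    head_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    full_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return head_hits / len(gold), full_hits / len(gold)


def delexicalize(conllu_line: str) -> str:
    """Blank the FORM and LEMMA columns of one CoNLL-U token line (assumed scheme)."""
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line  # comments and sentence breaks are left untouched
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("a CoNLL-U token line has exactly 10 tab-separated columns")
    cols[1] = "_"  # FORM
    cols[2] = "_"  # LEMMA
    return "\t".join(cols)


if __name__ == "__main__":
    # Toy four-token sentence, invented for illustration only.
    gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
    pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
    uas, las = attachment_scores(gold, pred)
    print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=0.75 LAS=0.50"
    print(delexicalize("1\tO\to\tDET\t_\t_\t2\tdet\t_\t_"))
```

In the toy example, three of four heads are correct but only two of those also carry the correct label, which is why LAS can never exceed UAS for the same parse.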

Information

Type: Articles
Information: Natural Language Engineering, Volume 24, Issue 1, January 2018, pp. 91–122

DOI: https://doi.org/10.1017/S1351324917000377
Copyright: Copyright © Cambridge University Press 2017

Footnotes

† This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MICINN) through a Juan de la Cierva formación grant (FJCI-2014-22853), by the projects with references FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R (MINECO), and by the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement no. 714150 – FASTPARSE).
