A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages

  • O. ZENNAKI, N. SEMMAR and L. BESACIER
Abstract

This work focuses on the rapid development of linguistic annotation tools for low-resource languages, that is, languages with no labeled training data. We experiment with several cross-lingual annotation projection methods using recurrent neural network (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between the source and target languages. More precisely, our approach has the following characteristics: (a) it does not use word alignment information; (b) it does not assume any knowledge about the target languages, the only requirement being that the source and target languages are not too syntactically divergent, which makes it applicable to a wide range of low-resource languages; and (c) it provides truly multilingual taggers (one tagger for N languages). We investigate both unidirectional and bidirectional RNN models and propose a method to include external information (for instance, low-level information from part-of-speech tags) in the RNN to train higher-level taggers (for instance, Super Sense taggers). We demonstrate the validity and genericity of our model using parallel corpora obtained by manual or automatic translation. We conduct experiments to induce cross-lingual part-of-speech and Super Sense taggers. We also apply our approach in a weakly supervised context, where it shows excellent potential for very low-resource settings (fewer than 1k training utterances).
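To make the idea of an alignment-free multilingual word representation more concrete, here is a minimal sketch in Python. The abstract does not specify how the representation is built, so the construction below is an assumption made for illustration: each word, in either language, is represented by a binary vector marking the parallel sentence pairs in which it occurs. The function names `occurrence_vectors` and `cosine`, and the toy English-French corpus, are hypothetical.

```python
from collections import defaultdict


def occurrence_vectors(parallel_corpus):
    """Map (language, word) to a binary vector over sentence-pair indices.

    Assumption for illustration: a word is described only by the set of
    parallel sentence pairs it appears in, so no word alignment is needed.
    """
    n = len(parallel_corpus)
    vectors = defaultdict(lambda: [0] * n)
    for i, (src_sentence, tgt_sentence) in enumerate(parallel_corpus):
        for word in src_sentence.lower().split():
            vectors[("src", word)][i] = 1
        for word in tgt_sentence.lower().split():
            vectors[("tgt", word)][i] = 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy parallel corpus (hypothetical data): (source sentence, target sentence).
corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("a cat eats", "un chat mange"),
]

vecs = occurrence_vectors(corpus)
sim_cat_chat = cosine(vecs[("src", "cat")], vecs[("tgt", "chat")])
sim_cat_chien = cosine(vecs[("src", "cat")], vecs[("tgt", "chien")])
print(sim_cat_chat, sim_cat_chien)
```

In this sketch, words that tend to be translations of each other occur in the same sentence pairs and therefore receive similar vectors, so source and target words live in one shared space. A tagger trained on such vectors for the annotated source language could, in principle, be applied to target-language words as well, which is the intuition behind a single tagger for N languages.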

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering