
Building and using multimodal comparable corpora for machine translation

Published online by Cambridge University Press:  15 June 2016

Université du Maine, Avenue Olivier Messiaen F-72085 - LE MANS, France


In recent decades, statistical approaches have significantly advanced the development of machine translation systems. However, the applicability of these methods depends directly on the availability of very large quantities of parallel data. Recent work has demonstrated that comparable corpora can compensate for the shortage of parallel corpora. In this paper, we propose an alternative to text-only comparable corpora as a resource for extracting parallel data: a multimodal comparable corpus with audio documents in the source language and text documents in the target language, built from the Euronews and TED web sites. The audio is transcribed by an automatic speech recognition system and then translated with a baseline statistical machine translation system. We then use information retrieval over a large text corpus in the target language to extract parallel sentences and phrases. We evaluate the quality of the extracted data on an English-to-French translation task and show significant improvements over a state-of-the-art baseline.
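The retrieval step of the pipeline described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the machine-translated transcripts serve as retrieval queries, similarity is scored with TF-IDF cosine similarity, and a pair is kept only when the score reaches a fixed threshold. All function names and the threshold value are hypothetical, and the speech recognition and translation stages are assumed to have run already.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real tokenization."""
    return text.lower().split()


def build_idf(corpus_tokens):
    """Return (idf per term, idf assigned to terms unseen in the corpus)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    unseen_idf = math.log(1 + n_docs) + 1.0
    return idf, unseen_idf


def tfidf_vector(tokens, idf, unseen_idf):
    """Sparse TF-IDF vector; unseen terms still count toward the norm."""
    counts = Counter(tokens)
    return {t: c * idf.get(t, unseen_idf) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(source_sentences, translated_queries, target_corpus, threshold=0.5):
    """Pair each source sentence with its best-matching target sentence.

    translated_queries[i] is the machine translation of source_sentences[i].
    It is used only as a query; the retrieved target-language sentence is
    what goes into the extracted pair. The threshold is an assumed value.
    """
    if not target_corpus:
        return []
    corpus_tokens = [tokenize(s) for s in target_corpus]
    idf, unseen_idf = build_idf(corpus_tokens)
    corpus_vectors = [tfidf_vector(t, idf, unseen_idf) for t in corpus_tokens]
    pairs = []
    for source, query in zip(source_sentences, translated_queries):
        query_vector = tfidf_vector(tokenize(query), idf, unseen_idf)
        scores = [cosine(query_vector, v) for v in corpus_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((source, target_corpus[best], scores[best]))
    return pairs
```

Calling `extract_pairs` with English transcripts, their French machine translations, and a list of French sentences returns `(source, retrieved_target, score)` triples for the queries that clear the threshold. A linear scan over the corpus is used here only for brevity; a large target corpus would call for an inverted index or a dedicated retrieval toolkit.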

Copyright © Cambridge University Press 2016 



This work has been partially funded by the French Government under the project DEPART.

