Building and using multimodal comparable corpora for machine translation†

HAITHEM AFLI; LOÏC BARRAULT; HOLGER SCHWENK

doi:10.1017/S1351324916000152

Published online by Cambridge University Press: 15 June 2016

Affiliation (all authors): Université du Maine, Avenue Olivier Messiaen, F-72085 Le Mans, France
E-mails: haithem.afli@lium.univ-lemans.fr, loic.barrault@lium.univ-lemans.fr, holger.schwenk@lium.univ-lemans.fr

Abstract

In recent decades, statistical approaches have significantly advanced the development of machine translation systems. However, the applicability of these methods depends directly on the availability of very large quantities of parallel data. Recent work has demonstrated that comparable corpora can compensate for the shortage of parallel corpora. In this paper, we propose an alternative to text-only comparable corpora as a resource for extracting parallel data: a multimodal comparable corpus with audio documents in the source language and text documents in the target language, built from the Euronews and TED websites. The audio is transcribed by an automatic speech recognition system and then translated with a baseline statistical machine translation system. We then use the translations as queries for information retrieval over a large text corpus in the target language in order to extract parallel sentences and phrases. We evaluate the quality of the extracted data on an English-to-French translation task and show significant improvements over a state-of-the-art baseline.
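The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the audio has already been transcribed and machine-translated, so both steps are represented only by their outputs. The abstract does not specify the retrieval model or any filtering criterion: the TF-IDF cosine similarity, the `threshold` parameter, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency over the target-language corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(source_sentences, translated_queries, target_corpus,
                  threshold=0.5):
    """Pair each source sentence with its best-matching target sentence.

    source_sentences: ASR transcripts in the source language.
    translated_queries: their automatic translations, in the same order.
    threshold: hypothetical similarity cut-off (an assumption of this sketch).
    """
    corpus_tokens = [tokenize(s) for s in target_corpus]
    idf = build_idf(corpus_tokens)
    corpus_vectors = [tfidf(tokens, idf) for tokens in corpus_tokens]
    pairs = []
    for source, query in zip(source_sentences, translated_queries):
        query_vector = tfidf(tokenize(query), idf)
        scores = [cosine(query_vector, vec) for vec in corpus_vectors]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        # Keep the pair only if the retrieved sentence is similar enough.
        if scores[best] >= threshold:
            pairs.append((source, target_corpus[best], scores[best]))
    return pairs
```

Calling `extract_pairs` with English transcripts, their French machine translations, and a list of French sentences returns `(source, target, score)` triples for the candidates that pass the cut-off. The source side of each pair is the original transcript rather than its translation, which is what makes the output usable as parallel training data.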

Information

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 4: Machine Translation Using Comparable Corpora, July 2016, pp. 603–625

DOI: https://doi.org/10.1017/S1351324916000152
Copyright: Copyright © Cambridge University Press 2016

Footnotes

† This work has been partially funded by the French Government under the project DEPART.
