
Building and using multimodal comparable corpora for machine translation

Published online by Cambridge University Press:  15 June 2016

HAITHEM AFLI, LOÏC BARRAULT and HOLGER SCHWENK
Affiliation:
Université du Maine, Avenue Olivier Messiaen, F-72085 Le Mans, France. E-mails: haithem.afli@lium.univ-lemans.fr, loic.barrault@lium.univ-lemans.fr, holger.schwenk@lium.univ-lemans.fr

Abstract

In recent decades, statistical approaches have significantly advanced the development of machine translation systems. However, the applicability of these methods depends directly on the availability of very large quantities of parallel data. Recent work has demonstrated that a comparable corpus can compensate for the shortage of parallel corpora. In this paper, we propose an alternative to text-only comparable corpora as a resource for extracting parallel data: a multimodal comparable corpus with audio documents in the source language and text documents in the target language, built from the Euronews and TED websites. The audio is transcribed by an automatic speech recognition system and then translated with a baseline statistical machine translation system. We then use information retrieval over a large text corpus in the target language to extract parallel sentences and phrases. We evaluate the quality of the extracted data on an English-to-French translation task and show significant improvements over a state-of-the-art baseline.
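
The retrieval step described in the abstract can be illustrated with a toy sketch. It is not the authors' system: the speech recognition and baseline translation stages are assumed to have already run (their outputs are plain strings here), TF-IDF cosine similarity stands in for whatever information retrieval engine the paper uses, and the function names, example sentences and the 0.5 threshold are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def build_idf(tokenized_docs):
    # Smoothed inverse document frequency over the target-language corpus.
    n = len(tokenized_docs)
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    unseen_idf = math.log(1 + n) + 1.0  # weight for terms absent from the corpus
    return idf, unseen_idf


def vectorize(tokens, idf, unseen_idf):
    # Sparse TF-IDF vector; unseen terms still count toward the vector norm.
    return {t: tf * idf.get(t, unseen_idf) for t, tf in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def extract_pairs(source_sentences, translated_hypotheses, target_corpus,
                  threshold=0.5):
    # For each translated hypothesis, retrieve the most similar target
    # sentence and keep the (source, target) pair if it is similar enough.
    tokenized = [tokenize(s) for s in target_corpus]
    idf, unseen_idf = build_idf(tokenized)
    index = [vectorize(doc, idf, unseen_idf) for doc in tokenized]
    pairs = []
    for src, hyp in zip(source_sentences, translated_hypotheses):
        query = vectorize(tokenize(hyp), idf, unseen_idf)
        scores = [cosine(query, vec) for vec in index]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((src, target_corpus[best], scores[best]))
    return pairs


# Hypothetical data: source-side transcripts, their machine translations,
# and a small target-language text corpus.
source = ["the parliament adopted the budget", "it is raining in paris"]
hypotheses = ["le parlement a adopté le budget", "il pleut sur la lune"]
target = [
    "Le parlement a adopté le budget hier.",
    "Les marchés ont fortement chuté.",
    "Le festival commence demain.",
]

pairs = extract_pairs(source, hypotheses, target)
for src, tgt, score in pairs:
    print(f"{score:.2f}\t{src}\t{tgt}")
```

In this sketch the first hypothesis retrieves a close match and is kept, while the second shares no vocabulary with the target corpus and is discarded. The threshold trades recall for precision in the extracted data.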

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 
