Extracting parallel phrases from comparable data for machine translation†

SANJIKA HEWAVITHARANA; STEPHAN VOGEL

doi:10.1017/S1351324916000139

Published online by Cambridge University Press: 15 June 2016

SANJIKA HEWAVITHARANA: Raytheon BBN Technologies, Cambridge, MA 02138, USA. Email: shewavit@bbn.com
STEPHAN VOGEL: Qatar Computing Research Institute, Doha, Qatar. Email: svogel@qf.org.qa

Abstract

Mining parallel data from comparable corpora is a promising approach to overcoming data sparseness in statistical machine translation and other natural language processing applications. In this paper, we address the task of detecting parallel phrase pairs embedded in comparable sentence pairs. We present a novel phrase alignment approach designed to align only the parallel sections of a sentence pair, bypassing the non-parallel sections. We compare the proposed approach with two other alignment methods: (1) the standard phrase extraction algorithm, which relies on the Viterbi path of the word alignment, and (2) a binary classifier that detects parallel phrase pairs when presented with a large collection of phrase pair candidates. We evaluate the accuracy of these approaches on a manually aligned data set and show that the proposed approach outperforms the other two. Finally, we demonstrate the effectiveness of the extracted phrase pairs by using them in Arabic–English and Urdu–English translation systems, which yields improvements of up to 1.2 Bleu over the baseline. The main contributions of this paper are twofold: (1) novel phrase alignment algorithms for extracting parallel phrase pairs from comparable sentences, and (2) an evaluation of the utility of the extracted phrases by using them directly in the MT decoder.
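
For readers unfamiliar with the first baseline named in the abstract, the following is a minimal sketch of the widely used consistency heuristic for extracting phrase pairs from a word alignment. It is an illustration only, not the authors' proposed method or their exact baseline configuration: the function name `extract_phrase_pairs`, the `max_len` limit, and the toy sentences are assumptions, and the usual extension of phrases over unaligned boundary words is omitted for brevity.

```python
from typing import List, Set, Tuple

# A word alignment is assumed to be a set of (source_index, target_index) links,
# e.g. the links on the Viterbi path of a word alignment model.
Alignment = Set[Tuple[int, int]]


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Alignment,
    max_len: int = 4,  # hypothetical phrase length limit, in words
) -> List[Tuple[str, str]]:
    """Return phrase pairs that are consistent with the word alignment."""
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word inside the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # a span with no links yields no phrase pair
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word in the target span may be linked
            # to a source word outside the source span.
            violates = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if violates:
                continue
            pairs.append(
                (" ".join(src[s_start:s_end + 1]), " ".join(tgt[t_start:t_end + 1]))
            )
    return pairs


if __name__ == "__main__":
    source = ["das", "haus", "ist", "klein"]
    target = ["the", "house", "is", "small"]
    links = {(0, 0), (1, 1), (2, 2), (3, 3)}
    for pair in extract_phrase_pairs(source, target, links, max_len=2):
        print(pair)
```

Because every extracted pair depends on the alignment links, errors in the word alignment of the non-parallel parts of a comparable sentence pair can block or distort extraction, which is the weakness the abstract's proposed approach is designed to avoid.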

Information

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 4: Machine Translation Using Comparable Corpora, July 2016, pp. 549–573

DOI: https://doi.org/10.1017/S1351324916000139
Copyright: © Cambridge University Press 2016

Footnotes

† Part of this work was conducted while the authors were affiliated with the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
