An overview of the phrase-based statistical machine translation techniques

  • Marta Ruiz Costa-jussà
Abstract

This work gives a general overview of statistical machine translation (SMT), a subfield of machine translation (MT). Specifically, it focuses on one of the most popular SMT approaches: the phrase-based system.

Phrase-based translation units are typically extracted using statistical criteria and weighted using several models. These models are combined log-linearly during decoding, which selects the most probable translation. Translation quality has improved significantly since the original phrase-based SMT systems. The main remaining challenges include reordering, domain adaptation and evaluation.
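
As a rough illustration of the log-linear combination mentioned above, the following minimal sketch scores each candidate translation as a weighted sum of log model probabilities and picks the highest-scoring one. The function names, the two model names, the weights and the probabilities are all hypothetical values chosen for the example; a real decoder searches a far larger hypothesis space and tunes its weights on held-out data.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of log feature values: sum over m of lambda_m * log h_m."""
    return sum(weights[name] * math.log(value) for name, value in features.items())


def best_translation(candidates, weights):
    """Return the candidate string with the highest log-linear score."""
    return max(candidates, key=lambda text: loglinear_score(candidates[text], weights))


# Hypothetical weights and model probabilities, for illustration only.
weights = {"translation_model": 1.0, "language_model": 0.5}
candidates = {
    "the house is small": {"translation_model": 0.20, "language_model": 0.10},
    "the home is little": {"translation_model": 0.25, "language_model": 0.01},
}

best = best_translation(candidates, weights)
print(best)
```

In this toy setting the second candidate has the higher translation-model probability, but its much lower language-model probability outweighs that advantage once the two scores are combined.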

The Knowledge Engineering Review
  • ISSN: 0269-8889
  • EISSN: 1469-8005
  • URL: /core/journals/knowledge-engineering-review