Efficient Phrase Table pruning for Hindi to English machine translation through syntactic and marker-based filtering and hybrid similarity measurement

NILADRI CHATTERJEE; SUSMITA GUPTA

doi:10.1017/S1351324918000360

Published online by Cambridge University Press: 09 October 2018

NILADRI CHATTERJEE and SUSMITA GUPTA: Affiliation:
Department of Mathematics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
e-mails: niladri.iitd@gmail.com, susmita.chakraborty@gmail.com

Abstract

For a given training corpus of parallel sentences, the quality of the output produced by a translation system relies heavily on the underlying similarity measurement criteria. A phrase-based machine translation system derives its output through a generative process using a Phrase Table comprising source and target language phrases. Consequently, the more effective the Phrase Table is, in terms of its size and the output that may be derived from it, the better the expected outcome of the underlying translation system. However, finding the most similar phrase(s) in a given training corpus that can help generate a good-quality translation poses a serious challenge. In practice, a Phrase Table often contains many parallel phrase entries that are either redundant or do not contribute effectively to the translation results. Identifying these candidate entries and removing them will not only reduce the size of the Phrase Table but should also improve the processing speed for generating translations. The present paper develops a scheme based on syntactic structure and the marker hypothesis (Green 1979, The necessity of syntax markers: two experiments with artificial languages, Journal of Verbal Learning and Verbal Behavior) for reducing the size of a Phrase Table without compromising much on translation quality, by retaining only the non-redundant and meaningful parallel phrases. The proposed scheme is complemented with an appropriate similarity measurement scheme to achieve maximum efficiency in terms of BLEU scores. Although designed for Hindi to English machine translation, the overall approach is quite general and is expected to be easily adaptable to other language pairs.

Information

Type: Article
Information: Natural Language Engineering, Volume 25, Issue 1, January 2019, pp. 171–210

DOI: https://doi.org/10.1017/S1351324918000360
Copyright: Copyright © Cambridge University Press 2018

References

Abdelali, A., Guzman, F., Sajjad, H., and Vogel, S. 2014. The AMARA corpus: building parallel language resources for the educational domain. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014).

Badr, G., Joanis, E., Larkin, S., and Kuhn, R. 2007. Manageable phrase-based statistical machine translation models. Computer Recognition Systems 2 (CORES 2007), pp. 437–44. Advances in Soft Computing. Wroclaw, Poland: Springer.

Balyan, R., and Chatterjee, N. 2015. Translating noun compounds using semantic relations. In Proceeding of the International Speech Communication Association of Journal, ISCA, vol. 32, no. 1, pp. 91–108.

Bharati, A., Sangal, R., and Sharma, D. 2007. SSF: Shakti standard format guide. Technical Report TR-LTRC-33, LTRC, IIIT-Hyderabad.

Bhargava, R., Baoni, A., Jain, H., and Sharma, Y. 2016. BITS_PILANI@DPIL-FIRE2016: Paraphrase detection in Hindi language using syntactic features of phrase. In Proceedings of Forum of Information Retrieval (FIRE-2016).

Bhattacharyya, P. 2010. IndoWordNet. In Proceedings of the Language Resources and Evaluation Conference (LREC-2010). Malta.

Bojar, O., Diatka, V., Rychly, P., Stranak, P., Suchomel, V., Tamchyna, A., and Zeman, D. 2014. HindEnCorp - Hindi-English and Hindi-only corpus for machine translation. In Proceedings of Language Resources and Evaluation Conference.

Bojar, O., Stranák, P., Zeman, D., Jain, G., and Damani, O. P. 2010. UMC002: English-Hindi parallel corpus. Data/software, UFAL, Charles University in Prague. http://ufal.mff.cuni.cz/umc/

Cao, H., Finch, A., and Sumita, E. 2010. Syntactic constraints on phrase extraction for phrase-based machine translation. In Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation (SSST-4), Beijing, China, pp. 28–33.

Chatterjee, N. 2001. A statistical approach for similarity measurement between sentences for EBMT. In Proceedings of the Symposium on Translation Support Systems (STRANS-2001). Indian Institute of Technology, Kanpur.

Chooi-Ling, G., and Sumita, E. 2009. A feature-rich supervised word alignment model for phrase-based statistical machine translation. International Journal on Asian Language Processing 19 (3): 109–125.

Deng, Y., Xu, J., and Gao, Y. 2008. Phrase Table training for precision and recall: what makes a good phrase and a good phrase pair? In Proceedings of the ACL/HLT 2008. Columbus, OH, USA.

Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. 2013. PPDB: the paraphrase database. In Proceedings of the Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL-2013), pp. 758–764.

Green, T. 1979. The necessity of syntax markers: two experiments with artificial languages. Journal of Verbal Learning and Verbal Behavior 18: 481–96.

Imamura, K. 2002. Application of translation knowledge acquired by hierarchical phrase alignment for pattern-based MT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Johnson, H., Martin, J., Foster, G., and Kuhn, R. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-2007), Prague, Czech Republic.

Koehn, P. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Biennial Conference of the Association for Machine Translation in the Americas, Washington, DC.

Koehn, P., Och, F., and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, May 27–June 01, Edmonton, Canada.

Kunchukuttan, A., Mehta, P., and Bhattacharyya, P. 2017. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Language Resources and Evaluation Conference (LREC-2018), Under review.

Ling, W., Graca, J., Trancoso, I., and Black, A. 2012. Entropy-based pruning for phrase-based machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Maruyama, H., and Watanabe, H. 1992. Tree cover search algorithm for example-based translation. In Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1992), Kyoto, Japan.

Nagao, M. 1984. A framework of a mechanical translation between Japanese and English by analogy principle. In Elithorn, A. and Banerji, R. (eds.), Artificial and Human Intelligence. Amsterdam: North Holland.

Nayak, T., Pal, S., Naskar, S., Bandyopadhyay, S., and Genabith, J. 2016. Beyond translation memories: generating translation suggestions based on parsing and POS tagging. In Proceedings of the 2nd Workshop on Natural Language Processing for Translation Memories, Portoroz, Slovenia, pp. 12–20.

Niehues, J., and Waibel, A. 2012. Detailed analysis of different strategies for phrase table adaptation in SMT. In Proceedings of the American Machine Translation Association (AMTA-2012). San Diego, CA.

Och, F. J. 2003. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL-2003).

Och, F. J., Tillmann, C., and Ney, H. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL-2002).

Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. 2015. PPDB 2.0: better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, vol. 2, Short papers, p. 425.

Recassen, M., Costa-Jussà, J., and Fonollosa, A. R. 2015. Latest trends in hybrid machine translation and its applications. Computer Speech and Language 32: 3–10.

Sánchez-Martínez, F., and Way, A. 2009. Marker-based filtering of bilingual phrase pairs for SMT. In Proceedings of EAMT 2009, the 13th Annual Meeting of the European Association for Machine Translation, Barcelona, Spain.

Singh, S. B. 2003. English-Hindi Translation Grammar. Delhi: Prabhat Prakashan. pp. 167–86.

Sumita, E., Iida, H., and Kohyama, H. 1990. Translating with examples: a new approach to machine translation. In Proceedings of the 3rd International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1990), Austin, TX.

Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Language Resources and Evaluation Conference.

Yang, M., and Zheng, J. 2009. Toward smaller, faster, and better hierarchical phrase-based SMT. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Suntec, Singapore.

Zens, R., Stanton, D., and Xu, P. 2012. A systematic comparison of Phrase Table pruning techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Zhechev, V., and Genabith, J. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas.
