Efficient Phrase Table pruning for Hindi to English machine translation through syntactic and marker-based filtering and hybrid similarity measurement

Published online by Cambridge University Press: 09 October 2018

NILADRI CHATTERJEE
Affiliation:
Department of Mathematics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. E-mails: niladri.iitd@gmail.com, susmita.chakraborty@gmail.com
SUSMITA GUPTA
Affiliation:
Department of Mathematics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. E-mails: niladri.iitd@gmail.com, susmita.chakraborty@gmail.com

Abstract

For a given training corpus of parallel sentences, the quality of the output produced by a translation system relies heavily on the underlying similarity measurement criteria. A phrase-based machine translation system derives its output through a generative process using a Phrase Table comprising source and target language phrases. Consequently, the more effective the Phrase Table is, in terms of its size and the output that can be derived from it, the better the expected outcome of the underlying translation system. However, finding the most similar phrase(s) from a given training corpus that can help generate a good-quality translation poses a serious challenge. In practice, a Phrase Table often contains many parallel phrase entries that are either redundant or do not contribute effectively to the translation results. Identifying these candidate entries and removing them will not only reduce the size of the Phrase Table but should also improve the processing speed of translation. The present paper develops a scheme based on syntactic structure and the marker hypothesis (Green 1979, The necessity of syntax markers: two experiments with artificial languages, Journal of Verbal Learning and Behavior) for reducing the size of a Phrase Table without compromising much on the translation quality of the output, by retaining only the non-redundant and meaningful parallel phrases. The proposed scheme is complemented with an appropriate similarity measurement scheme to achieve maximum efficiency in terms of BLEU scores. Although designed for Hindi to English machine translation, the overall approach is quite general and is expected to be easily adaptable to other language pairs.
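
To make the idea of Phrase Table pruning concrete, the following is a minimal sketch in Python. It is not the scheme developed in the paper, whose details are not given in the abstract. It assumes a simplified Moses-style entry with a single score (`source ||| target ||| score`), a tiny hypothetical list of English marker words (`ENGLISH_MARKERS`), and a simplified marker-based criterion: a target phrase that ends in a dangling marker word, or consists only of marker words, is treated as a fragment and dropped. The function names, the sample entries (romanized Hindi), and the `top_k` cutoff are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny list of English marker (closed-class) words.
ENGLISH_MARKERS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "or"}


def parse_entry(line):
    """Split a simplified 'source ||| target ||| score' line into its fields."""
    source, target, score = [field.strip() for field in line.split("|||")]
    return source, target, float(score)


def is_fragment(target):
    """Assumed criterion: the phrase is empty, ends in a marker word,
    or contains no non-marker word at all."""
    words = target.lower().split()
    if not words:
        return True
    if words[-1] in ENGLISH_MARKERS:
        return True
    return all(word in ENGLISH_MARKERS for word in words)


def prune(lines, top_k=2):
    """Drop fragment targets, then keep the top_k highest-scoring
    targets for each source phrase."""
    by_source = defaultdict(list)
    for line in lines:
        source, target, score = parse_entry(line)
        if is_fragment(target):
            continue
        by_source[source].append((score, target))
    pruned = {}
    for source, candidates in by_source.items():
        candidates.sort(reverse=True)  # highest score first
        pruned[source] = [target for _, target in candidates[:top_k]]
    return pruned


# Illustrative entries only; not taken from the paper's data.
SAMPLE = [
    "ghar mein ||| in the house ||| 0.6",
    "ghar mein ||| in the ||| 0.3",
    "ghar mein ||| at home ||| 0.1",
    "ghar mein ||| inside the house ||| 0.05",
]

if __name__ == "__main__":
    print(prune(SAMPLE))
```

In this sketch the entry `in the` is removed by the marker check, and the lowest-scoring surviving candidate is removed by the `top_k` cutoff, so the table shrinks from four entries to two. A real system would work on the full Moses score vector and would need a marker list and criterion suited to both Hindi and English.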

Type
Article
Copyright
Copyright © Cambridge University Press 2018 

References

Abdelali, A., Guzman, F., Sajjad, H., and Vogel, S. 2014. The amara corpus: building parallel language resources for the educational domain. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014).
Badr, G., Joanis, E., Larkin, S., and Kuhn, R. 2007. Manageable phrase-based statistical machine translation models. Computer Recognition Systems 2 (CORES 2007), pp. 437–44. Advances in Soft Computing. Wroclaw, Poland: Springer.
Balyan, R., and Chatterjee, N. 2015. Translating Noun Compounds using Semantic Relations. In Proceeding of the International Speech Communication Association of Journal, ISCA, vol. 32, no. 1, pp. 91–108.
Bharati, A., Sangal, R., and Sharma, D. 2007. SSF: Shakti standard format guide. Technical Report TR-LTRC-33, LTRC, IIIT-Hyderabad.
Bhargava, R., Baoni, A., Jain, H., and Sharma, Y. 2016. BITS_PILANI@DPIL-FIRE2016: Paraphrase Detection in Hindi Language using Syntactic Features of Phrase. In Proceedings of Forum of Information Retrieval (FIRE-2016).
Bhattacharyya, P. 2010. IndoWordNet. In Proceedings of the Language Resources and Evaluation Conference (LREC-2010). Malta.
Bojar, O., Diatka, V., Rychly, P., Stranak, P., Suchomel, V., Tamchyna, A., and Zeman, D. 2014. HindEnCorp - Hindi-English and Hindi-only corpus for machine translation. In Proceedings of Language Resources and Evaluation Conference.
Bojar, O., Stranák, P., Zeman, D., Jain, G., and Damani, O. P. 2010. UMC002: English-Hindi parallel corpus. Data/software, UFAL, Charles University in Prague. http://ufal.mff.cuni.cz/umc/.
Cao, H., Finch, A., and Sumita, E. 2010. Syntactic constraints on phrase extraction for phrase-based machine translation. In Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation (SSST-4), Beijing, China, pp. 28–33.
Chatterjee, N. 2001. A statistical approach for similarity measurement between sentences for EBMT. In Proceedings of the Symposium on Translation Support Systems (STRANS-2001). Indian Institute of Technology, Kanpur.
Chooi-Ling, G., and Sumita, E. 2009. A feature-rich supervised word alignment model for phrase-based statistical machine translation. International Journal on Asian Language Processing 19 (3): 109–125.
Deng, Y., Xu, J., and Gao, Y. 2008. Phrase Table training for precision and recall: what makes a good phrase and a good phrase pair? In Proceedings of the ACL/HLT 2008. Columbus, OH, USA.
Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. 2013. PPDB: the paraphrase database. In Proceedings of the Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL-2013), pp. 758–764.
Green, T. 1979. The necessity of syntax markers: two experiments with artificial languages. Journal of Verbal Learning and Behavior 18: 481–96.
Imamura, K. 2002. Application of translation knowledge acquired by hierarchical phrase alignment for pattern-based MT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Johnson, H., Martin, J., Foster, G., and Kuhn, R. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-2007), Prague, Czech Republic.
Koehn, P. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Biennial Conference of the Association for Machine Translation in the Americas, Washington, DC.
Koehn, P., Och, F., and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, May 27–June 01, Edmonton, Canada.
Kunchukuttan, A., Mehta, P., and Bhattacharyya, P. 2017. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Language Resources and Evaluation Conference (LREC-2018), Under review.
Ling, W., Graca, J., Trancoso, I., and Black, A. 2012. Entropy-based pruning for phrase-based machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Maruyama, H., and Watanabe, H. 1992. Tree cover search algorithm for example-based translation. In Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1992), Kyoto, Japan.
Nagao, M. 1984. A framework of a mechanical translation between Japanese and English by analogy principle. In Elithorn, A. and Banerji, R. (eds.), Artificial and Human Intelligence. Amsterdam: North Holland.
Nayak, T., Pal, S., Naskar, S., Bandyopadhyay, S., and Genabith, J. 2016. Beyond translation memories: generating translation suggestions based on parsing and POS tagging. In Proceedings of the 2nd Workshop on Natural Language Processing for Translation Memories, Portoroz, Slovenia, pp. 12–20.
Niehues, J., and Waibel, A. 2012. Detailed analysis of different strategies for phrase table adaptation in SMT. In Proceedings of the American Machine Translation Association (AMTA-2012). San Diego, CA.
Och, F. J. 2003. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL-2003).
Och, F. J., Tillmann, C., and Ney, H. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint Conference of Empirical Methods in Natural Language Processing and Very Large Corpora.
Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL-2002).
Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. 2015. PPDB 2.0: better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, vol. 2, Short papers, p. 425.
Recassen, M., Costa-Jussà, J., and Fonollosa, A. R. 2015. Latest trends in hybrid machine translation and its applications. Computer Speech and Language 32: 3–10.
Sánchez-Martínez, F., and Way, A. 2009. Marker-based filtering of bilingual phrase pairs for SMT. In Proceedings of the EAMT 2009, the 13th Annual Meeting of the European Association for Machine Translation, Barcelona, Spain.
Singh, S. B. 2003. English-Hindi Translation Grammar. Delhi: Prabhat Prakashan. pp. 167–86.
Sumita, E., Lida, H., and Kohyama, H. 1990. Translating with examples: a new approach to machine translation. In Proceedings of the 3rd International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1990), Austin, TX.
Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Language Resources and Evaluation Conference.
Yang, M., and Zheng, J. 2009. Toward smaller, faster, and better hierarchical phrase-based SMT. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Suntec, Singapore.
Zens, R., Stanton, D., and Peng, X. 2012. A systematic comparison of Phrase Table pruning techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Zhechev, V., and Genabith, J. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas.