Efficient Phrase Table pruning for Hindi to English machine translation through syntactic and marker-based filtering and hybrid similarity measurement

  • NILADRI CHATTERJEE and SUSMITA GUPTA
Abstract

For a given training corpus of parallel sentences, the quality of the output produced by a translation system relies heavily on the underlying similarity measurement criteria. A phrase-based machine translation system derives its output through a generative process using a Phrase Table comprising source and target language phrases. Consequently, the more effective the Phrase Table is, in terms of both its size and the output that can be derived from it, the better the expected outcome of the translation system. However, finding the most similar phrase(s) in a given training corpus that can help generate a good-quality translation poses a serious challenge. In practice, a Phrase Table often contains many parallel phrase entries that are either redundant or do not contribute effectively to the translation results. Identifying these candidate entries and removing them will not only reduce the size of the Phrase Table but should also improve the processing speed of translation. The present paper develops a scheme based on syntactic structure and the marker hypothesis (Green 1979, The necessity of syntax markers: two experiments with artificial languages, Journal of Verbal Learning and Behavior) for reducing the size of a Phrase Table without compromising much on the translation quality of the output, by retaining only the non-redundant and meaningful parallel phrases. The proposed scheme is complemented by an appropriate similarity measurement scheme to achieve maximum efficiency in terms of BLEU scores. Although designed for Hindi to English machine translation, the overall approach is quite general and is expected to be easily adaptable to other language pairs.
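
To make the idea of marker-based Phrase Table pruning concrete, the following is a minimal sketch and not the authors' actual scheme. It assumes a simplified three-field entry format (`source ||| target ||| score`), tiny hypothetical marker word lists (romanised Hindi postpositions and English closed-class words), and a toy rule: keep a pair only if the Hindi side ends with a marker and the English side begins with one, then keep the best-scoring copy of any duplicated pair. All function names and word lists are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical, deliberately tiny marker sets (assumptions for illustration only).
# Hindi markers are romanised postpositions, which typically close a chunk;
# English markers are closed-class words, which typically open a chunk.
HI_MARKERS = {"ne", "ko", "se", "mein", "par", "ka", "ki", "ke"}
EN_MARKERS = {"the", "a", "an", "in", "on", "of", "to", "from", "with"}

PhrasePair = Tuple[str, str, float]


def parse_line(line: str) -> PhrasePair:
    """Parse a simplified 'source ||| target ||| score' entry (assumed format)."""
    src, tgt, score = (field.strip() for field in line.split("|||"))
    return src, tgt, float(score)


def is_marker_consistent(src: str, tgt: str) -> bool:
    """Toy rule: Hindi side ends with a marker, English side starts with one."""
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    if not src_tokens or not tgt_tokens:
        return False
    return src_tokens[-1] in HI_MARKERS and tgt_tokens[0] in EN_MARKERS


def prune_phrase_table(lines: Iterable[str]) -> List[PhrasePair]:
    """Drop marker-inconsistent pairs and collapse duplicates to the best score."""
    best: Dict[Tuple[str, str], float] = {}
    for line in lines:
        src, tgt, score = parse_line(line)
        if not is_marker_consistent(src, tgt):
            continue  # filtered out by the toy marker rule
        key = (src, tgt)
        if key not in best or score > best[key]:
            best[key] = score  # keep only the highest-scoring duplicate
    return [(src, tgt, score) for (src, tgt), score in best.items()]
```

Calling `prune_phrase_table` on a list of such entries returns the surviving `(source, target, score)` tuples. A real system would need full marker lists, the actual Phrase Table format of the toolkit in use, and the syntactic and similarity criteria described in the paper, none of which are reproduced here.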

References
Abdelali, A., Guzman, F., Sajjad, H., and Vogel, S. 2014. The amara corpus: building parallel language resources for the educational domain. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014).
Badr, G., Joanis, E., Larkin, S., and Kuhn, R. 2007. Manageable phrase-based statistical machine translation models. Computer Recognition Systems 2 (CORES 2007), pp. 437–44. Advances in Soft Computing. Wroclaw, Poland: Springer.
Balyan, R., and Chatterjee, N. 2015. Translating noun compounds using semantic relations. In Proceeding of the International Speech Communication Association of Journal, ISCA, vol. 32, no. 1, pp. 91–108.
Bharati, A., Sangal, R., and Sharma, D. 2007. SSF: Shakti standard format guide. Technical Report TR-LTRC-33, LTRC, IIIT-Hyderabad.
Bhargava, R., Baoni, A., Jain, H., and Sharma, Y. 2016. BITS_PILANI@DPIL-FIRE2016: Paraphrase Detection in Hindi Language using Syntactic Features of Phrase. In Proceedings of Forum of Information Retrieval (FIRE-2016).
Bhattacharyya, P. 2010. IndoWordNet. In Proceedings of the Language Resources and Evaluation Conference (LREC-2010). Malta.
Bojar, O., Diatka, V., Rychly, P., Stranak, P., Suchomel, V., Tamchyna, A., and Zeman, D. 2014. HindEnCorp - Hindi-English and Hindi-only corpus for machine translation. In Proceedings of Language Resources and Evaluation Conference.
Bojar, O., Stranák, P., Zeman, D., Jain, G., and Damani, O. P. 2010. UMC002: English-Hindi parallel corpus. Data/software, UFAL, Charles University in Prague. http://ufal.mff.cuni.cz/umc/.
Cao, H., Finch, A., and Sumita, E. 2010. Syntactic constraints on phrase extraction for phrase-based machine translation. In Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation (SSST-4), Beijing, China, pp. 28–33.
Chatterjee, N. 2001. A statistical approach for similarity measurement between sentences for EBMT. In Proceedings of the Symposium on Translation Support Systems (STRANS-2001). Indian Institute of Technology. Kanpur.
Chooi-Ling, G., and Sumita, E. 2009. A feature-rich supervised word alignment model for phrase-based statistical machine translation. International Journal on Asian Language Processing 19 (3): 109–125.
Deng, Y., Xu, J., and Gao, Y. 2008. Phrase Table training for precision and recall: what makes a good phrase and a good phrase pair? In Proceedings of the ACL/HLT 2008. Columbus, OH, USA.
Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. 2013. PPDB: the paraphrase database. In Proceedings of the Joint Human Language Technology Conference/Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL-2013), pp. 758–764.
Green, T. 1979. The necessity of syntax markers: two experiments with artificial languages. Journal of Verbal Learning and Behavior 18: 481–96.
Imamura, K. 2002. Application of translation knowledge acquired by hierarchical phrase alignment for pattern-based MT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Johnson, H., Martin, J., Foster, G., and Kuhn, R. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL–2007), Prague, Czech Republic.
Koehn, P. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Biennial Conference of the Association for Machine Translation in the Americas, Washington, DC.
Koehn, P., Och, F., and Marcu, D. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, May 27–June 01, Edmonton, Canada.
Kunchukuttan, A., Mehta, P., Bhattacharyya, P. 2017. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Language Resources and Evaluation Conference (LREC-2018), Under review.
Ling, W., Graca, J., Trancoso, I., and Black, A. 2012. Entropy-based pruning for phrase-based machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Maruyama, H., and Watanabe, H. 1992. Tree cover search algorithm for example-based translation. In Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1992), Kyoto, Japan.
Nagao, M. 1984. A framework of a mechanical translation between Japanese and English by analogy principle. In Elithorn, A. and Banerji, R. (eds.), Artificial and Human Intelligence. Amsterdam: North Holland.
Nayak, T., Pal, S., Naskar, S., Bandyopadhyay, S., and Genabith, J. 2016. Beyond translation memories: generating translation suggestions based on parsing and POS tagging. In Proceedings of the 2nd Workshop on Natural Language Processing for Translation Memories, Portoroz, Slovenia, pp. 12–20.
Niehues, J., and Waibel, A. 2012. Detailed analysis of different strategies for phrase table adaptation in SMT. In Proceedings of the American Machine Translation Association (AMTA-2012). San Diego, CA.
Och, F. J. 2003. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL-2003).
Och, F. J., Tillmann, C., and Ney, H. 1999. Improved alignment models for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
Papineni, K., Roukos, S., Ward, T., and Zhu, W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL-2002).
Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. 2015. PPDB 2.0: better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, vol. 2, Short papers, p. 425.
Recassen, M., Costa-Jussà, J., and Fonollosa, A. R. 2015. Latest trends in hybrid machine translation and its applications. Computer Speech and Language 32: 3–10.
Sánchez-Martínez, F., and Way, A. 2009. Marker-based filtering of bilingual phrase pairs for SMT. In Proceedings of EAMT 2009, the 13th Annual Meeting of the European Association for Machine Translation, Barcelona, Spain.
Singh, S. B. 2003. English-Hindi Translation Grammar. Delhi: Prabhat Prakashan. pp. 167–86.
Sumita, E., Iida, H., and Kohyama, H. 1990. Translating with examples: a new approach to machine translation. In Proceedings of the 3rd International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1990), Austin, TX.
Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Language Resources and Evaluation Conference.
Yang, M., and Zheng, J. 2009. Toward smaller, faster, and better hierarchical phrase-based SMT. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Suntec, Singapore.
Zens, R., Stanton, D., and Peng, X. 2012. A systematic comparison of Phrase Table pruning techniques. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Zhechev, V., and Genabith, J. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas.
Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering