
Building a multi-domain comparable corpus using a learning to rank method

  • RAZIEH RAHIMI (a1), AZADEH SHAKERY (a1) (a2), JAVID DADASHKARIMI (a1), MOZHDEH ARIANNEZHAD (a1), MOSTAFA DEHGHANI (a3) and HOSSEIN NASR ESFAHANI (a1)...
Abstract

Comparable corpora are key translation resources for languages and domains with limited linguistic resources. Existing approaches for building comparable corpora mostly rank candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the contribution of each piece of evidence to the score of a candidate target document is determined heuristically. In this paper, we employ a learning to rank method to rank candidate target documents with respect to each source document. The ranking model is constructed by defining each piece of evidence for the similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of each feature depends on the characteristics of both the source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus that covers twelve domains obtained from the Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both the quality and the coverage of alignments.
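The idea of treating each piece of similarity evidence as a feature with a learned weight can be illustrated with a minimal sketch. The abstract does not name the learning algorithm, so a simple pairwise linear ranker with a perceptron-style update is swapped in here purely for illustration. The three features (retrieval score, proper-name overlap, publication-date closeness), the document identifiers, and all numeric values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical feature order (values below are made up for illustration):
# [cross-lingual retrieval score, proper-name overlap, date closeness]
Features = Sequence[float]


def score(weights: Sequence[float], feats: Features) -> float:
    """Linear ranking score: weighted sum of the similarity evidence."""
    return sum(w * f for w, f in zip(weights, feats))


def train_pairwise(
    queries: List[List[Tuple[Features, int]]],
    epochs: int = 50,
    lr: float = 0.1,
) -> List[float]:
    """Learn weights so comparable candidates outscore the others.

    Each entry of `queries` holds the candidates for one source document
    as (features, label) pairs; label 1 marks a comparable target.
    A perceptron-style update is applied to every (better, worse) pair
    whose score difference is below a margin of 1.
    """
    dim = len(queries[0][0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for candidates in queries:
            for pos_f, pos_y in candidates:
                for neg_f, neg_y in candidates:
                    if pos_y <= neg_y:
                        continue  # only pairs where the first is preferred
                    diff = [p - n for p, n in zip(pos_f, neg_f)]
                    if score(weights, diff) < 1.0:
                        weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def rank(weights: Sequence[float], candidates: Dict[str, Features]) -> List[str]:
    """Order candidate target documents by descending learned score."""
    return sorted(candidates, key=lambda d: score(weights, candidates[d]), reverse=True)


# Toy training data: two source documents with labeled candidates.
TRAIN = [
    [([0.9, 0.8, 1.0], 1), ([0.7, 0.1, 0.2], 0), ([0.3, 0.0, 0.9], 0)],
    [([0.6, 0.7, 0.9], 1), ([0.8, 0.0, 0.1], 0)],
]
weights = train_pairwise(TRAIN)

# Unseen candidates for a new source document (hypothetical identifiers).
candidates = {"fa_doc_a": [0.85, 0.05, 0.1], "fa_doc_b": [0.6, 0.75, 0.95]}
ranking = rank(weights, candidates)
print(ranking)
```

In this toy setup the candidate with the higher retrieval score but little supporting evidence (`fa_doc_a`) is ranked below the one supported by name overlap and date closeness, which is the kind of trade-off a heuristic weighting would have to fix by hand.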

Footnotes

This research was supported in part by a grant from the Institute for Research in Fundamental Sciences (No. CS1393-4-43).


Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering