Measuring bilingual corpus comparability

  • BO LI, ERIC GAUSSIER and DAN YANG
Abstract

Comparable corpora serve as an important substitute for parallel resources for under-resourced language pairs. Previous work has mostly sought better strategies for exploiting existing comparable corpora, while ignoring variation in corpus quality. Yet the quality of a comparable corpus strongly affects its usability in practice, as several studies have shown. Researchers have nevertheless not been able to establish a widely accepted and fully validated framework for measuring corpus quality. In this paper, we therefore investigate a comprehensive methodology for assessing the quality of comparable corpora. Specifically, we propose several comparability measures and a quantitative strategy for testing those measures. Our experiments show that the proposed comparability measure captures gold-standard comparability levels very well and is robust to the choice of bilingual dictionary. Moreover, we show on the task of bilingual lexicon extraction that the proposed measure correlates well with the performance of this real-world application.
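The abstract does not define the measures themselves, so the following is only an illustrative sketch of what a dictionary-based comparability measure and its validation could look like. It assumes a simple vocabulary-overlap score (the share of dictionary-covered words whose translation occurs in the other corpus) and uses Pearson correlation against gold-standard comparability levels as the test. The function names `comparability` and `pearson`, the toy vocabularies, and the choice of Pearson correlation are all assumptions made for illustration and are not taken from the paper.

```python
from math import sqrt


def comparability(src_vocab, tgt_vocab, src_to_tgt):
    """Hypothetical overlap score in [0, 1].

    src_vocab, tgt_vocab: sets of words observed in each corpus.
    src_to_tgt: bilingual dictionary mapping a source word to a set of
    target-language translations.
    """
    # Invert the dictionary so target words can be looked up as well.
    tgt_to_src = {}
    for src_word, translations in src_to_tgt.items():
        for tgt_word in translations:
            tgt_to_src.setdefault(tgt_word, set()).add(src_word)

    def translated(vocab, lexicon, other_vocab):
        # Only words covered by the dictionary can be judged at all.
        covered = [w for w in vocab if w in lexicon]
        # A word counts as "found" if any translation is in the other corpus.
        found = sum(1 for w in covered if lexicon[w] & other_vocab)
        return found, len(covered)

    found_src, covered_src = translated(src_vocab, src_to_tgt, tgt_vocab)
    found_tgt, covered_tgt = translated(tgt_vocab, tgt_to_src, src_vocab)
    total = covered_src + covered_tgt
    return (found_src + found_tgt) / total if total else 0.0


def pearson(xs, ys):
    """Pearson correlation between measure scores and gold-standard levels."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data, invented for illustration only.
toy_dictionary = {
    "house": {"maison"},
    "cat": {"chat"},
    "tree": {"arbre"},
    "car": {"voiture"},
}
english_vocab = {"house", "cat", "tree"}
french_vocab = {"maison", "chat", "voiture"}
score = comparability(english_vocab, french_vocab, toy_dictionary)
```

Under this sketch, a measure would count as validated if its scores over a series of corpora correlate strongly with the known gold-standard comparability levels of those corpora, and robustness could be probed by repeating the computation with a smaller or different dictionary.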

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering