A classification approach for detecting cross-lingual biomedical term translations

  • H. HAKAMI (a1) and D. BOLLEGALA (a2)


Finding translations for technical terms is an important problem in machine translation. In particular, in highly specialized domains such as biology or medicine, it is difficult to find bilingual experts to annotate sufficient cross-lingual texts in order to train machine translation systems. Moreover, new terms are constantly being generated in the biomedical community, which makes it difficult to keep the translation dictionaries up to date for all language pairs of interest. Given a biomedical term in one language (source language), we propose a method for detecting its translations in a different language (target language). Specifically, we train a binary classifier to determine whether two biomedical terms written in two languages are translations. Training such a classifier is often complicated due to the lack of common features between the source and target languages. We propose several feature space concatenation methods to successfully overcome this problem. Moreover, we study the effectiveness of contextual and character n-gram features for detecting term translations. Experiments conducted using a standard dataset for biomedical term translation show that the proposed method outperforms several competitive baseline methods in terms of mean average precision and top-k translation accuracy.
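As a rough illustration of the ideas named in the abstract, the sketch below builds character n-gram features for a source term and a target term and joins the two feature spaces by prefixing each feature with the side it came from. This is only one simple way to concatenate feature spaces and is not necessarily the method used in the paper; the function names `char_ngrams` and `concat_features`, the `src:`/`tgt:` prefixes, the boundary markers, and the n-gram range are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=3):
    """Count the character n-grams of a term.

    The term is lowercased and wrapped in '^' and '$' so that
    prefixes and suffixes yield distinct n-grams (an assumption
    of this sketch, not something stated in the abstract).
    """
    padded = f"^{term.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def concat_features(src_term, tgt_term, n_min=2, n_max=3):
    """Represent a (source, target) term pair as one feature dict.

    Source-side features get the prefix 'src:' and target-side
    features the prefix 'tgt:', so the two languages occupy
    disjoint blocks of a single concatenated feature space.
    """
    features = {}
    for gram, count in char_ngrams(src_term, n_min, n_max).items():
        features["src:" + gram] = count
    for gram, count in char_ngrams(tgt_term, n_min, n_max).items():
        features["tgt:" + gram] = count
    return features


# Hypothetical candidate pair; a binary classifier would receive this
# dict (after vectorization) together with a translation / non-translation label.
pair_features = concat_features("hepatitis", "hépatite")
```

Feature dicts of this kind could then be vectorized and passed to any binary classifier trained on labelled translation and non-translation pairs. Contextual features, which the abstract also mentions, would need corpus statistics and are left out of this sketch.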



