Leveraging bilingual terminology to improve machine translation in a CAT environment*

  • MIHAEL ARCAN (a1), MARCO TURCHI (a2), SARA TONELLI (a2) and PAUL BUITELAAR (a1)
Abstract

This work focuses on extracting automatically aligned bilingual terminology and integrating it into a Statistical Machine Translation (SMT) system in a Computer-Aided Translation (CAT) scenario. We evaluate a framework that takes a small set of parallel documents as input, gathers domain-specific bilingual terms, and injects them into an SMT system to enhance translation quality. To this end, we investigate several strategies for extracting and aligning terminology across languages and for integrating it into an SMT system. We compare two terminology injection methods that can be used at run-time without altering the normal operation of an SMT system: XML markup and a cache-based model. We test the cache-based model on two domains (information technology and medical) in English, Italian and German, showing significant improvements ranging from 2.23 to 6.78 BLEU points over a baseline SMT system and from 0.05 to 3.03 points compared to the widely used XML markup approach.
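
The XML markup method named in the abstract can be illustrated with a minimal sketch. It assumes a Moses-style input convention in which a source span is wrapped in an XML tag whose `translation` attribute carries the suggested target term. The tag name `term`, the function `annotate_terms`, and the example dictionary are hypothetical and are not taken from the paper; the authors' actual pipeline may differ.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def annotate_terms(sentence, term_dict, tag="term"):
    """Wrap known source terms in XML markup carrying a suggested translation.

    term_dict maps source terms to target terms (hypothetical example data).
    """
    if not term_dict:
        return escape(sentence)
    # Longest terms first, so multi-word terms win over their substrings.
    terms = sorted(term_dict, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {src.lower(): tgt for src, tgt in term_dict.items()}
    pieces, last = [], 0
    for match in pattern.finditer(sentence):
        # Escape the untouched text between two annotated terms.
        pieces.append(escape(sentence[last:match.start()]))
        translation = lookup[match.group(1).lower()]
        pieces.append(
            f"<{tag} translation={quoteattr(translation)}>"
            f"{escape(match.group(1))}</{tag}>"
        )
        last = match.end()
    pieces.append(escape(sentence[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    terms = {"control panel": "pannello di controllo", "panel": "pannello"}
    print(annotate_terms("Open the control panel and click the panel.", terms))
```

Matching longest terms first is a design choice of this sketch: it keeps a multi-word term such as "control panel" from being split by a shorter entry such as "panel". A cache-based model, by contrast, would add the term pairs to the decoder's models at run-time instead of rewriting the input sentence.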

Footnotes
* This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight).

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering