
Building a multi-domain comparable corpus using a learning to rank method

Published online by Cambridge University Press:  15 June 2016

RAZIEH RAHIMI
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran e-mails: razrahimi@ut.ac.ir, shakery@ut.ac.ir, dadashkarimi@ut.ac.ir, m.ariannezhad@ut.ac.ir, h_nasr@ut.ac.ir
AZADEH SHAKERY
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
JAVID DADASHKARIMI
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
MOZHDEH ARIANNEZHAD
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
MOSTAFA DEHGHANI
Affiliation:
Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands e-mail: dehghani@uva.nl
HOSSEIN NASR ESFAHANI
Affiliation:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran

Abstract

Comparable corpora are key translation resources for languages and domains with limited linguistic resources. Existing approaches for building comparable corpora mostly rank candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the contribution of each type of evidence to the scores of candidate target documents is determined heuristically. In this paper, we employ a learning-to-rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each type of evidence for the similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both the source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus that covers twelve different domains obtained from the Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both the quality and coverage of alignments.
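
The idea of learning feature weights for ranking candidate target documents can be illustrated with a minimal sketch. The abstract does not name the specific learning-to-rank algorithm, so the pairwise perceptron-style update below is an assumption chosen for brevity, not the authors' method. The three features (retrieval score, proper-name overlap, publication-date proximity), the function names, and all numeric values are hypothetical illustrations.

```python
import random

# Hypothetical feature order (an assumption, not taken from the paper):
# [cross-lingual retrieval score, proper-name overlap, publication-date proximity]


def score(weights, features):
    """Linear ranking score: weighted sum of similarity features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, n_features, epochs=50, lr=0.1, seed=0):
    """Learn feature weights from preference pairs.

    pairs: list of (better_features, worse_features), where the first
    candidate should be ranked above the second for the same source document.
    Uses a simple perceptron-style pairwise update (illustrative choice).
    """
    rng = random.Random(seed)
    weights = [0.0] * n_features
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            # Update only when the pair is tied or ranked in the wrong order.
            if score(weights, diff) <= 0:
                weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def rank(weights, candidates):
    """Return candidate ids sorted from best to worst score."""
    return sorted(candidates, key=lambda c: score(weights, candidates[c]), reverse=True)


# Made-up training pairs: (features of the aligned document, features of a non-aligned one).
pairs = [
    ([0.6, 0.9, 0.8], [0.7, 0.1, 0.2]),
    ([0.5, 0.8, 0.9], [0.6, 0.2, 0.1]),
    ([0.8, 0.7, 0.7], [0.4, 0.3, 0.3]),
]
weights = train_pairwise(pairs, n_features=3)

# Made-up candidate target documents for one source document.
candidates = {"doc_a": [0.7, 0.2, 0.1], "doc_b": [0.6, 0.9, 0.9]}
ranking = rank(weights, candidates)
print(weights, ranking)
```

In this toy setup the retrieval score alone would prefer `doc_a`, while the learned weights also take the other two features into account. This is the kind of trade-off the abstract describes as being set heuristically in earlier approaches.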

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 
