
Extraction of multi-word expressions from small parallel corpora

Published online by Cambridge University Press: 21 March 2012

YULIA TSVETKOV
Affiliation:
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA e-mail: yulia.tsvetkov@gmail.com
SHULY WINTNER
Affiliation:
Department of Computer Science, University of Haifa, Haifa, Israel e-mail: shuly@cs.haifa.ac.il

Abstract

We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naïve alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
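The anchor idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes an alignment is given as (source index, target index) pairs, treats a source token as an anchor when its link is 1:1, and proposes every run of two or more consecutive non-anchor source tokens as a candidate. The function name `mwe_candidates` and the toy sentence are hypothetical, and the ranking and filtering step that uses a monolingual corpus is omitted.

```python
from collections import Counter


def mwe_candidates(source_tokens, links):
    """Propose MWE candidates from one word-aligned sentence (illustrative only).

    links: iterable of (source_index, target_index) alignment pairs.
    Assumption: a source token is an anchor if it has exactly one link and
    its target token has no other link (a 1:1 alignment).
    """
    links = set(links)
    src_counts = Counter(i for i, _ in links)
    tgt_counts = Counter(j for _, j in links)
    anchors = {
        i for i, j in links
        if src_counts[i] == 1 and tgt_counts[j] == 1
    }

    candidates, run = [], []
    for i, token in enumerate(source_tokens):
        if i in anchors:
            # An anchor closes the current misaligned run.
            if len(run) >= 2:
                candidates.append(tuple(run))
            run = []
        else:
            # Unaligned or not 1:1 aligned: part of a misaligned span.
            run.append(token)
    if len(run) >= 2:
        candidates.append(tuple(run))
    return candidates


# Toy example: three source words all align to a single target word.
tokens = ["he", "kicked", "the", "bucket", "yesterday"]
alignment = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]
result = mwe_candidates(tokens, alignment)
```

In this toy sentence, "he" and "yesterday" are 1:1 anchors, so the span between them is the only candidate the sketch proposes.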

Information

Type
Articles
Copyright
Copyright © Cambridge University Press 2012
