
Evaluating optimal reference translations

Published online by Cambridge University Press:  08 May 2024

Vilém Zouhar*
Affiliation:
Department of Computer Science, ETH Zürich, Zürich, Switzerland; Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czechia
Věra Kloudová
Affiliation:
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czechia
Martin Popel
Affiliation:
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czechia
Ondřej Bojar
Affiliation:
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czechia
*Corresponding author: Vilém Zouhar; Email: vilem.zouhar@gmail.com

Abstract

The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are neither suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned, and comparable quality levels have been reached by MT alone in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this paper, we propose a methodology for creating more reliable document-level human reference translations, called “optimal reference translations,” with the simple aim of raising the bar of what should be deemed “human translation quality.” We evaluate the obtained document-level optimal reference translations in comparison with “standard” ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press
Figure 1. Example translations of the same source into Czech. Literal transcriptions of the translations are shown in italics. N1: translatologist collaboration (optimal translation), P1: professional translation agency (post-edited MT), P2, P3: professional translation agency.

Figure 2. First 5 rows of a screen for a single document with source and 4 translations in parallel. Screens were accessed by annotators in an online spreadsheet programme. Note: Scalable graphics—zoom in.

Figure 3. Averages of ratings for different translation sources on document (top-left) and segment (bottom-right) level across features.

Figure 4. Distribution densities of ratings of each collected variable (thin tail cropped $\geq$ 3 for higher resolution of high-density values). Numbers and horizontal lines show feature means.

Figure 5. Pearson’s correlations between individual features on document (top-left) and segment (bottom-right) level.

Figure 6. Predictions of linear regression models (on document and segment level) for all test set items sorted by true Overall score. Formulas show fitted coefficients and Pearson’s correlations with the true scores. Only a random subset of points shown for visibility.

Figure 7. Segment-level Pearson’s correlations between the collected scores and automated metrics between the original and edited versions of a segment. Colour is based on absolute value of the correlation (note TER).

Figure 8. Distribution densities of ratings of Overall for individual annotators.

Figure 9. Pearson’s correlations of predictions from segment-level aggregations to document-level scores. For example, for the Overall category with min aggregation: $\rho (\{d^{\text{Overall}}\,:\, d \in \mathcal{D}\}, \{\min \{s^{\text{Overall}}\,:\, s \in d\}\,:\, d \in \mathcal{D}\})$.
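
The quantity in the Figure 9 caption can be sketched in a few lines of Python. This is only an illustration of the formula, not the authors' code: the data layout (a list of dictionaries with `doc` and `segments` keys), the function names, and the toy scores are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def aggregation_correlation(documents, category="Overall", aggregate=min):
    """Correlate document-level scores with aggregated segment-level scores.

    `documents` is a hypothetical structure: each item holds one
    document-level rating per category and a list of segment-level ratings.
    """
    doc_scores = [d["doc"][category] for d in documents]
    agg_scores = [
        aggregate(s[category] for s in d["segments"]) for d in documents
    ]
    return pearson(doc_scores, agg_scores)


# Made-up toy ratings, chosen so the document score equals the worst segment.
toy_documents = [
    {"doc": {"Overall": 5.0}, "segments": [{"Overall": 5.0}, {"Overall": 6.0}]},
    {"doc": {"Overall": 3.0}, "segments": [{"Overall": 3.0}, {"Overall": 6.0}]},
    {"doc": {"Overall": 4.0}, "segments": [{"Overall": 4.0}, {"Overall": 5.0}]},
]

rho_min = aggregation_correlation(toy_documents, "Overall", min)
rho_max = aggregation_correlation(toy_documents, "Overall", max)
```

Passing a different callable as `aggregate` (for example `max`, or a mean) gives the other aggregation variants; in the toy data, `min` reproduces the document scores exactly while `max` does not track them at all.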

Figure 10. Scores of a subset of categories for a selected segment from the translation P3 by annotator A1 (translator) and annotators A{2,3,4} (non-translators).