
Calibration and context in human evaluation of machine translation

Published online by Cambridge University Press: 03 June 2024

Rebecca Knowles*
Affiliation:
National Research Council of Canada, Ottawa, ON, Canada
Chi-kiu Lo
Affiliation:
National Research Council of Canada, Ottawa, ON, Canada
* Corresponding author: Rebecca Knowles; Email: Rebecca.Knowles@nrc-cnrc.gc.ca

Abstract

Human evaluation of machine translation is considered the “gold standard” for evaluation, but it remains a challenging task for which to define best practices. Recent work has focused on incorporating intersentential context into human evaluation, to better distinguish between high-performing machine translation systems and human translations. In this work, we examine several ways that such context influences evaluation and evaluation protocols. We take a close look at annotator variation through the lens of calibration sets and focus on the implications for context-aware evaluation protocols. We then demonstrate one way in which degraded target-side intersentential context can influence annotator scores of individual sentences, a finding that supports the context-aware approach to evaluation and which also has implications for best practices in evaluation protocols.

Information

Type
Article
Creative Commons
CC BY-NC-ND
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use and/or adaptation of the article.
Copyright
© Crown Copyright - National Research Council of Canada, 2024. Published by Cambridge University Press

Table 1. Basic calibration HIT statistics

Table 2. Basic annotator statistics

Figure 1. Variations in annotator score distributions and how annotators use the 0–100 scale, shown for a subset of the English–Japanese calibration HITs. Figures along the diagonal show histograms of annotator scores, while the off-diagonal figures show a comparison between two annotators’ scores, with each point representing a segment (its x-value determined by the score given to it by the annotator for that column and its y-value determined by the score given to it by the annotator for that row).
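As a rough illustration of how such a pairwise comparison can be laid out, the sketch below (not the authors' code) draws per-annotator score histograms on the diagonal and annotator-vs-annotator scatter plots off the diagonal. The `scores` DataFrame, its annotator column names, and the randomly generated values are hypothetical stand-ins for the evaluation data.

```python
# Minimal sketch of a Figure 1-style pairwise score comparison.
# `scores` (one row per segment, one 0-100 score column per annotator)
# and its random contents are hypothetical stand-ins, not the paper's data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = pd.DataFrame({
    "annotator_A": rng.integers(0, 101, size=200),
    "annotator_B": rng.integers(0, 101, size=200),
    "annotator_C": rng.integers(0, 101, size=200),
})

annotators = list(scores.columns)
n = len(annotators)
fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n), squeeze=False)
for i, row_ann in enumerate(annotators):
    for j, col_ann in enumerate(annotators):
        ax = axes[i][j]
        if i == j:
            # Diagonal: histogram of one annotator's score distribution.
            ax.hist(scores[row_ann], bins=20, range=(0, 100))
        else:
            # Off-diagonal: each point is a segment, x from the column
            # annotator's score and y from the row annotator's score.
            ax.scatter(scores[col_ann], scores[row_ann], s=5)
        if i == n - 1:
            ax.set_xlabel(col_ann)
        if j == 0:
            ax.set_ylabel(row_ann)
fig.tight_layout()
plt.show()
```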

Figure 2. Annotator score histograms across calibration HITs and additional HITs completed (“bad references” omitted), for English–Japanese (same subset of annotators as shown in Figure 1). The x-axis is the score and the y-axis is the count of segments that were assigned that score.

Table 3. Pairwise annotator correlations
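Pairwise correlations of this kind can be computed directly from a per-annotator score table; the snippet below is a sketch reusing the hypothetical `scores` DataFrame from the previous example, and the choice of Pearson and Kendall statistics is an assumption rather than necessarily what Table 3 reports.

```python
# Sketch of pairwise annotator correlations over the hypothetical `scores`
# DataFrame defined above; the statistics shown are assumptions, not
# necessarily those reported in Table 3.
pearson = scores.corr(method="pearson")
kendall = scores.corr(method="kendall")
print("Pearson:\n", pearson.round(3))
print("Kendall tau:\n", kendall.round(3))
```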

Figure 3. A comparison of calibration means (x-axis), and two groups of other means (y-axis): annotator-level means (dots) and HIT-level means (x marks) for English–Japanese.
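One plausible reading of the aggregation behind such a comparison is sketched below, assuming a hypothetical long-format DataFrame `ratings` with `annotator`, `hit`, `is_calibration`, and `score` columns; none of these names, nor the simulated values, come from the paper.

```python
# Sketch of comparing per-annotator calibration means with annotator-level
# and HIT-level means. The long-format `ratings` DataFrame is hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
ratings = pd.DataFrame({
    "annotator": rng.choice(["A", "B", "C"], size=600),
    "hit": rng.integers(0, 10, size=600),
    "is_calibration": rng.random(600) < 0.3,
    "score": rng.integers(0, 101, size=600),
})

# Per-annotator mean over calibration items (x-axis in Figure 3).
calib_means = ratings[ratings["is_calibration"]].groupby("annotator")["score"].mean()
# Per-annotator mean over the remaining items (dots on the y-axis).
annot_means = ratings[~ratings["is_calibration"]].groupby("annotator")["score"].mean()
# Per-annotator, per-HIT means over the remaining items (x marks on the y-axis).
hit_means = (ratings[~ratings["is_calibration"]]
             .groupby(["annotator", "hit"])["score"].mean())
print(calib_means, annot_means, hit_means.head(), sep="\n")
```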

Table 4. Example quality control document pair

Figure 4. Density histograms showing the shift in score differences between repeat segments observed before encountering any “bad references” (blue/unhatched) and those after encountering “bad references” (red/hatched) for the 2022 data. The horizontal axis shows the score difference (original segment score minus the score in the quality control portion), while the vertical axis shows the fraction of pairs with that score difference. In both the before and after cases, we still observe that the most common difference is 0, but the rightward shift and thicker right tail in the red/hatched after-seeing-degraded-segments set indicates an overall bias toward scoring repeat segments lower in a degraded context.
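The following sketch shows how a before/after comparison of this shape can be plotted, assuming two hypothetical arrays of score differences (original score minus repeat score), one for repeats seen before any degraded "bad reference" and one for repeats seen after. The simulated values are stand-ins, not the 2022 evaluation data.

```python
# Sketch of a Figure 4-style comparison; `before_diffs` and `after_diffs`
# are simulated stand-ins for score differences (original minus repeat),
# not the actual 2022 data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
before_diffs = rng.normal(0, 8, size=500)   # centred near zero
after_diffs = rng.normal(3, 10, size=500)   # shifted right: repeats scored lower

bins = np.arange(-50, 55, 5)
for diffs, label, hatch in [(before_diffs, "before bad references", None),
                            (after_diffs, "after bad references", "//")]:
    # weights make each bar show the fraction of pairs in that bin
    plt.hist(diffs, bins=bins, weights=np.ones_like(diffs) / len(diffs),
             alpha=0.5, hatch=hatch, label=label)
plt.xlabel("original score minus repeat score")
plt.ylabel("fraction of pairs")
plt.legend()
plt.show()
```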

Figure 5. Paired histograms showing scores from the calibration HITs (top) and scores for “BAD reference” items (bottom). This shows the same subset of annotators as Figure 1, with the x-axis again representing the 0–100 score range.

Supplementary material

Knowles and Lo supplementary material (File, 10.6 MB)