
Can machine translation systems be evaluated by the crowd alone

Published online by Cambridge University Press: 16 September 2015

YVETTE GRAHAM
Affiliation: Department of Computing and Information Systems, The University of Melbourne, Parkville, 3010 VIC, Australia; ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin, College Green, Dublin 2, Ireland. e-mail: graham.yvette@gmail.com
TIMOTHY BALDWIN
Affiliation: Department of Computing and Information Systems, The University of Melbourne, Parkville, 3010 VIC, Australia. e-mail: tbaldwin@unimelb.edu.au
ALISTAIR MOFFAT
Affiliation: Department of Computing and Information Systems, The University of Melbourne, Parkville, 3010 VIC, Australia. e-mail: ammoffat@unimelb.edu.au
JUSTIN ZOBEL
Affiliation: Department of Computing and Information Systems, The University of Melbourne, Parkville, 3010 VIC, Australia. e-mail: jzobel@unimelb.edu.au

Abstract

Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others, in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of the Workshop on Statistical Machine Translation (WMT) shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. A comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially greater ability to identify significant differences between translation systems.
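The 'z' scores reported in Tables 4 and 5 below are per-worker standardized versions of the raw direct estimates described above. As a rough illustration of that standardization step (not the authors' exact implementation), the following Python sketch converts each worker's raw slider scores, assumed to lie on a 0-100 scale, into z-scores using that worker's own mean and standard deviation, then averages them per system; the record format and names used here are assumptions for illustration.

from collections import defaultdict
from statistics import mean, stdev

def standardize_by_worker(records):
    """Convert raw adequacy/fluency scores into per-worker z-scores.

    `records` is an iterable of (worker_id, system_id, raw_score) tuples,
    where raw_score is the worker's slider position (assumed 0-100).
    Each score is standardized by that worker's own mean and standard
    deviation, so that workers who use the scale differently become
    comparable.  Returns a dict mapping system_id -> mean z-score.
    """
    by_worker = defaultdict(list)
    for worker, _, score in records:
        by_worker[worker].append(score)

    # Per-worker mean and standard deviation (need at least two scores).
    stats = {w: (mean(s), stdev(s)) for w, s in by_worker.items() if len(s) > 1}

    by_system = defaultdict(list)
    for worker, system, score in records:
        if worker not in stats:
            continue                      # too few scores to standardize
        mu, sigma = stats[worker]
        if sigma == 0:
            continue                      # worker gave identical scores throughout
        by_system[system].append((score - mu) / sigma)

    return {system: mean(z) for system, z in by_system.items()}

# Toy example: three crowd workers scoring two systems.
records = [
    ("w1", "sysA", 80), ("w1", "sysB", 60), ("w1", "sysA", 75),
    ("w2", "sysA", 55), ("w2", "sysB", 30), ("w2", "sysB", 40),
    ("w3", "sysA", 90), ("w3", "sysB", 85), ("w3", "sysA", 95),
]
print(standardize_by_worker(records))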

Information

Type: Articles
Copyright © Cambridge University Press 2015

Fig. 1. Screenshot of the adequacy assessment interface, as presented to an AMT worker. All of the text is presented as an image. The slider is initially centered; workers move it to the left or right in reaction to the question. No scores or numeric information are available to the assessor.

Fig. 2. Screenshot of the fluency assessment interface, as presented to an AMT worker. Many of the details are the same as for the adequacy assessment shown in Figure 1.

Table 1. Human Intelligence Task (HIT) approval and rejection in experiments

Table 2. Numbers of workers and translations, before and after quality control (broken down by Assumption A alone, and by Assumptions A and B combined)
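The quality control summarized in this table rests on workers being consistent with their own earlier judgments, as described in the abstract. A minimal sketch of one possible self-consistency filter is shown below; it assumes each worker scores some translations twice and keeps workers whose two passes correlate. The Pearson correlation, the threshold, and the data layout are illustrative assumptions, not the paper's actual Assumptions A and B.

from scipy.stats import pearsonr

def consistent_workers(repeat_scores, min_r=0.5):
    """Keep workers whose two passes over repeated translations agree.

    `repeat_scores` maps worker_id -> list of (first_score, second_score)
    pairs, one pair per translation that worker scored twice.  A worker is
    retained if the Pearson correlation between the two passes reaches
    `min_r`; both the repeat design and the threshold are illustrative.
    """
    kept = set()
    for worker, pairs in repeat_scores.items():
        if len(pairs) < 3:
            continue                      # too few repeats to measure consistency
        first, second = zip(*pairs)
        r, _ = pearsonr(first, second)    # nan (constant input) fails the test below
        if r >= min_r:
            kept.add(worker)
    return kept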

Fig. 3. Spanish-to-English significance test outcomes for each method of human evaluation. Colored cells indicate that the scores of the row i system are significantly greater than those of the column j system. The average number of judgments per system is shown in parentheses. The top row shows the official results from WMT-12; the bottom row shows the results based on our method.

Table 3. Average time per assessment (in seconds) for fluency and adequacy direct estimate assessments and for WMT-12 relative preference assessments

Table 4. Spanish-to-English mean human adequacy and fluency scores (‘z’ is the mean standardized z-score, and ‘n’ is the total number of judgments for that system after quality filtering is applied)

Table 5. English-to-Spanish mean human adequacy and fluency scores (‘z’ is the mean standardized z-score, and ‘n’ is the total number of judgments for that system after quality filtering is applied)

Fig. 4. English-to-Spanish significance test outcomes for each method of human evaluation. Colored cells indicate that the scores of the row i system are significantly greater than those of the column j system. The average number of judgments per system is shown in parentheses. The top row shows the official results from WMT-12; the bottom row shows the results based on our method.

Table 6. Proportions of significant differences between system pairs identified at different significance thresholds, using the WMT-12 relative preference judgments and the new direct estimate method
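As a rough illustration of how proportions of this kind might be computed from direct estimate scores, the sketch below compares every pair of systems with a two-sided Mann-Whitney (Wilcoxon rank-sum) test from SciPy and reports the fraction of pairs whose p-value falls below each threshold. The choice of test, the thresholds, and the input format are illustrative assumptions and may not match the paper's exact procedure.

from itertools import combinations
from scipy.stats import mannwhitneyu

def significant_pair_proportions(scores_by_system, thresholds=(0.10, 0.05, 0.01)):
    """Proportion of system pairs significantly different at each threshold.

    `scores_by_system` maps a system name to its list of (standardized)
    human scores.  Every unordered pair of systems is compared with a
    two-sided Mann-Whitney U test; a pair counts as significant at a
    threshold if its p-value is below that threshold.
    """
    pairs = list(combinations(scores_by_system, 2))
    p_values = [
        mannwhitneyu(scores_by_system[a], scores_by_system[b],
                     alternative="two-sided").pvalue
        for a, b in pairs
    ]
    return {t: sum(p < t for p in p_values) / len(pairs) for t in thresholds}

# Toy example with fabricated standardized scores for three systems.
scores = {
    "sysA": [0.6, 0.2, 0.9, 0.4, 0.7],
    "sysB": [-0.1, 0.3, 0.0, 0.2, -0.4],
    "sysC": [0.1, 0.5, -0.2, 0.3, 0.0],
}
print(significant_pair_proportions(scores))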

Fig. 5. Significant differences in standardized adequacy scores between pairs of systems, for increasing numbers of judgments per system sampled according to earliest HIT submission time, for the twelve Spanish-to-English WMT-12 systems. These four heat maps can be directly compared with the lower-left heat map in Figure 3, which is constructed using an average of 1,280 judgments per system.