
Can machine translation systems be evaluated by the crowd alone

  • YVETTE GRAHAM (a1) (a2), TIMOTHY BALDWIN (a1), ALISTAIR MOFFAT (a1) and JUSTIN ZOBEL (a1)

Abstract

Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality that allows individual workers to develop their own assessment strategy. Agreement with experts is no longer required; instead, a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others, in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined from smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of the Workshop on Statistical Machine Translation shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. A comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially greater ability to identify significant differences between translation systems.
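The abstract compresses two mechanisms that can be made concrete: filtering workers by self-consistency rather than by agreement with experts, and testing for significant differences between systems directly on pooled quality estimates. The Python sketch below is one plausible instantiation consistent with the abstract, not the paper's exact procedure: the use of degraded repeat items for the consistency check, the per-worker z-standardization, and all function names, data layouts, and thresholds are illustrative assumptions.

# Hypothetical sketch of consistency-based worker filtering and
# direct-estimate significance testing; names and thresholds are
# illustrative, not the paper's exact setup.
from collections import defaultdict

import numpy as np
from scipy import stats


def reliable_workers(scores_orig, scores_degraded, alpha=0.05):
    """Keep a worker if their scores for original translations are
    significantly higher than their scores for degraded copies of the
    same translations (paired, one-sided Wilcoxon signed-rank test).

    scores_orig / scores_degraded: dict worker_id -> list of scores,
    aligned so position i in both lists refers to the same sentence.
    """
    kept = set()
    for worker in scores_orig:
        orig = np.asarray(scores_orig[worker], dtype=float)
        degr = np.asarray(scores_degraded[worker], dtype=float)
        # A consistent worker reliably scores the intact translation
        # above its degraded copy.
        _, p = stats.wilcoxon(orig, degr, alternative="greater")
        if p < alpha:
            kept.add(worker)
    return kept


def standardize_by_worker(assessments):
    """Convert each worker's raw 0-100 direct estimates into z-scores,
    removing per-worker differences in scale and leniency.

    assessments: list of (worker_id, system_id, score) tuples.
    Returns dict system_id -> list of standardized scores.
    """
    by_worker = defaultdict(list)
    for worker, _, score in assessments:
        by_worker[worker].append(score)
    mean = {w: np.mean(s) for w, s in by_worker.items()}
    # Guard against zero variance when a worker gives identical scores.
    std = {w: np.std(s) or 1.0 for w, s in by_worker.items()}

    by_system = defaultdict(list)
    for worker, system, score in assessments:
        by_system[system].append((score - mean[worker]) / std[worker])
    return by_system


def systems_differ(scores_a, scores_b, alpha=0.05):
    """Two-sided rank-sum test between two systems' score samples."""
    _, p = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    return p < alpha

Standardizing within workers before pooling is what lets each worker keep an idiosyncratic scoring strategy while their scores remain comparable across the crowd; the rank-sum test then operates on a single pooled sample per system, which is why significance can be established from fewer assessments than pairwise relative preferences require.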




