
A survey of 25 years of evaluation

Published online by Cambridge University Press: 19 July 2019

Kenneth Ward Church*
Affiliation:
Baidu, Sunnyvale, CA 94089, USA
Joel Hestness
Affiliation:
Cerebras Systems, Los Altos, CA 94022, USA
*Corresponding author. Email: kenneth.ward.church@gmail.com

Abstract

Evaluation was not a thing when the first author was a graduate student in the late 1970s. There was an Artificial Intelligence (AI) boom then, but that boom was quickly followed by a bust and a long AI Winter. Charles Wayne restarted funding in the mid-1980s by emphasizing evaluation. No other sort of program could have been funded at the time, at least in America. His program was so successful that these days, shared tasks and leaderboards have become commonplace in speech and language (and Vision and Machine Learning). It is hard to remember that evaluation was a tough sell 25 years ago. That said, we may be a bit too satisfied with the current state of the art. This paper will survey considerations from other fields, such as reliability and validity from psychology and generalization from systems. There has been a trend for publications to report better and better numbers, but what do these numbers mean? Sometimes the numbers are too good to be true, and sometimes the truth is better than the numbers. It is one thing for an evaluation to fail to find a difference between man and machine, and quite another thing to pass the Turing Test. As Feynman said, “the first principle is that you must not fool yourself–and you are the easiest person to fool.”

Information

Type
Emerging Trends
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© Cambridge University Press 2019

Table 1. Some popular metrics


Table 2. Some popular tasks/benchmarks


Figure 1. There may not be a single unique correct label. Candidate labels: baseball cap, cap, green hat, hat, and head. Can you guess which one is in the gold standard?


Figure 2. Thirty years of progress in speech recognition.