Emerging trends: evaluating general purpose foundation models

Published online by Cambridge University Press:  12 December 2024

Kenneth Ward Church*
Affiliation:
Northeastern University, Boston, MA 02139, USA
Omar Alonso
Affiliation:
Amazon, Palo Alto, CA, USA
*Corresponding author: Kenneth Ward Church; Email: k.church@northeastern.edu

Abstract

We suggest that foundation models are general purpose solutions similar to general purpose programmable microprocessors, where fine-tuning and prompt-engineering are analogous to coding for microprocessors. Evaluating general purpose solutions is not like hypothesis testing. We want to know how well the machine will perform on an unknown program with unknown inputs for unknown users with unknown budgets and unknown utility functions. This paper is based on an invited talk by John Mashey, “Lessons from SPEC,” at an ACL-2021 workshop on benchmarking. Mashey started by describing the Standard Performance Evaluation Corporation (SPEC), a benchmark that has had more impact than benchmarks in our field because SPEC addresses an important commercial question: which CPU should I buy? In addition, SPEC can be interpreted to show that CPUs are 50,000 times faster than they were 40 years ago. It is remarkable that we can make such statements without specifying the program, users, task, dataset, etc. It would be desirable to make quantitative statements about improvements of general purpose foundation models over years/decades without specifying tasks, datasets, use cases, etc.

Type
Emerging Trends
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

Table 1. It is easy to show that deep nets are becoming larger and larger over time (Church 2022), but harder to make quantitative statements about improvements in quality over time. Can we quantify the impact of this progress for typical customers?

Table 2. SPECRatios have grown by a factor of 50,000 over 40 years
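
To put the factor of 50,000 over 40 years in more familiar terms, it can be converted into an equivalent yearly rate. The sketch below assumes steady compound growth, which is a simplification introduced here and not a claim from the paper; the function names are hypothetical. Under that assumption, the figures correspond to roughly 31% improvement per year.

```python
import math


def annual_growth_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier that compounds to total_factor over the given years.

    Assumes constant compound growth (an illustrative simplification).
    """
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor: float, years: float) -> float:
    """Years needed to double at the constant rate implied above."""
    return years * math.log(2) / math.log(total_factor)


# Figures taken from the caption: a factor of 50,000 over 40 years.
factor = annual_growth_factor(50_000, 40)
doubling = doubling_time_years(50_000, 40)
print(f"{factor:.3f}x per year, doubling about every {doubling:.2f} years")
```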

Table 3. SPEC CINT92 suite (from Table 2 in Giladi and Ahituv (1995))

Table 4. SuperGLUE results from Table 9 in Sun et al. (2021). Humans are better than machines on 4 of 8 tasks. If the outlier in yellow is dropped, Human is ahead

Table 5. Results from Table 4, relative to human performance. GM and GMrobust are geometric means with and without the outlier in yellow
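
The caption above refers to a geometric mean computed with and without an outlier. A minimal sketch of that calculation follows. The ratio values are hypothetical placeholders and not the paper's data, the ratio direction (machine score divided by human score) is an assumption, and the function names are invented for illustration.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def robust_geometric_mean(ratios, outlier_index):
    """Geometric mean after dropping the entry at outlier_index."""
    kept = [r for i, r in enumerate(ratios) if i != outlier_index]
    return geometric_mean(kept)


# Hypothetical per-task ratios (assumed machine / human); NOT from the paper.
# The last entry plays the role of the outlier.
ratios = [1.02, 0.98, 1.01, 0.97, 1.03, 0.99, 1.00, 2.00]

gm = geometric_mean(ratios)
gm_robust = robust_geometric_mean(ratios, outlier_index=7)
print(f"GM = {gm:.3f}, GMrobust = {gm_robust:.3f}")
```

With these placeholder values, a single large ratio pulls the overall geometric mean above 1, while the mean without it sits very close to 1. This illustrates how one task can change which side appears to be ahead.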