
Chapter 1 - A Brief Recounting of the First Four Millennia of Testing

Published online by Cambridge University Press:  10 October 2025

Daniel H. Robinson
Affiliation:
University of Texas, Arlington

Summary

We trace the origins of testing to its civil service roots in Xia Dynasty China 4,000 years ago, to the Middle East in Biblical times, and to the monumental changes in psychometrics in the latter half of the twentieth century. The early twentieth century witnessed the birth of the multiple-choice test and a focus on measuring cognitive ability rather than knowledge of content – influenced greatly by IQ and US Army placement testing. Multiple-choice tests provided an objectivity in scoring that had previously eluded the standard essays used in college entrance exams. The field of testing began to take notice of measurement errors and strove to minimize them. Computerized Adaptive Tests (CATs) were developed to measure a person’s ability accurately with the fewest possible items. The future advancement of testing depends on a continued process of experimentation to determine what improves matters and what does not.

Information

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2025

Chapter 1 A Brief Recounting of the First Four Millennia of Testing

In this chapter, and in others to follow, we begin with a discussion of history.Footnote 1 We do this for two reasons. First, because we agree with Maya Angelou that “history, despite its wrenching pain, cannot be unlived, but if faced with courage, need not be lived again.” And second, to convey some of the validity of the theory and methods of testing that derives simply by virtue of the weight of experience. Tests were not devised last week. No, they have been actively under development for thousands of years and used to solve the very real problems that are associated with the fair and efficient utilization of human resources. These problems, in complex combinations, persist to this day. In the many years of testing’s existence, uncounted issues arose and had to be thoughtfully dealt with. Had those issues not been successfully resolved, testing could not have survived. Four of these issues remain particularly relevant today.

  1. Questions of generalizability – what bearing do the results of this test taken today have on a somewhat different problem at a somewhat different time?

  2. Questions of reliability – how similar are the results generated by the test this morning to those generated this afternoon? Tomorrow? Next week? (See Section 2.4 for an expanded discussion of these issues.)

  3. Questions of validity – is the test generating evidence related to the claims we want to make?

  4. Questions of fairness – are the decisions made from the test’s results influenced by factors (e.g., race, ethnicity, age, gender) unrelated to the competencies of relevance?

Some of the pathways around these issues have become overgrown with disuse, others have been broadened with fashion, but all have been carefully examined. We do not claim that new problems cannot arise, because, for example, of new technologies or modern sensitivities born of our increasingly heterogeneous society; nor do we claim that the solutions of the past cannot be improved upon. We only claim that a knowledge of history can reduce the necessity of revisiting old problems that have long been resolved.

There is nothing new in the world except the history you do not know.

Harry S. Truman (Rushay, 2009)

1.1 The Fundamental Tenet of Testing

Serendipitously, as we sat down to write this book in 2023, we noted that, exactly 4,000 years ago (on Wu Yue Wu Hao in the 93rd year of the Xia Dynasty – August 5, 1977 BCE), the signal event occurred that marked the beginning of testing.Footnote 2 It was on this date that an anonymous functionary, tasked by his emperor to find a path to improve the way that government officials were chosen, was struck by what has since been recognized as

the fundamental tenet of testing:

A small sample of behavior, measured under controlled circumstances, could be predictive of a broader set of behaviors in uncontrolled circumstances.

This was the intellectual beginning of a program of testing in China that has continued, with only one minor interruption,Footnote 3 until the present day. The Chinese testing program imagined by that functionary, his very bones now long dust, was essentially a civil service examination program, and many of its performance-based components and procedures bore a remarkable resemblance to most examination programs that have been developed since. The Chinese tests were designed to cull candidates for public office; job-sample tests were used, assessing proficiency in archery, arithmetic, horsemanship, music, and writing, as well as skill in the rites and ceremonies of public and social life. Moreover, because objectivity was required, candidates’ names were concealed to ensure anonymity; examiners sometimes went so far as to have the answers redrafted by a scribe to hide the handwriting. Tests were often read by two independent examiners, with a third brought in to adjudicate differences. Test conditions were as uniform as could be managed – proctors watched over the exams, which were given in special examination halls that were large, permanent structures consisting of hundreds of small cells. The examination procedures were so rigorous that candidates sometimes died during the course of the exam (Teng, 1943).

The pathway connecting ancient China to the twenty-first century is remarkably direct. The Chinese testing program became the model that the British used in their design of the Indian Civil Service Exam system, installed in 1833 during the Raj. This, in turn, was the template that Senator Charles Sumner and Representative Thomas Jenckes used in designing the Civil Service Act passed by the U.S. Congress in January 1883.

1.2 Tracing the Shibboleth

Despite this clear historical pathway, testing did not jump directly from China to India. There is overwhelming evidence that this fundamental idea spread throughout the known world. About 600 years after the Chinese testing program began, we know that it had spread at least as far as the Middle East.

In the Bible, Judges 12:4–6, during the time of David, we are told how, after the Gileadites captured the fords of the Jordan leading to Ephraim, they developed a one-item test to determine the tribe to which the survivors of the battle belonged. “If a survivor of Ephraim said, ‘Let me cross over,’ the men of Gilead would ask him, ‘Are you an Ephraimite?’ If he replied, ‘No,’ they said, ‘All right, say “Shibboleth”’. If he said, ‘Sibboleth,’ because he could not pronounce the word correctly, they seized and killed him.”

Forty-two thousand Ephraimites were killed at that time. This total might have included some Gileadites with a lisp, but we will never know, for there is no record of any validity study being performed.

There have, however, been two fundamental changes in testing, especially over the past century or so. The first grew from the realization that the subjectivity involved in scoring answers that were constructed by examinees yielded enormous error. And so gradually there has been a shift toward constructing items that could be scored objectively. Some of the evidence supporting this shift was already apparent in ancient China. The second shift was in test construction paradigms, which went from considering the test as a single entity (where the examinee’s score was usually represented as the proportion of the presented items that were answered correctly) to a much more flexible form in which the test is drawn from a large pool of components – some of which are selected as needed to estimate the examinee’s ability in some optimal fashion. Thus, the individual test item, or sometimes a fixed combination of items – a testlet – became the fungible unit of the test.

This shift in test structure was captured by three signal events in the four decades between 1950 and 1990. The first was the 1950 publication of Harold Gulliksen’s Theory of Mental Tests, which provided the machinery necessary for rigorous scoring of tests using what has become known as True Score Theory. The unit of measure was the test itself, and so the proportion of the test that was answered correctly characterized performance. Different forms of the test were equated so that the scores on different forms could be compared.
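At its core, true score theory models an observed score as the sum of a stable true score and a random error,

$$X = T + E,$$

with the test’s reliability defined as the proportion of observed-score variance that is true-score variance, $\sigma^2_T / \sigma^2_X$ – that is, the fraction of score variance that is not noise.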

The second breakthrough was the 1968 publication of Fred Lord and Mel Novick’s Statistical Theories of Mental Test Scores. This signaled a new era in which tests could be built that were customized for each examinee and yet still be standardized. It put a capstone on true score theory while simultaneously providing a rigorous statement of a new approach in which the test item becomes the fungible unit of measurement – item response theory (IRT). IRT was crucial if the goal of efficiently creating individualized tests for each examinee was to be realized. Such a dream was instigated by the third event, the growing power and availability of high-speed computing. A combination of individually calibrated test items, a statistical theory that allowed us to calculate comparable scores for tests that might be made up of wildly different mixtures of items, and a computer that could construct such tests on-the-fly has yielded the most modern realization of what was begun 4,000 years ago by that anonymous Chinese functionary. IRT made it possible to use a large pool of items from which one could sample to make up any particular individual’s test. This portended a major improvement in test security.
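To give a flavor of how IRT makes the item the unit of measurement, consider one of its simplest and most widely used formulations, the two-parameter logistic model, in which the probability that an examinee of ability $\theta$ answers item $i$ correctly is

$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}},$$

where $b_i$ is the item’s difficulty and $a_i$ its discrimination. Because every item carries its own calibrated parameters, responses to any mixture of items can be combined into an estimate of the same underlying $\theta$, which is exactly what allows wildly different collections of items to yield comparable scores.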

A further revolutionary change, whose significance was recognized only much later, ironically took place decades earlier, in 1917 in Vineland, New Jersey. It was at that meeting that the focus of exams shifted abruptly from testing knowledge of specific topics to testing the cognitive ability of the examinee. This remarkable change is explored in greater detail in Chapter 3.

1.3 The Need for a Second Watch

The expression “A man with one watch knows what time it is, but a man with two watches is never sure” helps us understand the evolution of testing. Newton gave the world its first watch, and for a while we knew the time; but eventually Einstein and Heisenberg gave us a second watch, and we haven’t been sure since. Sometimes this is interpreted as an argument for ignorance, which is quite the opposite of our point, for science advances when we have some notion of our own uncertainty. It is only when we have multiple measures of the same thing that we become aware of that uncertainty, and only those multiple measures allow us to assess it.

Written exams required an expert grader to assess the quality of the answer, and the examinees’ scores were typically a summary of those assessments. We can only imagine the chain of events that led the West to revert to the Chinese practice of having multiple graders. Perhaps some examinees complained about their scores and, when their exams were rescored, a different result ensued; or some exams were accidentally scored twice, and it was discovered that the results were not the same. We don’t know what the key motivating events were, but certainly it was clear by the beginning of the twentieth century that fair scoring of exams required a second watch.

One of the instigators of this movement was the work of Alfred Binet and Theodore Simon who, in 1905, published their eponymous test to measure the intelligence of children. Though the individually administered test was cumbersome, it was wildly successful. A decade later this success led Stanford University’s Lewis Terman to develop a less cumbersome version that could be both mass-administered and objectively scored.

Terman’s success, coupled with the need for the efficient classification of soldiers for the First World War, drove the formation of Robert Yerkes’ Vineland Committee in 1917, which, within just a week, developed the modern multi-part multiple-choice exam. Its eight sections were designed to be administered in about an hour, and among them were such familiar item types as arithmetic reasoning, synonym–antonyms, and verbal analogies. The test forms they prepared, then called “the Army Alpha” (to distinguish it from “Army Beta,” the nonverbal version for illiterate and non-English-speaking examinees), provided a testing model that has been followed widely, with only modest differences, ever since. The exam could be administered quickly and scored objectively and automatically using a stencil.

In 1909 the College Board, through its fledgling college entrance exams, made a remarkable discovery. It found that the variation observed in the scoring of a single essay over many graders was about the same as the variation in scores across the essays written by many different examinees. It concluded that this was unacceptably inaccurate and that graders needed to be trained better. Almost a century later, in a study of licensing exam results for California teachers, the renowned psychometrician Darrell Bock (1991) reported that “the variance component due to raters was equal to the variance component due to examinees.” It may be that this century-old problem is still due to insufficient training of graders, but more likely it reflects the subjectivity inherent in the task of grading itself.
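Findings like these are naturally expressed in terms of variance components. In a simplified additive model (a generic sketch of the usual setup, not the particular analysis Bock reported), the score that rater $r$ assigns to examinee $e$ can be written as

$$X_{er} = \mu + \nu_e + \nu_r + \varepsilon_{er},$$

so that the observed-score variance decomposes into $\sigma^2_e + \sigma^2_r + \sigma^2_\varepsilon$. Bock’s observation amounts to the rater component $\sigma^2_r$ being as large as the examinee component $\sigma^2_e$, even though only $\sigma^2_e$ reflects the proficiency the test is meant to measure.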

As testing became more widespread, and as experience accustomed our eyes to the Byzantine dimness surrounding its use, the necessity of multiple graders added costs to testing. But balanced against these costs were the savings derived from the improved manpower utilization due to testing. The spreading belief that expanding testing beyond its relatively narrow confines would improve industrial efficiency led to attempts to streamline the practices of testing without compromising its efficacy. Twentieth-century psychometrics realized that fair scoring of exams required a second watch.

1.4 The Critical Importance of Objective Scoring

The lessons learned from the variability always present in human scoring led to a shift away from traditional test items like essays, which required a constructed response on the part of the examinee, to item formats sharing the objective nature of the multiple-choice item.

The need for economically practical mass administration of tests that faced the U.S. military during the First World War gave rise to a huge increase in the development of multiple-choice items. Then, as now, there was concern that such a format was incapable of testing certain proficiencies that were crucial. Sometimes these concerns were well founded, but surprisingly often it turned out that the multiple-choice option worked better than even its most ardent supporters could have hoped. Why?

The answer draws on the psychometric/statistical developments described in Gulliksen’s foundational book and stems from the basic fact that scores derived from any test format are imperfect. They contain errors. These errors fall principally into two broad categories.

  (i) The estimates fluctuate symmetrically around their true values, due to variations in the examinee and the scorer. On some days we perform better than on others. As mentioned earlier, variation among raters of essays has always been substantial and seems relatively insensitive to improved rater training. In addition, scores also fluctuate due to the specific realization of the test item; if we want to study writing and so ask for an essay on Kant’s epistemology, we are likely to get less fluid responses than if we asked for one on ‘My Summer Vacation.’

  (ii) The estimate can also contain some bias if the item used is measuring a proficiency that is not exactly what we are specifically concerned about. Suppose, for example, we are interested in measuring writing ability and instead of testing it in the obvious way, by asking the examinee to write an essay, we use multiple-choice items designed to measure general verbal ability (e.g., items involving verbal analogies, antonym/synonym interpretation, sentence completion). We are measuring something related to writing ability, but not writing ability specifically.
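In statistical terms these two components combine additively: if a test score $\hat{T}$ is used to estimate a true proficiency $T$, its expected squared error decomposes as

$$E\big[(\hat{T} - T)^2\big] = \operatorname{Var}(\hat{T}) + \big[\operatorname{Bias}(\hat{T})\big]^2,$$

with the random fluctuations of (i) contributing the variance term and the systematic mismatch of (ii) contributing the bias term.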

The test that is most predictive of future behavior is one that minimizes the sum of both kinds of errors. What the experience gained in the first half of the twentieth century showed (and what has been reconfirmed many times since) is that, for a remarkably wide range of proficiencies, multiple-choice items were superior to their much older cousin, the essay. This surprising result arises because the bias that multiple-choice items might introduce was much smaller than the errors introduced by subjective scoring and the limitations in breadth of subject matter coverage that are the unavoidable concomitants of essay-style exams. In one study (Wainer et al., 1994), examinees were given a test made up of three half-hour sections. Two of the sections required the writing of an essay; the third comprised forty multiple-choice verbal ability items. The essays were each scored by two raters (and sometimes a third to adjudicate any large disagreements), and the multiple-choice section was scored automatically. It was found that the score on the multiple-choice section was more highly correlated with either essay score than the two essay scores were with one another. What this means practically is that if we want to predict performance on a future essay test, we could do so more accurately with a multiple-choice test than we could with a parallel essay test. Some argued that 30 minutes is too short for a valid essay test – perhaps, but if each essay were allocated an hour, a one-hour multiple-choice test would also improve, probably more than the essays would.Footnote 4

It is worth emphasizing that for any fixed amount of testing time there is an advantage to asking many small questions over asking very few large ones. In the latter case an unfortunate choice of question can yield an equally unfortunate outcome (“I knew just about everything in that subject except that one small topic”). In the former case, there is still the possibility of such unfortunate choices, but through the larger sampling of topics, the effect of such bad luck is ameliorated considerably.
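The arithmetic behind this amelioration is the familiar one of sampling: treating the questions as a (roughly) independent sample of $n$ topics from the domain, the luck-of-the-draw error in the resulting score shrinks on the order of $1/\sqrt{n}$, so a test of forty short questions is, other things being equal, far less hostage to an unlucky draw than a test of two long ones.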

The shift from an essay to a multiple-choice format yielded enormous benefits: the material covered by the test could be vastly expanded; the test forms were scored much more accurately, and thus the inferences made from test scores became more valid; the scores themselves were more reliable; and the exams were much cheaper to administer.

Military testing, which had played such an important role in shifting the testing paradigm to the multiple-choice format in 1917, made possible another breakthrough 50 years later, one that developed from the possibility of computerizing test administration. The principal test that the U.S. military gives to sort recruits into various training programs is the Armed Services Vocational Aptitude Battery (the ASVAB). It was a long test with ten parts and required two days to administer. In the 1960s and 1970s, in an effort to reduce this time and to help control other problems, the Office of Naval Research funded research, first by Educational Testing Service’s (ETS) Fred Lord and later by Minnesota’s David Weiss. What they came up with was a way to meld the strengths of individualized assessment with the standardization and reliability of modern multiple-choice tests.

The aim was to construct a practical and standardized equivalent of a wise old examiner who would sit with a candidate for an extended period of time and tap into all aspects of the candidate’s skills and knowledge, asking neither more nor fewer questions than required for the accuracy of the inferences planned. This is especially important for tests used for diagnostic purposes. This remarkable goal was, in fact, accomplished by presenting previously calibrated individual test items to the examinees on a computer. After each item was presented, it was scored instantaneously and the computer selected another item. If the earlier item was answered incorrectly, an easier one was presented; if it was answered correctly, it was followed by a more difficult one. In this way the test would focus quickly on the ability level of the examinee. This allowed the test to yield the same precision as a typical fixed-format test while using only about half the items. It can also cycle through various subtopics as required for diagnostic testing. Such tests are called computerized adaptive tests, or CATs for short.
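The adaptive logic just described is simple enough to sketch in a few lines of code. The sketch below is an illustrative simplification, not any operational CAT: it assumes a Rasch-style item model, a hypothetical item bank indexed only by difficulty, and a crude step-and-shrink ability update, whereas real systems rely on fully calibrated item parameters and maximum-information (or Bayesian) item selection.

    import math
    import random

    def prob_correct(theta, difficulty):
        """Rasch-style model: chance that an examinee of ability theta answers correctly."""
        return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

    def run_cat(item_bank, answer_item, max_items=20):
        """Administer up to max_items from item_bank (a list of item difficulties),
        adapting to the examinee and returning the final ability estimate."""
        remaining = list(item_bank)
        theta, step = 0.0, 1.0                      # initial ability estimate and update step
        for _ in range(min(max_items, len(remaining))):
            # choose the unused item whose difficulty is closest to the current estimate
            item = min(remaining, key=lambda d: abs(d - theta))
            remaining.remove(item)
            correct = answer_item(item)             # examinee's response: True or False
            theta += step if correct else -step     # harder next if right, easier if wrong
            step *= 0.8                             # shrink the steps so the estimate settles
        return theta

    # Example: a simulated examinee whose true ability is 1.2
    bank = [d / 2.0 for d in range(-6, 7)]          # difficulties from -3.0 to +3.0
    estimate = run_cat(bank, lambda d: random.random() < prob_correct(1.2, d))
    print(round(estimate, 2))

With calibrated items and a proper statistical update in place of the step-and-shrink rule, this same loop is what allows a CAT to match the precision of a fixed-form test with roughly half the items.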

The shift was formally chronicled by the 1990 publication of the definitive text Computerized Adaptive Testing (Wainer, 1990), which laid out and illustrated how a test could be individually constructed to suit each examinee while also being standardized. Now scores obtained from exams built from very different sets of items could be compared directly.

Because Wainer’s book also laid out in detail how such exams could be built and scored, it motivated several testing organizations to try out this new technology. What many discovered was that building a CAT requires a great deal of work. Huge banks of items must be constructed, pre-tested, and calibrated so that the item administration algorithm can select them appropriately. In the first rush of enthusiasm, a fair number of large-scale tests had been successfully converted into CAT format (e.g., the ASVAB) and remain so to this day. Others (e.g., the Graduate Record Exam (GRE)) were transformed into CATs only to be changed back after it was found that the extra expense involved was not justified.Footnote 5

CAT remains an attractive option if the use of the test includes instructional diagnosis. For this purpose, the ability to efficiently isolate precisely those areas that need remediation is likely invaluable. But the evolution of testing is not done yet, for as technology yields new challenges to security, it also provides us with new tools to meet them.

One vexing problem associated with computerized administration is the practical requirement that tests be administered continuously. It was just too expensive to use the old tactic of gathering hundreds of examinees in a gymnasium three or four times a year and passing out #2 pencils; while #2 pencils are cheap, laptop computers are not. So instead, testing centers are set up and examinees are scheduled to come to them a few at a time. Such an approach has substantial extra costs: maintaining testing centers is expensive, and the continuous testing required by computerized administration yields security challenges. Giving a test continuously means that the items used on the tests must be changed very often; consequently, there must be a lot of them.

But there is no going back: modern tests that are computer administered include many item types that cannot be administered in a paper-and-pencil format (e.g., ones where an examinee listens to a recorded heartbeat and must make a diagnosis). Technology may again provide a solution. Tablet computers have become both capable and relatively inexpensive. We may soon be able to return to the old style of a few mass administrations in which the twenty-first-century #2 pencil is a specially made tablet.

This is but one of the likely future directions for exams. What is critical is an orienting attitude of questioning the status quo to power the drive toward improvement through change. This attitude melds perfectly with the tenets of modern quality control (Deming, 2012). Whenever we have a complex system, whether it is a manufacturing process or the licensing of physicians, it is well established that the worst way to improve matters is to convene a blue-ribbon panel to lay out the character of the future of exams. This doesn’t work reliably because the task is too difficult.Footnote 6 What does work is the institutionalization of a constant process of experimentation in which small changes are made and evaluated. If a change improves matters, make a bigger change in the same direction; if it makes matters worse, reverse field and try something else. In this way the process gradually moves toward optimization, so that when the future arrives, the testing process of the future is waiting for it.

Footnotes

1 We do this despite Andrew Moravcik’s (2017) warning of the futility of this task: “Those who write history are doomed to watch others repeat it.” Hope springs eternal.

2 This recounting tells of a place far away and a time long ago, before there was reliable written Chinese history. Some of the details are not fully documented.

3 During the 1911 revolution that overthrew the Qing Dynasty and its Emperor, the national testing program was temporarily suspended. It was resumed as soon as peace was restored.

4 The unreliability of judges is widespread, but not universal. It is certainly true for rating such outcomes as essays, but for very narrowly defined tasks, expert judgment can be workable. For example, when orthopedic surgeons judged the severity of hip fractures, it was found that only 5 percent of the variability of responses was due to differences in opinion among raters, while 95 percent was due to variability among X-rays (Baldwin et al., 2009).

5 The two competing costs are those associated with building and administering the test: the costs of writing, pretesting, and calibrating a large number of items that span all of the subject areas and all of the difficulty levels; and the costs of individual administration. Balanced against these costs are the savings in examinee time, since an adaptive test takes roughly half the time of a fixed-format test of equal accuracy. For the military it meant that the test would shift from being a two-day affair, with all of the housing and other costs associated with an overnight stay, to a one-day test. In addition, there was the saving of the opportunity costs for the second day. In the end it was determined that the personnel savings offset the testing costs. Conversely, the examinee costs for the GRE were not borne by the testing organization that administered it, and so the modest savings achieved did not justify the expenses incurred. It is likely that, if the CAT-GRE could have been mass administered, rather than administered continuously, the financial calculus would have been different.

6 In this obscure footnote we record for posterity a wonderful story about this. Around 1990 a large blue-ribbon committee was formed at ETS for a project called “TOEFL 2000,” whose goal was to design a future language test that would accommodate both the recent developments in psycholinguistics and the likely technological tools that would be available a decade hence. The committee’s budget was several million dollars because it was anticipated that the planning process would use many hours of many expensive professionals. At the initial meeting a fair amount of time was spent going around the (large) table soliciting advice as to how they should proceed to make best use of the money. Bob Mislevy, one of the world’s leading psychometricians, suggested that “we should buy a helicopter.” This provoked nervous laughter, but Bob wasn’t laughing – he was deadly serious. He explained that if they continued trying to ascertain from current knowledge the test of the future, they would continue meeting until the money ran out and, when the year 2000 arrived, they would have nothing to show for it. But, if they followed his advice, at least they would have a helicopter. For a long time thereafter, whenever someone suggested a blue-ribbon panel to solve some difficult problem, it was greeted with “better to buy a helicopter.”
