Horace Mann can be credited with beginning accountability and high-stakes testing in K-12 education in the 1800s. This was also the beginning of test fraud. Terman later developed the National Intelligence Tests for K-12, followed by the Stanford Achievement Tests and the Iowa Test of Basic Skills. Results of such tests have been used, unwisely, to drive school reform efforts. The National Assessment of Educational Progress (NAEP), the Moynihan and Coleman reports of the 1960s, and A Nation at Risk in the 1980s continue to drive educational reform efforts today, such as No Child Left Behind, the Every Student Succeeds Act, and Race to the Top. Using test scores to make decisions about hiring and firing teachers and administrators is ill-advised. Reform efforts over the past 60 years have not reduced the achievement gap. K-12 tests reveal societal, not educational, shortcomings.
The ongoing development of a Swiss Health Data Space (SHDS) presents an opportunity to transform health delivery and care by enabling large-scale secondary health research. The successful implementation of the SHDS depends on its trustworthiness, as public trust is closely linked to public participation in data-sharing initiatives. We conducted four focus groups across the German-, French-, and Italian-speaking regions of Switzerland to identify public expectations and requirements related to the attributes that define a trustworthy SHDS. The participants discussed four fictitious case studies addressing: (1) consent management; (2) record linkage via the national social security number; (3) a national data coordination center; and (4) cross-border data exchange. To best inform Swiss policy, we held a panel discussion with patient experts and healthcare professionals to translate the focus group findings into governance and public communication recommendations. Policy recommendations are proposed based on insights from the fictitious case studies discussed with participants, accompanied by guidance on implementation measures that contribute to proactively building trust in the development of the SHDS. Communication recommendations are further provided, highlighting that the success of the SHDS will depend on early and continuous trustworthy public communication efforts that actively engage the Swiss public, address their concerns, and foster support throughout its development. These efforts should rest on a foundational governance approach that meaningfully involves relevant stakeholders and members of the Swiss public, while allocating appropriate responsibility for maintaining the trustworthiness of the SHDS.
The use of test scores as evidence to support the claims made for them requires an understanding of causal inference. We provide a careful discussion of the modern theory of causal inference with numerous evocative illustrations, including an admissions policy at the University of Nebraska, the 1854 London cholera epidemic, and the 1960s decline in SAT scores. We show how evidence drawn from test scores is comparable to credible evidence from other widely accepted sources. Rubin’s model for causal inference is explained, and the importance of manipulation, random assignment, potential outcomes, and a control group is emphasized. The Tennessee Class Size Experiment of the 1980s is one of the best examples of how to measure the effects of a cause. Finally, we show how the size of the causal effect of fracking on earthquakes in Oklahoma can be established from an observational study by mirroring the structure of an experiment. Measuring the size of the causal effects of testing and its alternatives requires data and control. Often, the data are kept hidden to avoid ruining the good with the truth.
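As a brief sketch of the potential-outcomes notation underlying Rubin’s model (a standard textbook formulation, not a quotation from the chapter): each unit $i$ has two potential outcomes, $Y_i(1)$ under treatment and $Y_i(0)$ under control, of which only one can ever be observed; random assignment makes the treated and control groups comparable, so the average causal effect can be estimated from observed group means:
$$
\tau_i = Y_i(1) - Y_i(0), \qquad
\mathrm{ATE} = \mathbb{E}\big[Y(1) - Y(0)\big] = \mathbb{E}[Y \mid T = 1] - \mathbb{E}[Y \mid T = 0] \quad \text{(under random assignment of } T\text{)}.
$$
This unobservability of one potential outcome per unit is why manipulation, random assignment, and a control group are emphasized.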
Many colleges that required SAT or ACT scores before the pandemic suspended them during it. After the dangers of the pandemic subsided, most have not resumed their use. The arguments supporting their continued suspension are based primarily on the fact that such tests, like most other tests, show differences among subgroups (e.g., races). We discuss the costs and benefits of no longer using such test scores in admission decisions. College admission tests were developed in the 1920s to level the playing field and allow more students to qualify for college. Carl Brigham developed the Scholastic Aptitude Test (SAT) in 1926. Soon after, the College Board adopted the SAT. In 1959, the American College Test (ACT) was born. Neither test is biased against minorities; rather, they tend to overpredict minority performance in college. Yet, despite persistent group differences, the sentiment is to discontinue use of these tests. Doing so will place more emphasis on other metrics (e.g., high school GPA) that are less reliable, more subjective, and also prone to group differences. Admitting more students who are less likely to graduate comes with costs.
Although the use of testing has been of remarkable value for millennia, and has improved steadily over the past century, it is now experiencing heightened public dissatisfaction, due partly to concerns about fairness and equity. We discuss some plausible causes for this apparent change in public attitudes. Only about 10% of all colleges and universities now require the ACT or SAT for admission. Fewer states are using tests to measure K-12 student progress and as a requirement for graduation. The major complaint about tests is that they prevent improvement through inclusion. But in reality, testing simply measures this improvement as more groups have been included over the years; musical virtuosos and the world record in the mile run are examples. Admissions testing was first developed to improve the fairness of a system that relied on quotas. Compared with other metrics, tests are the only ones subjected to rigorous evaluation for reliability and validity.
Many professions (e.g., teachers, pilots, air traffic controllers, physicians) require applicants to pass a licensing exam, the principal purpose of which is to protect the public from incompetent practitioners. These exams also sometimes show the same sorts of race and sex differences observed in other test scores. Thus, they too are susceptible to equity criticisms. We discuss the implications of getting rid of such tests, or even just lowering cutoff scores. Medical licensing has been around for over 1,000 years. The U.S. did not start licensing physicians until the late 1800s. Early exams were oral, subject to criticisms about objectivity, and resulted in disaster in West Virginia. Ultimately, the National Board of Medical Examiners was formed, and multiple-choice exams replaced essay exams on the United States Medical Licensing Examination (USMLE). To get into medical school, undergraduates must take the Medical College Admission Test (MCAT). Like other tests, the MCAT reveals race and sex differences. The same is true for tests to license pilots and air traffic controllers. K-12 teacher licensing formally began with the National Teacher Examination (NTE) in 1940.
How far have we come? What strategies are most likely to aid in achieving our goals? What evidence must be gathered to go further? We have focused in the book on how tests provide valuable information when making decisions about whom to admit, whom to hire, whom to license, whom to award scholarships to, and so on. Given limited resources, efficiency in selection is essential. However, tests used for these purposes also reveal race and sex differences that conflict with society’s desire for fairness. How do we make policies and decisions so as to maximize efficiency while also minimizing adverse impact? There is no statistical solution to this problem. We suggest an approach that will get us closer to an acceptable solution than where we currently stand. A first step is to gather all relevant data so that any selection policy can be evaluated with respect to both kinds of errors. Second, make such data publicly available so that all interested parties have access and everything is transparent. As mentioned previously, such data are often withheld for fear of criticism. Third, causal connections between policies and outcomes should be established. Finally, if considerations other than merit are important, those arguments should be made public and modifications examined to measure the impact of policy adjustments.
We trace the origins of testing from its civil service roots in Xia Dynasty China 4,000 years ago, through the Middle East in Biblical times, to the monumental changes in psychometrics in the latter half of the twentieth century. The early twentieth century witnessed the birth of the multiple-choice test and a focus on measuring cognitive ability rather than knowledge of content, influenced greatly by IQ and US Army placement testing. Multiple-choice tests provided an objectivity in scoring that had previously eluded the standard essays used in college entrance exams. The field of testing began to take notice of measurement errors and strove to minimize them. Computerized Adaptive Tests (CAT) were developed to measure a person’s ability accurately with the fewest possible items. The future advancement of testing depends on a continued process of experimentation to determine what improves it and what does not.
Armed services tests have existed for centuries. We focus on the US Armed Services and how the tests used have adapted as the claims made for them, and the needs and purposes they serve, have changed. World War I provided the impetus for the first serious military testing program. An all-star group of psychologists convened in Vineland, New Jersey, and quickly constructed Army Alpha, which became a model for later group-administered, objective, multiple-choice tests. Military testing was the first program to explicitly move from very specialized tests for specific purposes to testing generalized underlying ability. This made such tests suitable for situations not even considered initially. The practice was both widely followed and just as widely disparaged. The AGCT, AFQT, and ASVAB were later versions of this initial test. Army Alpha also influenced the creation of the SAT, ACT, GRE, LSAT, and MCAT. Decisions based on military tests, like decisions based on all tests, can be controversial. In 1965, Project 100,000 lowered the cut score, resulting in thousands of low-scoring men being drafted, many of whom later died fighting in Vietnam.
In both K-12 and higher education, it is common to use test scores in deciding which students receive scholarships and other awards. As with placement decisions, this practice is also controversial due to issues of equity. We discuss the evidence supporting test scores as an aid in making such decisions, including the costs of finding suitable winners, the costs of false positives, and the costs of false negatives. The G.I. Bill and the United Negro College Fund provided scholarships for soldiers returning from World War II. Athletic scholarships have been around since 1952 and show an incredible disparity by race, favoring Black students. In 1955, the National Merit Scholarship (NMS) program was created to support students. The award is not much financially (~$2,500), but other sources of support usually follow students who score high enough to merit the award. Like other tests, the PSAT/NMSQT shows race differences. States have addressed this differently, with some ranking students by district or school rather than by state, resulting in more minorities receiving awards. Evidence suggests that ranking within schools rather than statewide results in students with lower scores receiving awards yet not doing as well academically as others who score higher but do not receive awards. Issues of fairness in testing remain.
National digital ID apps are increasingly gaining popularity globally. As the ways we transact in the world become increasingly digitally mediated, questions need to be asked about how these apps support the inclusion of disabled people. In particular, international instruments, such as the United Nations Convention on the Rights of Persons with Disabilities, spotlight the need for inclusive information and communication technologies. In this paper, we adopt a critical disability studies lens to analyse the workings of state-designed digital IDs, in this case the Singpass app, and what they can tell us about existing ways of designing for digital inclusion. We situate the case of the Singpass app within the rise of global digital transactions and the political-technical infrastructures that shape their accessibility. We analyse the ways Singpass centres disability, the problems it may still entail, and the possible implications for inclusion. At the same time, we uncover the lessons Singpass’s development holds for questions of global digital inclusion.
As digital welfare systems expand in local governments worldwide, understanding their implications is crucial for safeguarding public values like transparency, legitimacy, accountability, and privacy. A lack of political debate on data-driven technologies risks eroding democratic legitimacy by obscuring decision-making and impeding accountability mechanisms. In the Netherlands, political discussion of digital welfare within local governments is surprisingly limited, despite evidence of negative impacts on both frontline professionals and citizens. This study examines the mechanisms that explain whether and how data-driven technologies in the domain of work and income are politically discussed within the municipal government of a large city in the Netherlands, and with what consequences. Using a sequential mixed methods design combining the automated text-analysis software ConText (1.2.0) and the text-analysis software Atlas.ti (9), we analyzed documents and video recordings of municipal council and committee meetings from 2016 to 2023. Our results show that these discussions are rare in the municipal council, occurring primarily in reaction to scandals or to criticism. Two key discursive factors used to justify limited political discussion are: (1) claims of lacking time and knowledge among council members and aldermen, and (2) distancing responsibility and diffusing accountability. This leads to a ‘content chopping’ mechanism, whereby issues are chopped into small pieces of content (for example, technical, ethical, and political aspects) that are spread across separate documents and discussion arenas. This fragmentation can obscure overall coherence and diffuse critical concerns, potentially leading to harmful effects such as dehumanization and stereotyping.
The Pósa–Seymour conjecture determines the minimum degree threshold for forcing the $k$th power of a Hamilton cycle in a graph. After numerous partial results, Komlós, Sárközy, and Szemerédi proved the conjecture for sufficiently large graphs. In this paper, we focus on the analogous problem for digraphs and for oriented graphs. We asymptotically determine the minimum total degree threshold for forcing the square of a Hamilton cycle in a digraph. We also give a conjecture on the corresponding threshold for $k$th powers of a Hamilton cycle more generally. For oriented graphs, we provide a minimum semi-degree condition that forces the $k$th power of a Hamilton cycle; although this minimum semi-degree condition is not tight, it does provide the correct order of magnitude of the threshold. Turán-type problems for oriented graphs are also discussed.
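For context, the graph version of the conjecture, proved for sufficiently large $n$ by Komlós, Sárközy, and Szemerédi as noted above, can be stated as follows: every graph $G$ on $n$ vertices with
$$
\delta(G) \;\ge\; \frac{k}{k+1}\, n
$$
contains the $k$th power of a Hamilton cycle, where the $k$th power of a cycle joins every pair of vertices at distance at most $k$ along the cycle. The digraph and oriented-graph problems studied in the paper replace this minimum degree condition with total degree and semi-degree conditions, respectively.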