
Detecting Fake People in Historical Records

Published online by Cambridge University Press: 17 December 2025

Neil Duzett
Affiliation: Texas A&M University, College Station, TX, USA
Tammy Hepps
Affiliation: Storyworth, USA
Allen Otterstrom
Affiliation: University of Chicago Booth School of Business, Chicago, IL, USA
Joseph Price*
Affiliation: Brigham Young University, Provo, UT, USA
*Corresponding author: Joseph Price; Email: joe_price@byu.edu

Abstract

Data quality is a key input in efforts to link individuals across census records. We examine the extreme case of low data quality by identifying US census enumerators who fabricated entire families. We provide clear evidence of fake people included in the 1920 US Census for Homestead, Pennsylvania. We use the features of this case study to identify other places where information in the census may have been falsified. We develop an automated approach that identifies census sheets that have much lower match rates to other census records than would be expected, given the characteristics of the people recorded on each sheet. We perform a hand-check on the suspicious sheets using standard genealogy tools and identify at least 90 sheets where the entire census sheet appears to have been fabricated.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Social Science History Association

Figure 1. Enumeration pay rate table. Notes: This table shows how much a census enumerator would be compensated for their work. Their enumeration districts were given a specific designation, and they were paid accordingly. One group of designations resulted in pay entirely based on the number of people enumerated, one was entirely per diem, and one was a mixed rate: partly per diem and partly per person. The right columns show how many districts fall into each category. Source: Annual Report of the Director of the Census to the Secretary of Commerce for the Fiscal Year Ended June 30, 1920. 1920. https://search.proquest.com/docview/57950003.

Figure 2. Mapped households in Enumeration District 144. Notes: This map depicts the households in Silverstein’s enumeration. The medium and darkest green pins on the right represent households matched to other records; the lightest green pins on the left represent households with no matching records. (Silverstein was not the only confused enumerator in Homestead: a medium green pin indicates that the household was duplicated in one other enumeration district, and a dark green pin indicates duplication in two other enumeration districts.) The light red pins indicate unmatched households where there is not enough information to be sure that the household was fabricated, leaving the darker red pins as the definite fakes. The black line shows the boundary of Enumeration District 144, Silverstein’s assigned area. All of the pins outside the boundary are households he was not supposed to canvass. Thirty-two of Silverstein’s fabricated households are listed with addresses that never existed. These households, therefore, cannot appear on this map.

Figure 3. Mapped households in the 1918 Homestead directory. Note: A side-by-side comparison of this map of all the households in the 1918 Homestead city directory with Silverstein’s enumeration of the same blocks (in Figure 2) shows just how many households he skipped entirely and did not even attempt to represent with fabricated data.

Figure 4. 1920 sheet match scores. Notes: This figure shows the number of 1920 Census sheets that fall at each level of match score. Match score is a measure of how well connected the people listed on a 1920 Census sheet are to other census years. For example, a sheet with a match score of 0.6 has people who, on average, appear in 60 percent of the census years that they are expected to be in.
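
The match score described in the Figure 4 notes can be illustrated with a minimal sketch. The record layout (`expected_years`, `linked_years`) and the function names are hypothetical and chosen only for illustration; the rule for deciding which census years a person is expected to appear in is not given here, so the expected years are simply supplied as input.

```python
from statistics import mean

def person_match_rate(expected_years, linked_years):
    """Share of a person's expected census years in which a linked record exists."""
    expected = set(expected_years)
    if not expected:
        return None  # no other census year is expected for this person
    return len(expected & set(linked_years)) / len(expected)

def sheet_match_score(people):
    """Average the per-person match rates over everyone listed on one sheet."""
    rates = [person_match_rate(p["expected_years"], p["linked_years"]) for p in people]
    rates = [r for r in rates if r is not None]
    return mean(rates) if rates else None

# Hypothetical sheet with three people (illustrative values, not real data).
sheet = [
    {"expected_years": [1900, 1910, 1930, 1940], "linked_years": [1910, 1930]},
    {"expected_years": [1910, 1930, 1940], "linked_years": [1910, 1930, 1940]},
    {"expected_years": [1930, 1940], "linked_years": []},
]
score = sheet_match_score(sheet)
print(score)
```

Here the three people have match rates of 0.5, 1.0, and 0.0, so the sheet's match score is their average, 0.5.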

Table 1. Factors used to predict census sheet match scores

Figure 5. Predicted Match Score Residuals. Notes: This figure shows the distribution of Predicted Match Score Residuals. These are found by taking the difference between the true Match Score for a 1920 census sheet and its predicted Match Score. Predicted Match Score is found by using the coefficients of a regression of Match Score on various predictors. A residual close to 0 means that a sheet had similar true and predicted Match Scores.
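
The residual described in the Figure 5 notes can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two predictors (share foreign-born and mean age), the data values, and the cutoff of -0.3 are all hypothetical, and ordinary least squares is solved in plain Python only to keep the example dependency-free (a statistics library would normally do this step).

```python
def ols_fit(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    # Back substitution.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (M[i][k] - sum(M[i][j] * b[j] for j in range(i + 1, k))) / M[i][i]
    return b

def predict(b, x):
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

# Hypothetical sheet-level predictors: (share foreign-born, mean age).
X = [(0.1, 25), (0.2, 35), (0.3, 28), (0.4, 40),
     (0.5, 30), (0.6, 38), (0.7, 27), (0.35, 32)]
# Hypothetical true match scores; the last sheet is far lower than its peers.
y = [0.81, 0.75, 0.724, 0.66, 0.64, 0.584, 0.566, 0.196]

coef = ols_fit(X, y)
# Residual = true match score minus predicted match score.
residuals = [yi - predict(coef, xi) for xi, yi in zip(X, y)]

CUTOFF = -0.3  # hypothetical threshold for "much lower than expected"
flagged = [i for i, r in enumerate(residuals) if r < CUTOFF]
print(flagged)
```

Sheets with strongly negative residuals match far worse than their characteristics predict, which makes them candidates for a hand-check. In this toy data only the last sheet falls below the cutoff.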

Figure 6. 1920 Census Manuscript Example. Notes: This figure shows an example of the 1920 Decennial Census of Population and Housing. Source: United States Census Bureau