
Using Digitized Newspapers to Address Measurement Error in Historical Data

Published online by Cambridge University Press:  19 January 2024

Andreas Ferrara*
Affiliation:
Assistant Professor, University of Pittsburgh, Department of Economics, 4906 Wesley W. Posvar Hall, 230 South Bouquet Street, Pittsburgh, PA 15260 and NBER.
Joung Yeob Ha
Affiliation:
Ph.D. Candidate, University of Pittsburgh, Department of Economics, 4512 Wesley W. Posvar Hall, 230 South Bouquet Street, Pittsburgh, PA 15260. E-mail address: joh106@pitt.edu.
Randall Walsh
Affiliation:
Professor, University of Pittsburgh, Department of Economics, 4166 Wesley W. Posvar Hall, 230 South Bouquet Street, Pittsburgh, PA 15260 and NBER. E-mail address: walshr@pitt.edu.
* E-mail address: a.ferrara@pitt.edu

Abstract

This paper shows how to remove attenuation bias in regression analyses due to measurement error in historical data for a given variable of interest by using a secondary measure that can be easily generated from digitized newspapers. We provide three methods for using this secondary variable to deal with non-classical measurement error in a binary treatment: set identification, bias reduction via sample restriction, and a parametric bias correction. We demonstrate the usefulness of our methods by replicating four recent economic history papers. Relative to the initial analyses, our results yield markedly larger coefficient estimates.
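The attenuation described in the abstract can be illustrated with a small toy example. The sketch below is not the authors' procedure and uses hypothetical data: it assumes a noiseless outcome, a true effect of 2, and two misclassified units out of ten. With a binary regressor, the OLS slope equals the difference in group means, so the shrinkage caused by misclassification can be computed directly.

```python
# Toy illustration of attenuation from a misclassified binary treatment.
# All numbers are hypothetical and chosen for easy arithmetic.

BETA = 2.0  # assumed true treatment effect

# True treatment status and an observed (error-ridden) version of it.
true_treat = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
obs_treat = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]  # units 4 and 5 are misclassified

# Noiseless outcome generated from the TRUE treatment.
outcome = [1.0 + BETA * d for d in true_treat]


def diff_in_means(y, d):
    """OLS slope of y on a binary regressor d (difference in group means)."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def share_truly_treated(true_d, obs_d, obs_value):
    """Share of truly treated units among those observed with obs_value."""
    group = [t for t, o in zip(true_d, obs_d) if o == obs_value]
    return sum(group) / len(group)


slope_true = diff_in_means(outcome, true_treat)  # recovers BETA
slope_obs = diff_in_means(outcome, obs_treat)    # shrunk toward zero

# Attenuation factor: P(true=1 | obs=1) - P(true=1 | obs=0).
factor = (share_truly_treated(true_treat, obs_treat, 1)
          - share_truly_treated(true_treat, obs_treat, 0))

print(slope_true, slope_obs, factor)
```

In this toy data the slope on the mismeasured treatment is about 1.2 rather than 2, which matches the true effect multiplied by the attenuation factor of about 0.6.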

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of the Economic History Association

Figure 1 ERRORS ON THE USDA MAP FOR THE ARRIVAL OF THE BOLL WEEVIL
Notes: Snippet of the USDA map for the arrival of the boll weevil provided by Hunter and Coad (1923). Each solid line marks the arrival year of the pest. Researchers typically overlay the lines onto a map of Southern counties and determine the arrival date by the line that covers most of the county area. The highlighted areas are where date lines cross in contradictory ways.
Sources: Hunter and Coad (1923).

Figure 2 EVENT STUDY PLOT—TWFE AND SUN AND ABRAHAM (2021)
Notes: Coefficient plot from an event study regression of %BW on event indicators relative to the arrival of the boll weevil according to the USDA map, with county and state-by-year fixed effects. Circles and diamonds show the estimates of β in Equation (2) using OLS and the estimator proposed by Sun and Abraham (2021), respectively. The sample consists of 911 infested counties in 13 Southern states. The omitted baseline period is ℓ = –1, one year before the arrival date on the USDA map. For the Sun and Abraham (2021) estimates, the relative time period for the latest-infested counties is also omitted because the sample contains no never-infested counties. Standard errors are clustered at the county level, and 95 percent confidence intervals are reported around the point estimates.
Sources: Authors’ calculations from data in Hunter and Coad (1923) and Newspapers.com.

Figure 3 CHARACTERISTICS OF THE NEWSPAPER-BASED BOLL WEEVIL MEASURES
Notes: Panel (a) plots the newspaper-based boll weevil measure (dashed line) and its smoothed five-year moving average (solid line) for Marion County between 1882 and 1932. The vertical lines mark the boll weevil’s arrival year according to the USDA map and the predicted arrival year, at which MA(5) is highest. Panel (b) shows a histogram of the differences between the USDA map arrival year and the arrival year predicted by the maximum of the MA(5) measure constructed from the newspaper data. This is a cross-sectional comparison of the two measures for the 911 Southern counties that were ever infested by the boll weevil, and it summarizes the average difference between the USDA and newspaper-based arrival dates.
Sources: Authors’ calculations from data in Hunter and Coad (1923) and Newspapers.com.
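The predicted arrival year described in the notes to Figure 3 (the year in which the five-year moving average peaks) can be sketched as follows. This is a minimal illustration, not the authors' code: the yearly series is hypothetical, the function names are invented for this example, and a trailing window is assumed because the caption does not say whether MA(5) is trailing or centered.

```python
def moving_average(values, window=5):
    """Trailing moving average; positions without a full window are None."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window:i + 1]) / window)
    return averages


def predicted_arrival(years, values, window=5):
    """Return the year in which the moving average is highest.

    Ties are broken in favor of the earliest year (an assumption).
    """
    smoothed = moving_average(values, window)
    candidates = [(avg, -year) for year, avg in zip(years, smoothed)
                  if avg is not None]
    best_avg, neg_year = max(candidates)
    return -neg_year


# Hypothetical newspaper-based measure for one county, 1900 to 1911.
years = list(range(1900, 1912))
measure = [0, 0, 1, 0, 2, 5, 9, 12, 10, 6, 3, 1]

usda_year = 1907  # hypothetical map-based arrival year
newspaper_year = predicted_arrival(years, measure)
difference = newspaper_year - usda_year

print(newspaper_year, difference)
```

With these made-up numbers the moving average peaks in 1909, two years after the hypothetical map date. A trailing window shifts the peak later than a centered one would, so the window alignment matters when comparing the two dates.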

Table 1 TESTING FOR OBSERVABLE DETERMINANTS OF THE DIFFERENCES BETWEEN THE USDA AND NEWSPAPER-BASED BOLL WEEVIL ARRIVAL DATES

Table 2 REPLICATION OF CLAY, SCHMICK, AND TROESKEN (2019)—MAIN EFFECTS

Table 3 REPLICATION OF CLAY, SCHMICK, AND TROESKEN (2019)—MARGINAL EFFECTS AT THE 75TH PERCENTILE

Table 4 REPLICATION OF AGER, BRUECKNER, AND HERZ (2017)—MAIN EFFECT

Table 5 REPLICATION OF AGER, BRUECKNER, AND HERZ (2017)—MARGINAL EFFECTS AT THE 75TH PERCENTILE

Table 6 REPLICATION OF HILT AND RAHN (2020) AND HOWARD AND ORNAGHI (2021)

Supplementary material: File

Ferrara et al. supplementary material

Online Appendix
