
The Prevalence and Severity of Underreporting Bias in Machine- and Human-Coded Data

  • Benjamin E. Bagozzi, Patrick T. Brandt, John R. Freeman, Jennifer S. Holmes, Alisha Kim, Agustin Palao Mendizabal and Carly Potz-Nielsen...

Textual data are plagued by underreporting bias. For example, news sources often fail to report human rights violations. Cook et al. propose a multi-source estimator to gauge, and to account for, the underreporting of state repression events within human codings of news texts produced by the Agence France-Presse and Associated Press. We evaluate this estimator with Monte Carlo experiments, and then use it to compare the prevalence and seriousness of underreporting when comparable texts are machine coded and recorded in the World-Integrated Crisis Early Warning System dataset. We replicate Cook et al.’s investigation of human-coded state repression events with our machine-coded events, and validate both models against an external measure of human rights protections in Africa. We then use the Cook et al. estimator to gauge the seriousness and prevalence of underreporting in machine- and human-coded event data on human rights violations in Colombia. We find in both applications that machine-coded data are as valid as human-coded data.

Benjamin E. Bagozzi, Department of Political Science & International Relations, University of Delaware, 405 Smith Hall, 18 Amstel Ave, Newark, DE 19716. Patrick T. Brandt, Jennifer S. Holmes, Alisha Kim and Agustin Palao Mendizabal, School of Economic, Political and Policy Sciences, University of Texas, Dallas, 800 W. Campbell Rd, GR31 Richardson TX 75080. John R. Freeman and Carly Potz-Nielsen, Department of Political Science, University of Minnesota, 1414 Social Sciences, 267 19th Ave S., Minneapolis, MN 55455. An earlier version of this paper was presented as a poster at the 34th Annual Meeting of the Political Methodology Society. This research is supported by NSF Grant Number SBE-SMA-1539302. The authors thank Associate Editor Daniel Stegmueller, two anonymous reviewers, as well as Scott Cook, Mark Nieman, and Vito D’Orazio for their helpful comments and suggestions. To view supplementary material for this article, please visit

Bagozzi, Benjamin E., and Berliner, Daniel. 2017. ‘The Politics of Scrutiny in Human Rights Monitoring: Evidence from Structural Topic Models of U.S. State Department Human Rights Reports’. Political Science Research and Methods, 1–17.
Beieler, John, Brandt, Patrick T., Halterman, Andrew, Schrodt, Philip A., and Simpson, Erin M. 2016. ‘Generating Political Event Data in Near Real Time: Opportunities and Challenges’. In R. Michael Alvarez (ed.), Computational Social Science: Discovery and Prediction, 98–120. New York: Cambridge University Press.
Boschee, Elizabeth, Lautenschlager, Jennifer, O’Brien, Sean, Shellman, Steve, Starz, James, and Ward, Michael. 2016. ‘ICEWS Coded Event Data’, Harvard Dataverse. Available at, accessed 8 November 2016.
Centro de Investigación y Educación Popular (CINEP). 2008. ‘Marco Conceptual: Banco de Datos de Derechos Humanos y Violencia Política’. CINEP, Bogotá, Colombia.
Cook, Scott J., Blas, Betsabe, Carroll, Raymond J., and Sinha, Samiran. 2017. ‘Two Wrongs Don’t Make a Right: Addressing Underreporting in Binary Data from Multiple Sources’. Political Analysis 25(2):223–240.
Fariss, Christopher J. 2014. ‘Respect for Human Rights has Improved Over Time: Modeling the Changing Standard of Accountability’. American Political Science Review 108(2):297–316.
Grimmer, Justin, and Stewart, Brandon M. 2013. ‘Text as Data: The Promise and Pitfalls of Automated Content Analysis Methods for Political Texts’. Political Analysis 21(3):267–297.
Hendrix, Cullen S., Salehyan, Idean, Hamner, Jesse, Case, Christina, Linebarger, Christopher, Stull, Emily, and Williams, Jennifer. 2012. ‘Social Conflict in Africa: A New Database’. International Interactions 38(4):503–511.
King, Gary, and Lowe, Will. 2003. ‘An Automated Information Extraction Tool for International Conflict Data with Performance as Good As Human Coders’. International Organization 57(3):617–642.
Laver, Michael, Benoit, Kenneth, and Garry, John. 2003. ‘Extracting Policy Positions from Political Texts Using Words as Data’. American Political Science Review 97(2):311–331.
Schrodt, Philip A., and Van Brackle, David. 2013. ‘Automated Coding of Political Event Data’. In V. S. Subrahmanian (ed.), Handbook of Computational Approaches to Counterterrorism, 23–49. New York: Springer Press.
Slapin, Jonathan B., and Proksch, Sven-Oliver. 2008. ‘A Scaling Model for Estimating Time-Series Party Positions from Texts’. American Journal of Political Science 52(3):705–722.
Sundberg, Ralph, and Melander, Erik. 2013. ‘Introducing the UCDP Georeferenced Event Dataset’. Journal of Peace Research 50(4):523–532.
Ward, Michael D., and Beger, Andreas. 2017. ‘Lessons from Near Real-Time Forecasting of Irregular Leadership Changes’. Journal of Peace Research 54(2):141–156.
Weidmann, Nils B. 2015. ‘On the Accuracy of Media-Based Conflict Event Data’. Journal of Conflict Resolution 59(6):1129–1149.
Political Science Research and Methods
  • ISSN: 2049-8470
  • EISSN: 2049-8489
  • URL: /core/journals/political-science-research-and-methods