
An Improved Method of Automated Nonparametric Content Analysis for Social Science

Published online by Cambridge University Press:  07 January 2022

Connor T. Jerzak
Ph.D. Candidate and Carl J. Friedrich Fellow, Department of Government, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA. E-mail: cjerzak@g.harvard.edu. URL: https://ConnorJerzak.com

Gary King (corresponding author)
Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA. URL: https://GaryKing.org

Anton Strezhnev
Assistant Professor, Department of Political Science, University of Chicago, 5828 S. University Avenue, Chicago, IL 60637, USA. E-mail: astrezhnev@uchicago.edu. URL: https://antonstrezhnev.com

Abstract

Some scholars build models to classify documents into chosen categories. Others, especially social scientists who tend to focus on population characteristics, instead usually estimate the proportion of documents in each category—using either parametric “classify-and-count” methods or “direct” nonparametric estimation of proportions without individual classification. Unfortunately, classify-and-count methods can be highly model-dependent or generate more bias in the proportions even as the percent of documents correctly classified increases. Direct estimation avoids these problems, but can suffer when the meaning of language changes between training and test sets or is too similar across categories. We develop an improved direct estimation approach without these issues by including and optimizing continuous text features, along with a form of matching adapted from the causal inference literature. Our approach substantially improves performance in a diverse collection of 73 datasets. We also offer easy-to-use software that implements all ideas discussed herein.
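The core of the "direct" approach described above can be illustrated with a small numerical sketch. In quantification methods of this family, one never classifies individual documents; instead, one exploits the accounting identity P(S) = P(S | D) P(D), where P(S) is the observed distribution of feature profiles in the test corpus, P(S | D) is estimated from the labeled training set, and P(D) is the vector of category proportions to be recovered. The code below is a minimal, synthetic-data sketch of this idea, not the authors' implementation: all variable names, the choice of nonnegative least squares as the solver, and the data-generating process are illustrative assumptions.

```python
# Sketch of direct (classify-free) proportion estimation: recover the
# category proportions P(D) from the identity  P(S) = P(S|D) @ P(D),
# using nonnegative least squares plus renormalization.
# All data here are synthetic; names are illustrative only.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

K = 3  # number of document categories
F = 8  # number of feature profiles (e.g., word-stem patterns)

# Conditional feature distributions P(S | D): one column per category,
# here drawn at random in place of estimates from a labeled training set.
P_S_given_D = rng.dirichlet(np.ones(F), size=K).T  # shape (F, K)

# True (unknown, to-be-estimated) category proportions in the test set.
true_props = np.array([0.2, 0.5, 0.3])

# Observed marginal feature distribution in the unlabeled test corpus.
P_S = P_S_given_D @ true_props

# Direct estimation: solve for nonnegative proportions, then renormalize
# so the estimate lies on the simplex (sums to one).
est, _ = nnls(P_S_given_D, P_S)
est = est / est.sum()
```

Because no document-level classification step intervenes, the estimate of P(D) does not inherit individual-level classification error; the failure modes the article targets instead arise when P(S | D) shifts between training and test sets or is too similar across categories, which is what the continuous-feature optimization and matching steps address.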

Type
Article
Copyright
© The Author(s) 2022. Published by Cambridge University Press on behalf of the Society for Political Methodology


Footnotes

Edited by Jeff Gill

Supplementary material

Jerzak et al. dataset: available online via link.
Jerzak et al. supplementary material: PDF, 487.9 KB.