
Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Matthew J. Denny and Arthur Spirling


Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by characterizing the variability that changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
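
As a rough illustration of what checking sensitivity to preprocessing can look like, the Python sketch below builds a bag-of-words representation under every combination of three binary preprocessing choices and reports how far the pairwise document distances move from a no-preprocessing baseline. It is not the authors' procedure or the preText software: the toy corpus, the stopword list, the choice of steps, the cosine distance, and all function names are assumptions made for illustration.

```python
import itertools
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "The Committee voted on the 2017 budget.",
    "The committee VOTED against tax cuts!",
    "Tax cuts and the budget: 3 key votes.",
]
STOPWORDS = {"the", "on", "and", "against"}
STEPS = ("lowercase", "remove_punctuation", "remove_stopwords")


def preprocess(text, lowercase, remove_punctuation, remove_stopwords):
    """Apply the selected steps, then tokenize on whitespace."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def cosine_distance(a, b):
    """Cosine distance between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - dot / norm if norm else 1.0


def pairwise_distances(docs, **choices):
    """Distances between all document pairs under one preprocessing regime."""
    counts = [Counter(preprocess(d, **choices)) for d in docs]
    pairs = itertools.combinations(range(len(docs)), 2)
    return [cosine_distance(counts[i], counts[j]) for i, j in pairs]


def sensitivity(docs):
    """Mean absolute shift in pairwise distances vs. the no-preprocessing baseline."""
    baseline = pairwise_distances(docs, **dict.fromkeys(STEPS, False))
    results = {}
    for flags in itertools.product([False, True], repeat=len(STEPS)):
        dists = pairwise_distances(docs, **dict(zip(STEPS, flags)))
        shift = sum(abs(d - b) for d, b in zip(dists, baseline)) / len(baseline)
        results[flags] = shift
    return results


if __name__ == "__main__":
    for flags, shift in sensitivity(DOCS).items():
        active = [s for s, on in zip(STEPS, flags) if on] or ["none"]
        print(f"{'+'.join(active):50s} mean distance shift = {shift:.3f}")
```

A regime whose shift is large relative to the others is one where the preprocessing choice materially changes which documents look similar, which is the kind of variability the abstract warns about.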



Authors’ note: We thank Will Lowe, Scott de Marchi and Brandon Stewart for comments on an earlier draft, and Pablo Barbera for providing the Twitter data used in this paper. Audiences at New York University, University of California San Diego, the Political Methodology meeting (2017), Duke University, University of Michigan, and the International Methods Colloquium provided helpful comments. Suggestions from the editor of Political Analysis, and two anonymous referees, allowed us to improve our article considerably. This research was supported by the National Science Foundation under IGERT Grant DGE-1144860. Replication data for this paper are available via Denny and Spirling (2017). preText software available here:

Contributing Editor: R. Michael Alvarez



References
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.
Buckland, S. T., Burnham, K. P., and Augustin, N. H. 1997. Model selection: An integral part of inference. Biometrics 53(2):603–618.
Catalinac, Amy. 2016. Pork to policy: The rise of programmatic campaigning in Japanese elections. Journal of Politics 78(1):1–18.
Chang, Jonathan, Gerrish, Sean, Wang, Chong, Boyd-Graber, Jordan L., and Blei, David M. 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems 22, ed. Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. Curran Associates, Inc., pp. 288–296.
Denny, Matthew, and Spirling, Arthur. 2017. “Dataverse replication data for: Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it.”
Diermeier, Daniel, Godbout, Jean-François, Yu, Bei, and Kaufmann, Stefan. 2011. Language and ideology in Congress. British Journal of Political Science 42(01):31–55.
D’Orazio, Vito, Landis, Steven, Palmer, Glenn, and Schrodt, Philip. 2014. Separating the wheat from the chaff: Applications of automated document classification using support vector machines. Political Analysis 22(2):224–242.
Gelman, Andrew. 2013. Preregistration of studies and mock reports. Political Analysis 21(1):40–41.
Gelman, Andrew, and Loken, Eric. 2014. The statistical crisis in science. American Scientist 102(6):460–465.
Grimmer, J. 2010. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis 18(1):1–35.
Grimmer, Justin, and Stewart, Brandon M. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–297.
Grimmer, Justin, and King, Gary. 2011. General purpose computer-assisted clustering and conceptualization. Proceedings of the National Academy of Sciences of the United States of America 108(7):2643–2650.
Handler, Abram, Denny, Matthew J., Wallach, Hanna, and O’Connor, Brendan. 2016. Bag of what? Simple noun phrase extraction for text analysis. Proceedings of the workshop on natural language processing and computational social science at the 2016 conference on empirical methods in natural language processing.
Hopkins, Daniel, and King, Gary. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):229–247.
James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An introduction to statistical learning. New York: Springer.
Jensen, David D., and Cohen, Paul R. 2000. Multiple comparisons in induction algorithms. Machine Learning 38(3):309–338.
Jones, Tudor. 1996. Remaking the Labour Party: From Gaitskell to Blair. New York: Routledge.
Jurafsky, Daniel, and Martin, James H. 2008. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.
Justeson, John S., and Katz, Slava M. 1995. Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(01):9–27.
Kavanagh, Dennis. 1997. The reordering of British politics: Politics after Thatcher. Oxford University Press.
King, Gary, Lam, Patrick, and Roberts, Margaret E. 2017. Computer-assisted keyword and document set discovery from unstructured text. American Journal of Political Science. Preprint.
Lauderdale, Benjamin, and Herzog, Alexander. 2016. Measuring political positions from legislative speech. Political Analysis 24(2):1–21.
Laver, Michael, Benoit, Kenneth, and Garry, John. 2003. Extracting policy positions from political texts using words as data. The American Political Science Review 97(2):311–331.
Lowe, Will. 2008. Understanding wordscores. Political Analysis 16(4 SPEC. ISS.):356–371.
Lowe, Will, and Benoit, Kenneth. 2013. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis 21(3):298–313.
Manning, Christopher D., and Schütze, Hinrich. 1999. Foundations of statistical natural language processing. MIT Press.
Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. 2008. An introduction to information retrieval. Cambridge: Cambridge University Press.
Monroe, Burt L., Colaresi, Michael P., and Quinn, Kevin M. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16:372–403.
Moore, Ryan, Powell, Elinor, and Reeves, Andrew. 2013. Driving support: Workers, PACs, and congressional support of the auto industry. Business and Politics 15(2):137–162.
Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 79–86.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3):130–137.
Proksch, Sven-Oliver, and Slapin, Jonathan B. 2010. Position taking in European Parliament speeches. British Journal of Political Science 40(03):587–611.
Pugh, Martin. 2011. Speak for Britain!: A new history of the Labour Party. New York: Random House.
Quinn, Kevin M., Monroe, Burt L., Colaresi, Michael, Crespin, Michael H., and Radev, Dragomir R. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1):209–228.
Roberts, Margaret E., Stewart, Brandon M., Tingley, Dustin, Lucas, Christopher, Leder-Luis, Jetson, Gadarian, Shana Kushner, Albertson, Bethany, and Rand, David G. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science 58(4):1064–1082.
Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1–47.
Slapin, Jonathan B., and Proksch, Sven-Oliver. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52:705–722.
Spirling, Arthur. 2012. U.S. treaty making with American Indians: Institutional change and relative power, 1784–1911. American Journal of Political Science 56(1):84–97.
Steegen, Sara, Tuerlinckx, Francis, Gelman, Andrew, and Vanpaemel, Wolf. 2016. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science 11(5):702–712.
Wallach, Hanna M., Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. 2009. Evaluation methods for topic models. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09 (4):1–8.
Yano, Tae, Smith, Noah A., and Wilkerson, John D. 2012. Textual predictors of bill survival in congressional committees. Conference of the North American chapter of the association for computational linguistics, pp. 793–802.


Supplementary materials

Denny and Spirling supplementary material 1: Online Appendix (185 KB)
