Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Published online by Cambridge University Press:  19 March 2018

Matthew J. Denny
Affiliation:
203 Pond Lab, Pennsylvania State University, University Park, PA 16802, USA. Email: mdenny@psu.edu
Arthur Spirling
Affiliation:
Office 405, 19 West 4th St., New York University, New York, NY 10012, USA. Email: arthur.spirling@nyu.edu
Corresponding author

Abstract

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examine the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by characterizing the variability that changes in preprocessing choices may induce when analyzing a particular dataset. By making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it also aids replication efforts.
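The procedure the abstract describes is implemented in the authors’ preText R package (linked in the footnotes). As a purely illustrative sketch of the underlying idea, not the authors’ actual implementation, the following Python snippet enumerates all combinations of a few common preprocessing steps in a factorial design, producing one token stream per regime; the function names, the toy stopword list, and the crude suffix-stripping stand-in for a real stemmer are all hypothetical.

```python
from itertools import product

def preprocess(text, lowercase, remove_stopwords, stem):
    """Apply a chosen combination of preprocessing steps to one document."""
    stopwords = {"the", "a", "of", "and"}  # toy list for illustration
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in stopwords]
    if stem:
        # crude stand-in for a real stemmer (e.g., Porter 1980)
        tokens = [t[:-1] if t.endswith("s") else t for t in tokens]
    return tokens

# Enumerate all 2^3 = 8 preprocessing regimes (a factorial design).
steps = ["lowercase", "remove_stopwords", "stem"]
regimes = [dict(zip(steps, flags)) for flags in product([False, True], repeat=3)]

doc = "The Cats of the House and the Dogs"
# Map each regime (keyed by the steps it enables) to the resulting tokens.
variants = {tuple(sorted(k for k, v in r.items() if v)): preprocess(doc, **r)
            for r in regimes}
```

A sensitivity analysis in the spirit of the paper would then build a document-feature matrix for each regime and compare how pairwise document distances (and downstream unsupervised results) shift across the eight variants.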

Type
Articles
Copyright
Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology. 


Footnotes

Authors’ note: We thank Will Lowe, Scott de Marchi, and Brandon Stewart for comments on an earlier draft, and Pablo Barbera for providing the Twitter data used in this paper. Audiences at New York University, University of California San Diego, the Political Methodology meeting (2017), Duke University, University of Michigan, and the International Methods Colloquium provided helpful comments. Suggestions from the editor of Political Analysis and two anonymous referees allowed us to improve our article considerably. This research was supported by the National Science Foundation under IGERT Grant DGE-1144860. Replication data for this paper are available via Denny and Spirling (2017). The preText software is available at github.com/matthewjdenny/preText.

Contributing Editor: R. Michael Alvarez

References

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.
Buckland, S. T., Burnham, K. P., and Augustin, N. H. 1997. Model selection: An integral part of inference. Biometrics 53(2):603–618.
Catalinac, Amy. 2016. Pork to policy: The rise of programmatic campaigning in Japanese elections. Journal of Politics 78(1):1–18.
Chang, Jonathan, Gerrish, Sean, Wang, Chong, Boyd-Graber, Jordan L., and Blei, David M. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22, ed. Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. Curran Associates, Inc., pp. 288–296.
Denny, Matthew, and Spirling, Arthur. 2017. “Dataverse replication data for: Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it.” http://dx.doi.org/10.7910/DVN/XRR0HM.
Diermeier, Daniel, Godbout, Jean-François, Yu, Bei, and Kaufmann, Stefan. 2011. Language and ideology in Congress. British Journal of Political Science 42(1):31–55.
D’Orazio, Vito, Landis, Steven, Palmer, Glenn, and Schrodt, Philip. 2014. Separating the wheat from the chaff: Applications of automated document classification using support vector machines. Political Analysis 22(2):224–242.
Gelman, Andrew. 2013. Preregistration of studies and mock reports. Political Analysis 21(1):40–41.
Gelman, Andrew, and Loken, Eric. 2014. The statistical crisis in science. American Scientist 102(6):460–465.
Grimmer, J. 2010. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis 18(1):1–35.
Grimmer, Justin, and Stewart, Brandon M. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–297.
Grimmer, Justin, and King, Gary. 2011. General purpose computer-assisted clustering and conceptualization. Proceedings of the National Academy of Sciences of the United States of America 108(7):2643–2650.
Handler, Abram, Denny, Matthew J., Wallach, Hanna, and O’Connor, Brendan. 2016. Bag of what? Simple noun phrase extraction for text analysis. Proceedings of the Workshop on Natural Language Processing and Computational Social Science at the 2016 Conference on Empirical Methods in Natural Language Processing, https://brenocon.com/handler2016phrases.pdf.
Hopkins, Daniel, and King, Gary. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):229–247.
James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An Introduction to Statistical Learning. New York: Springer.
Jensen, David D., and Cohen, Paul R. 2000. Multiple comparisons in induction algorithms. Machine Learning 38(3):309–338.
Jones, Tudor. 1996. Remaking the Labour Party: From Gaitskell to Blair. New York: Routledge.
Jurafsky, Daniel, and Martin, James H. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall.
Justeson, John S., and Katz, Slava M. 1995. Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(1):9–27.
Kavanagh, Dennis. 1997. The Reordering of British Politics: Politics after Thatcher. Oxford University Press.
King, Gary, Lam, Patrick, and Roberts, Margaret E. 2017. Computer-assisted keyword and document set discovery from unstructured text. American Journal of Political Science. Preprint, https://gking.harvard.edu/files/gking/files/ajps12291_final.pdf.
Lauderdale, Benjamin, and Herzog, Alexander. 2016. Measuring political positions from legislative speech. Political Analysis 24(2):1–21.
Laver, Michael, Benoit, Kenneth, and Garry, John. 2003. Extracting policy positions from political texts using words as data. The American Political Science Review 97(2):311–331.
Lowe, Will. 2008. Understanding Wordscores. Political Analysis 16(4):356–371.
Lowe, Will, and Benoit, Kenneth. 2013. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis 21(3):298–313.
Manning, Christopher D., and Schütze, Hinrich. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. 2008. An Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Monroe, Burt L., Colaresi, Michael P., and Quinn, Kevin M. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16:372–403.
Moore, Ryan, Powell, Elinor, and Reeves, Andrew. 2013. Driving support: Workers, PACs, and congressional support of the auto industry. Business and Politics 15(2):137–162.
Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3):130–137.
Proksch, Sven-Oliver, and Slapin, Jonathan B. 2010. Position taking in European Parliament speeches. British Journal of Political Science 40(3):587–611.
Pugh, Martin. 2011. Speak for Britain!: A New History of the Labour Party. New York: Random House.
Quinn, Kevin M., Monroe, Burt L., Colaresi, Michael, Crespin, Michael H., and Radev, Dragomir R. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1):209–228.
Roberts, Margaret E., Stewart, Brandon M., Tingley, Dustin, Lucas, Christopher, Leder-Luis, Jetson, Gadarian, Shana Kushner, Albertson, Bethany, and Rand, David G. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science 58(4):1064–1082.
Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1–47.
Slapin, Jonathan B., and Proksch, Sven-Oliver. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52:705–722.
Spirling, Arthur. 2012. U.S. treaty making with American Indians: Institutional change and relative power, 1784–1911. American Journal of Political Science 56(1):84–97.
Steegen, Sara, Tuerlinckx, Francis, Gelman, Andrew, and Vanpaemel, Wolf. 2016. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science 11(5):702–712.
Wallach, Hanna M., Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. 2009. Evaluation methods for topic models. Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), pp. 1–8. http://portal.acm.org/citation.cfm?doid=1553374.1553515.
Yano, Tae, Smith, Noah A., and Wilkerson, John D. 2012. Textual predictors of bill survival in congressional committees. Conference of the North American Chapter of the Association for Computational Linguistics, pp. 793–802.
Supplementary material

Denny and Spirling supplementary material 1: Online Appendix (File, 185 KB)