Separating the Wheat from the Chaff: Applications of Automated Document Classification Using Support Vector Machines

Vito D'Orazio; Steven T. Landis; Glenn Palmer; Philip Schrodt

doi:10.1093/pan/mpt030

Published online by Cambridge University Press: 04 January 2017

Vito D'Orazio*: Institute for Quantitative Social Science, Harvard University. e-mail: dorazio@iq.harvard.edu (corresponding author)
Steven T. Landis: National Center for Atmospheric Research, FL2-2096, 3450 Mitchell Lane, Boulder, CO 80301. e-mail: landis.steven@gmail.com
Glenn Palmer: Department of Political Science, Pennsylvania State University. e-mail: gpalmer@psu.edu
Philip Schrodt: Parus Analytical Systems, 100 N. Patterson St., State College, PA 16801. e-mail: schrodt735@gmail.com

Abstract

Due in large part to the proliferation of digitized text, much of it available for little or no cost from the Internet, political science research has experienced a substantial increase in the number of data sets and large-n research initiatives. As the ability to collect detailed information on events of interest expands, so does the need to efficiently sort through the volumes of available information. Automated document classification presents a particularly attractive methodology for accomplishing this task. It is efficient, applicable to a wide variety of data collection efforts, and flexible enough to be tailored to specific research needs. This article reviews the application of automated document classification to data collection in political science research, discussing the process in its entirety. We argue that a two-stage support vector machine (SVM) classification process offers advantages over other well-known alternatives because SVMs are discriminative classifiers that can effectively address two primary attributes of textual data: high dimensionality and extreme sparseness. Evidence for this claim is presented through a discussion of the efficiency gains derived from using automated document classification on the Militarized Interstate Dispute 4 (MID4) data collection project.
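To make the abstract's point about high dimensionality and sparseness concrete, the following is a minimal sketch of a single linear SVM trained on sparse bag-of-words features. It is not the authors' implementation and does not reproduce the two-stage process or the MID4 data: the training documents are invented toy examples, the function names (`featurize`, `train_linear_svm`, `predict`) are hypothetical, and the Pegasos-style stochastic sub-gradient update is swapped in here only because it fits in a few lines of standard-library Python.

```python
import random
from collections import Counter


def featurize(text):
    """Sparse bag-of-words: store only the terms that occur in the document."""
    return Counter(text.lower().split())


def train_linear_svm(docs, labels, lam=0.01, epochs=50, seed=0):
    """Pegasos-style sub-gradient descent on the L2-regularized hinge loss.

    docs   : list of sparse feature dicts (term -> count)
    labels : list of +1 / -1 class labels
    lam    : regularization strength (an illustrative value, not tuned)
    """
    rng = random.Random(seed)
    w = {}  # weight vector, also sparse: only terms seen in training get a key
    t = 0
    order = list(range(len(docs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = docs[i], labels[i]
            margin = y * sum(w.get(k, 0.0) * v for k, v in x.items())
            # Regularization step: shrink every weight by (1 - eta * lam) = 1 - 1/t.
            scale = 1.0 - 1.0 / t
            for k in w:
                w[k] *= scale
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1:
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w


def predict(w, text):
    """Return +1 (relevant) or -1 (irrelevant) from the sign of the linear score."""
    score = sum(w.get(k, 0.0) * v for k, v in featurize(text).items())
    return 1 if score > 0 else -1


# Invented toy corpus: +1 = report of a militarized incident, -1 = unrelated news.
train_texts = [
    "troops crossed border fired artillery",
    "navy seized vessel near border",
    "soldiers fired shots across border",
    "team won championship match",
    "team lost final match",
    "coach praised team after match",
]
train_labels = [1, 1, 1, -1, -1, -1]

weights = train_linear_svm([featurize(t) for t in train_texts], train_labels)
```

Calling `predict(weights, "troops fired near border")` classifies a new document with the learned weights. Because both the documents and the weight vector are stored as dictionaries, the cost of scoring a document depends on the number of distinct terms it contains rather than on the size of the full vocabulary, which is the property that makes linear SVMs practical on sparse text.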

Information

Type: Research Article
Information: Political Analysis, Volume 22, Issue 2, Spring 2014, pp. 224–242

DOI: https://doi.org/10.1093/pan/mpt030
Copyright: Copyright © The Author 2014. Published by Oxford University Press on behalf of the Society for Political Methodology

Footnotes

Authors' note: The authors would like to thank Emre Hatipoglu, Matthew Lane, and Michael Kenwick for their work on the MID4 project. We would also like to thank the editors of Political Analysis and the anonymous reviewers for their insight and constructive comments. Supplementary materials for this article are available on the Political Analysis Web site.

D'Orazio et al. supplementary material

Appendix

PDF 45.5 KB
