Separating the Wheat from the Chaff: Applications of Automated Document Classification Using Support Vector Machines

Published online by Cambridge University Press:  04 January 2017

Vito D'Orazio*
Affiliation:
Institute for Quantitative Social Science, Harvard University
Steven T. Landis
Affiliation:
National Center for Atmospheric Research, FL2-2096, 3450 Mitchell Lane, Boulder, CO 80301. e-mail: landis.steven@gmail.com
Glenn Palmer
Affiliation:
Department of Political Science, Pennsylvania State University. e-mail: gpalmer@psu.edu
Philip Schrodt
Affiliation:
Parus Analytical Systems, 100 N. Patterson St., State College, PA 16801. e-mail: schrodt735@gmail.com
* e-mail: dorazio@iq.harvard.edu (corresponding author)

Abstract

Due in large part to the proliferation of digitized text, much of it available for little or no cost from the Internet, political science research has experienced a substantial increase in the number of data sets and large-n research initiatives. As the ability to collect detailed information on events of interest expands, so does the need to efficiently sort through the volumes of available information. Automated document classification presents a particularly attractive methodology for accomplishing this task. It is efficient, widely applicable to a variety of data collection efforts, and flexible enough to be tailored to specific research needs. This article reviews the application of automated document classification to data collection in political science research, discussing the process in its entirety. We argue that a two-stage support vector machine (SVM) classification process offers advantages over other well-known alternatives, because SVMs are discriminative classifiers that can effectively address two primary attributes of textual data: high dimensionality and extreme sparseness. Evidence for this claim is presented through a discussion of the efficiency gains derived from using automated document classification on the Militarized Interstate Dispute 4 (MID4) data collection project.

Type
Research Article
Copyright
Copyright © The Author 2014. Published by Oxford University Press on behalf of the Society for Political Methodology 

Footnotes

Authors' note: The authors would like to thank Emre Haptipoglu, Matthew Lane, and Michael Kenwick for their work on the MID4 project. We would also like to thank the editors of Political Analysis and the anonymous reviewers for their insight and constructive comments. Supplementary materials for this article are available on the Political Analysis Web site.
