
Clause Analysis: Using Syntactic Information to Automatically Extract Source, Subject, and Predicate from Texts with an Application to the 2008–2009 Gaza War

  • Wouter van Atteveldt (a1), Tamir Sheafer (a2), Shaul R. Shenhav (a3) and Yair Fogel-Dror (a3)


This article presents a new method and open source R package that uses syntactic information to automatically extract source–subject–predicate clauses. This improves on frequency-based text analysis methods by dividing text into predicates with an identified subject and optional source, extracting the statements and actions of (political) actors as mentioned in the text. The content of these predicates can be analyzed using existing frequency-based methods, allowing for the analysis of actions, issue positions and framing by different actors within a single text. We show that a small set of syntactic patterns can extract clauses and identify quotes with good accuracy, significantly outperforming a baseline system based on word order. Taking the 2008–2009 Gaza war as an example, we further show how corpus comparison and semantic network analysis applied to the results of the clause analysis can show differences in citation and framing patterns between U.S. and English-language Chinese coverage of this war.
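To make the idea of pattern-based clause extraction concrete, here is a minimal sketch in Python. It is not the authors' R package and does not reproduce their rule set. It assumes a hand-annotated dependency parse with Stanford-style relation labels (`nsubj`, `ccomp`, `mark`, `dobj`), a hypothetical list of speech verbs, and hypothetical names such as `Token`, `Clause`, and `extract_clause`. It applies one quote pattern (speech verb with a subject and a clausal complement) and one plain subject–predicate pattern.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical list of speech-verb lemmas; an assumption for this sketch only.
SPEECH_VERBS = {"say", "claim", "state", "announce"}


@dataclass
class Token:
    idx: int    # position in the sentence
    word: str   # surface form
    lemma: str  # lemma, used to match speech verbs
    head: int   # idx of the parent token, -1 for the root
    rel: str    # dependency relation to the parent


@dataclass
class Clause:
    source: Optional[str]  # who is quoted, if anyone
    subject: str
    predicate: str


def children(tokens: List[Token], head: int, rel: Optional[str] = None) -> List[Token]:
    """Direct dependents of `head`, optionally filtered by relation."""
    return [t for t in tokens if t.head == head and (rel is None or t.rel == rel)]


def subtree(tokens: List[Token], root_idx: int, exclude=frozenset()) -> List[int]:
    """Indices of `root_idx` and its descendants, skipping excluded branches."""
    found, stack = [], [root_idx]
    while stack:
        i = stack.pop()
        if i in exclude:
            continue
        found.append(i)
        stack.extend(t.idx for t in tokens if t.head == i)
    return sorted(found)


def text(tokens: List[Token], indices: List[int]) -> str:
    by_idx = {t.idx: t for t in tokens}
    return " ".join(by_idx[i].word for i in indices)


def extract_clause(tokens: List[Token]) -> Optional[Clause]:
    root = next(t for t in tokens if t.head == -1)
    source, verb = None, root
    # Quote pattern: a speech verb with a subject (source) and a clausal complement.
    if root.lemma in SPEECH_VERBS:
        speakers = children(tokens, root.idx, "nsubj")
        contents = children(tokens, root.idx, "ccomp")
        if speakers and contents:
            source = text(tokens, subtree(tokens, speakers[0].idx))
            verb = contents[0]
    # Subject-predicate pattern: the subject is the verb's nsubj, the predicate the rest.
    subjects = children(tokens, verb.idx, "nsubj")
    if not subjects:
        return None
    subj_idx = subtree(tokens, subjects[0].idx)
    skip = set(subj_idx) | {t.idx for t in children(tokens, verb.idx, "mark")}
    pred_idx = subtree(tokens, verb.idx, exclude=skip)
    return Clause(source, text(tokens, subj_idx), text(tokens, pred_idx))


# Hand-annotated parse of "The spokesman said that the army crossed the border".
EXAMPLE = [
    Token(0, "The", "the", 1, "det"),
    Token(1, "spokesman", "spokesman", 2, "nsubj"),
    Token(2, "said", "say", -1, "root"),
    Token(3, "that", "that", 6, "mark"),
    Token(4, "the", "the", 5, "det"),
    Token(5, "army", "army", 6, "nsubj"),
    Token(6, "crossed", "cross", 2, "ccomp"),
    Token(7, "the", "the", 8, "det"),
    Token(8, "border", "border", 6, "dobj"),
]

if __name__ == "__main__":
    print(extract_clause(EXAMPLE))
```

On the example sentence, the sketch treats "The spokesman" as the source, "the army" as the subject, and "crossed the border" as the predicate. The predicate text could then be passed to a frequency-based method, as the abstract describes. A real system would take the parse from a dependency parser rather than from hand annotation.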



Authors’ note: The research was partly supported by the Israel Science Foundation and the Ministry of Science, Technology, and Space, Israel. The data and R scripts for replicating the validation and substantive analyses are published in the Harvard Dataverse (Van Atteveldt, Sheafer, Shenhav, and Fogel-Dror 2016).

Contributing Editor: R. Michael Alvarez



Baker, C., Fillmore, C., and Cronin, B. 2003. The structure of the FrameNet database. International Journal of Lexicography 16(3):281–296.
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.
Carreras, X., and Màrquez, L. 2005. Introduction to the CoNLL-2005 shared task: semantic role labeling. In Proceedings of the ninth conference on computational natural language learning. Stroudsburg, PA: Association for Computational Linguistics, pp. 152–164.
Chen, D., Schneider, N., Das, D., and Smith, N. A. 2010. SEMAFOR: frame argument resolution with log-linear models. In Proceedings of the 5th international workshop on semantic evaluation. Stroudsburg, PA: Association for Computational Linguistics, pp. 264–267.
Collingwood, L., and Wilkerson, J. 2012. Tradeoffs in accuracy and efficiency in supervised learning methods. Journal of Information Technology & Politics 9(3):298–318.
De Marneffe, M., MacCartney, B., and Manning, C. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, vol. 6, pp. 449–454.
D’Orazio, V., Landis, S. T., Palmer, G., and Schrodt, P. 2014. Separating the wheat from the chaff: applications of automated document classification using support vector machines. Political Analysis 22(2):224–242.
Entman, R. M. 2008. Theorizing mediated public diplomacy: The U.S. case. International Journal of Press/Politics 13:87–102.
Fellbaum, C., ed. 1998. WordNet: an electronic lexical database. Cambridge, MA: MIT Press.
Fogel-Dror, Y., Sheafer, T., Shenhav, S. R., and Van Atteveldt, W. 2015. Real-time sentiment analysis in the context of a political conflict. In Annual meeting of the American Political Science Association, San Francisco, CA.
Grimmer, J. 2010. A Bayesian hierarchical topic model for political texts: measuring expressed agendas in Senate press releases. Political Analysis 18(1):1–35.
Grimmer, J., and Stewart, B. M. 2013. Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21:267–297.
Grossman, D. A., and Frieder, O. 2012. Information retrieval: algorithms and heuristics, vol. 15. Springer Science & Business Media.
Hillard, D., Purpura, S., and Wilkerson, J. 2008. Computer assisted topic classification for mixed methods social science research. Journal of Information Technology and Politics 4(4):31–64.
Kellstedt, P. M. 2003. The mass media and the dynamics of American racial attitudes. New York: Cambridge University Press.
Laver, M., Benoit, K., and Garry, J. 2003. Extracting policy positions from political texts using words as data. American Political Science Review 97(2):311–331.
Lowe, W., and Benoit, K. 2013. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis 21(3):298–313.
Miller, G. 1995. WordNet: a lexical database for English. New York: ACM Press.
Monroe, B. L., Colaresi, M. P., and Quinn, K. M. 2008. Fightin’ words: lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4):372–403.
Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., and Radev, D. R. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1):209–228.
Roberts, M. E. 2015. Introduction to the virtual issue: recent innovations in text analysis for social science. Political Analysis 23:254–277.
Ruigrok, N., and Van Atteveldt, W. 2007. Global angling with a local angle: How U.S., British, and Dutch newspapers frame global and local terrorist attacks. The Harvard International Journal of Press/Politics 12:68–90.
Schrodt, P. A. 2014. TABARI: textual analysis by augmented replacement instructions, version 0.8.4b3.
Schrodt, P. A., and Gerner, D. J. 1994. Validity assessment of a machine-coded event data set for the Middle East, 1982–1992. American Journal of Political Science 38(3):825–854.
Schrodt, P. A., and Gerner, D. J. 2000. Cluster-based early warning indicators for political change in the contemporary Levant. American Political Science Review 94(4):803–818.
Schrodt, P. A., Gerner, D. J., and Yilmaz, O. 2005. Using event data to monitor contemporary conflict in the Israel-Palestine dyad. International Studies Perspectives 6(2):235–251.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1–47.
Sheafer, T., and Shenhav, S. R. 2010. Mediated public diplomacy in a new era of warfare. The Communication Review 12:272–283.
Sheafer, T., Shenhav, S. R., Takens, J., and Van Atteveldt, W. 2014. Relative political and value proximity in mediated public diplomacy: the effect of state-level homophily on international frame building. Political Communication 31(1):149–167.
Slapin, J. B., and Proksch, S.-O. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52(3):705–722.
Stone, P. J., Dunphy, D. C., Smith, M. S., Ogilvie, D. M., et al. 1966. The General Inquirer: a computer approach to content analysis. Cambridge, MA: MIT Press.
Van Atteveldt, W. 2008. Semantic network analysis: techniques for extracting, representing, and querying media content (dissertation). Charleston, SC: BookSurge.
Van Atteveldt, W. 2013. News media: platform or power broker? A study of political quotes in newspaper content using syntactic analysis. Presented at the New Directions in Analyzing Text as Data workshop, LSE, 27–28 September.
Van Atteveldt, W., Kleinnijenhuis, J., and Ruigrok, N. 2008. Parsing, semantic networks, and political authority: using syntactic analysis to extract semantic relations from Dutch newspaper articles. Political Analysis 16(4):428–446.
Van Atteveldt, W., Sheafer, T., Shenhav, S. R., and Fogel-Dror, Y. 2016. Replication data for: clause analysis: using syntactic information to automatically extract source, subject, and predicate from texts with an application to the 2008–2009 Gaza War. doi:10.7910/DVN/DZZXAD, Harvard Dataverse, V1 [UNF:6:IdSlgh3RYlPHO1Hq0pCahQ==].
Young, L., and Soroka, S. 2012. Affective news: the automated coding of sentiment in political texts. Political Communication 29(2):205–231.
