Clause Analysis: Using Syntactic Information to Automatically Extract Source, Subject, and Predicate from Texts with an Application to the 2008–2009 Gaza War

Wouter van Atteveldt, Tamir Sheafer, Shaul R. Shenhav and Yair Fogel-Dror


This article presents a new method and open-source R package that uses syntactic information to automatically extract source–subject–predicate clauses. This improves on frequency-based text analysis methods by dividing text into predicates with an identified subject and optional source, extracting the statements and actions of (political) actors as mentioned in the text. The content of these predicates can be analyzed using existing frequency-based methods, allowing for the analysis of actions, issue positions and framing by different actors within a single text. We show that a small set of syntactic patterns can extract clauses and identify quotes with good accuracy, significantly outperforming a baseline system based on word order. Taking the 2008–2009 Gaza war as an example, we further show how corpus comparison and semantic network analysis applied to the results of the clause analysis can reveal differences in citation and framing patterns between U.S. and English-language Chinese coverage of this war.


Authors’ note: The research was partly supported by the Israel Science Foundation and the Ministry of Science, Technology, and Space, Israel. The data and R scripts for replicating the validation and substantive analyses are published in the Harvard Dataverse (Van Atteveldt, Sheafer, Shenhav, and Fogel-Dror 2016).

