
Google Books Ngrams and Political Science: Two Validity Tests for a Novel Data Source

  • Sean Richey (a1) and J. Benjamin Taylor (a2)


Google Books Ngrams data are freely available and contain billions of words used in tens of millions of digitized books, which begin in the 1500s for some languages. We explore the benefits and pitfalls of these data by showing examples from comparative and American politics. Specifically, we show how usage of the phrase “political corruption” in Italian, French, German, and Hebrew books strongly correlates with Transparency International’s well-cited Corruption Index for Italy, France, Germany, and Israel. We also use Ngrams to show that the explosive growth in usage of the phrases “Asian American,” “Latino,” and “Hispanic” correlates with real-world changes in these populations after the Immigration and Nationality Act of 1965. These applications show that Ngram data correlate strongly with similar data from well-respected sources. This suggests that Ngrams has content validity and can be used as a proxy measure for previously difficult-to-research phenomena and questions.
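
The validity tests summarized above rest on correlating a yearly phrase-frequency series with an external yearly measure. The following is a minimal sketch of that kind of comparison, not the authors' procedure, which this page does not describe. The function names `align_by_year` and `pearson_r` are hypothetical, and the code assumes each series is supplied as a mapping from year to value; no real Ngram or Transparency International data are included.

```python
from math import sqrt


def align_by_year(ngram_by_year, index_by_year):
    """Keep only the years present in both series, in ascending order.

    Both arguments are assumed to be dicts mapping year -> numeric value
    (a hypothetical input format chosen for this illustration).
    """
    years = sorted(set(ngram_by_year) & set(index_by_year))
    xs = [ngram_by_year[y] for y in years]
    ys = [index_by_year[y] for y in years]
    return xs, ys


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

In use, one would pass the relative frequency of a phrase per year and the external index per year to `align_by_year`, then pass the two aligned lists to `pearson_r`. Aligning on shared years first matters because the two sources rarely cover the same time span.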



Supplementary materials

Richey and Taylor supplementary material: Web Appendix (PDF, 158 KB)
