Pretrained text embeddings are a fast and scalable method for determining whether two texts have similar meaning, capturing not only lexical similarity, but semantic similarity as well. In this article, I show how to incorporate these measures into a probabilistic record linkage procedure that yields considerable improvements in both precision and recall over existing methods. The procedure even allows researchers to link datasets across different languages. I validate the approach with a series of political science applications, and provide open-source statistical software for researchers to efficiently implement the proposed method.
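The core linkage idea can be sketched in a few lines: embed each record's text, compute cosine similarities across the two datasets, and accept the best match above a threshold. The toy vectors below stand in for pre-computed embeddings (a real application would obtain them from a pretrained, possibly multilingual, embedding model, and the article's software implements a full probabilistic procedure rather than this hard threshold):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding matrices (rows = records).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy pre-computed embeddings for two record sets.
set_a = np.array([[0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.1]])
set_b = np.array([[0.89, 0.12, 0.01],
                  [0.10, 0.00, 0.95]])

sims = cosine_sim(set_a, set_b)

# Link each record in A to its best match in B if similarity clears a threshold.
threshold = 0.9
links = [(i, int(j)) for i, j in enumerate(sims.argmax(axis=1))
         if sims[i, j] >= threshold]
# Here only record 0 of A finds a sufficiently similar counterpart in B.
```

Because the similarity is computed on embeddings rather than raw strings, near-synonymous descriptions score highly even when they share few tokens.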
Presidents are often viewed as national policy leaders. Yet, they increasingly use negative rhetoric to attack the opposition rather than forge legislative compromise, contrary to theories of going public. Why? I argue presidents facing congressional obstruction eschew short-term policy persuasion. They speak as negative partisans to mobilize co-partisans and shape the longer-term balance of power in Congress, improving future policy-making prospects. I collect all presidential speeches delivered between 1933 and 2024 and use transformer methods to measure how often, and how negatively, presidents reference the out-party. They do so when the policy-making environment is unfavorable: when majorities are tenuous, government is divided, and as elections approach. I provide additional support with a case study of Democrats’ 2009 filibuster-proof Senate majority. Finally, this rhetoric has behavioral impact: presidential negative partisanship decreases co-partisan approval of the opposition. This research alters our understanding of going public and reinforces the partisan dimension of modern presidential representation.
In recent decades, researchers have analyzed professional military education (PME) organizations to understand the characteristics and transformation of the core of military culture, the officer corps. Several historical studies have demonstrated the potential of this approach, but they were limited by both theoretical and methodological hurdles. This paper presents a new historical-institutionalist framework for analyzing officership and PME, integrating computational social science methods for large-scale data collection and analysis in order to overcome both limited access to military environments and the intensive manual labor such research otherwise requires. Furthermore, in an era when direct demographic data are increasingly being removed from the public domain, our indirect estimation methods provide one of the few viable alternatives for tracking institutional change. We demonstrate this approach using web scraping and a quantitative text analysis of the entire repository of theses from an elite American military school.
Explorations of ideology retain special significance in contemporary studies of judicial politics. While some existing methodologies draw on voting patterns and coalition alignments to map a jurist’s latent features, many are otherwise reliant on supplemental proxies – often directly from adjacent actors or via assessments from various prognosticators. We propose an alternative that not only leverages observable judicial behavior, but does so through jurists’ articulations on the law. In particular, we adapt a hierarchical factor model to demonstrate how latent ideological preferences emerge through the written text of opinions. Relying on opinion content from Justices of the Supreme Court, we observe a discernible correlation between linguistic choices and latent expressions of ideology irrespective of known preferences or voting patterns. Testing our method against Martin-Quinn, we find our approach strongly correlates with this validated and commonly used measure of judicial ideology. We conclude by discussing the intuitive power of text as a feature of ideology, as well as how this process can extend to judicial actors and institutions beyond the Supreme Court.
As the use of computational text analysis in the social sciences has increased, topic modeling has emerged as a popular method for identifying latent themes in textual data. Nevertheless, concerns have been raised regarding the validity of the results this method produces, given that it is largely automated and inductive, and scholars have identified the lack of clear guidelines for validating topic models as an area of concern. In response, we conducted a comprehensive systematic review of 789 studies that employ topic modeling. Our goal is to investigate whether the field is moving toward a common framework for validating these models. Our review reveals a notable absence of standardized validation practices and a lack of convergence toward specific validation methods. This gap may be attributed to the inherent incompatibility between the inductive, qualitative approach of topic modeling and the deductive, quantitative tradition that favors standardized validation. To address this, we advocate incorporating qualitative validation approaches, emphasizing transparency and detailed reporting to improve the credibility of findings when topic modeling is used in computational social science research.
We perform the first mapping of the ideological positions of European parties using generative Artificial Intelligence (AI) as a “zero-shot” learner. We ask OpenAI’s Generative Pre-trained Transformer 3.5 (GPT-3.5) to identify the more “right-wing” option across all possible duplets of European parties at a given point in time, solely based on their names and country of origin, and combine this information via a Bradley–Terry model to create an ideological ranking. A cross-validation employing widely-used expert-, manifesto- and poll-based estimates reveals that the ideological scores produced by Large Language Models (LLMs) closely map those obtained through the expert-based evaluation, i.e., CHES. Given the high cost of scaling parties via trained coders, and the scarcity of expert data before the 1990s, our finding that generative AI produces estimates of comparable quality to CHES supports its usage in political science on the grounds of replicability, agility, and affordability.
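The aggregation step can be illustrated with a minimal Bradley–Terry fit using the standard MM (minorization–maximization) updates. The parties and win counts below are invented for illustration; in principle, GPT-3.5's pairwise "more right-wing" judgments would populate the win matrix the same way:

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times item i "beat" item j (here: was judged
    more right-wing). Uses the classic MM update, normalized to sum to 1.
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = total_wins / denom
        p /= p.sum()  # fix the scale (strengths are identified up to a constant)
    return p

# Toy example: party 0 is judged more right-wing than 1 and 2 in most duplets,
# and party 1 more right-wing than party 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
scores = bradley_terry(wins)
ranking = np.argsort(-scores)  # most "right-wing" first
```

The fitted strengths give a full ideological ordering even though each individual prompt only ever compared two parties at a time.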
Oral argument is the most public and visible part of the U.S. Supreme Court’s decision-making process. Yet what if some advocates are treated differently before the Court solely because of aspects of their identity? In this work, we leverage a causal inference framework to quantify the effect of an advocate’s gender on interruptions of advocates at both the Court-level and the justice-level. Examining nearly four decades of U.S. Supreme Court oral argument transcript data, we identify a clear and consistent gender effect that dwarfs other influences on justice interruption behavior, with female advocates interrupted more frequently than male advocates.
What drives changes in the thematic focus of state-linked manipulated media? We study this question in relation to a long-running Iranian state-linked manipulated media campaign that was uncovered by Twitter in 2021. Using a variety of machine learning methods, we uncover and analyze how this manipulation campaign’s topical themes changed in relation to rising Covid-19 cases in Iran. By using the topics of the tweets in a novel way, we find that increases in domestic Covid-19 cases engendered a shift in Iran’s manipulated media focus away from Covid-19 themes and toward international finance- and investment-focused themes. These findings underscore (i) the potential for state-linked manipulated media campaigns to be used for diversionary purposes and (ii) the promise of machine learning methods for detecting such behaviors.
We demonstrate how few-shot prompts to large language models (LLMs) can be effectively applied to a wide range of text-as-data tasks in political science—including sentiment analysis, document scaling, and topic modeling. In a series of pre-registered analyses, this approach outperforms conventional supervised learning methods without the need for extensive data pre-processing or large sets of labeled training data. Performance is comparable to expert and crowd-coding methods at a fraction of the cost. We propose a set of best practices for adapting these models to social science measurement tasks, and develop an open-source software package for researchers.
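A few-shot prompt of the kind described is simply a formatted string of labeled examples followed by the target document. A hypothetical sketch for sentiment coding (the label set, wording, and examples here are invented; the article's actual prompts and best practices may differ):

```python
def build_few_shot_prompt(examples, target, labels=("positive", "negative")):
    """Assemble a few-shot sentiment-classification prompt.

    `examples` is a list of (text, label) pairs shown to the model before
    the unlabeled target statement.
    """
    lines = [f"Classify each statement as {' or '.join(labels)}.", ""]
    for text, label in examples:
        lines.append(f"Statement: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Statement: {target}")
    lines.append("Label:")  # the model completes this line
    return "\n".join(lines)

demo = [("The bill is a triumph for working families.", "positive"),
        ("This reckless policy will bankrupt the state.", "negative")]
prompt = build_few_shot_prompt(demo, "A bold and welcome reform.")
```

The assembled string would then be sent to an LLM API, with the completion after the final "Label:" taken as the predicted code.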
The purpose of this short research note is to draw attention to two major pitfalls of working with databases of decisions of the Court of Justice of the European Union. The first is technical: the Curia and Eur-Lex databases have discrepant coverage. The second is linguistic: most scholars using these databases work in English. New work on this front can address the first issue, but addressing the second would require a change in research practices.
Primaries might also contribute to party transformation by incentivizing candidates to move position within an election cycle. Candidates might face a “strategic positioning dilemma” if they must first satisfy an extreme selectorate to earn the nomination before facing a comparatively moderate general electorate. This chapter therefore tests whether all candidates in a primary adapt their positions away from the center during the nomination phase of a single election cycle, presenting general election voters with polarized choices. To scale positions both during and after a primary it uses a text-as-data approach based on candidates’ communication on Twitter during the 2020 election cycle. It finds that Democratic candidates who lost primaries became significantly more moderate immediately after their defeat, especially if they lost in ideological or factional primaries. It does not observe this pattern among Republican losers. This chapter demonstrates a further way in which primaries may contribute to polarization, incentivizing candidates to adopt positions further from the ideological center during the nomination phase of the election cycle.
The rise of populism concerns many political scientists and practitioners, yet the detection of its underlying language remains fragmentary. This paper aims to provide a reliable, valid, and scalable approach to measure populist rhetoric. For that purpose, we created an annotated dataset based on parliamentary speeches of the German Bundestag (2013–2021). Following the ideational definition of populism, we label moralizing references to “the virtuous people” or “the corrupt elite” as core dimensions of populist language. To identify, in addition, how the thin ideology of populism is “thickened,” we annotate how populist statements are attached to left-wing or right-wing host ideologies. We then train a transformer-based model (PopBERT) as a multilabel classifier to detect and quantify each dimension. A battery of validation checks reveals that the model has a strong predictive accuracy, provides high qualitative face validity, matches party rankings of expert surveys, and detects out-of-sample text snippets correctly. PopBERT enables dynamic analyses of how German-speaking politicians and parties use populist language as a strategic device. Furthermore, the annotator-level data may also be applied in cross-domain applications or to develop related classifiers.
The economic shock of the Covid-19 crisis has disproportionately impacted small businesses and the self-employed. Around the globe, their survival during the pandemic often relied heavily on government assistance. This article explores how the public understands economic relief to business through the lens of deservingness. It examines the case of Germany, where the government responded to the pandemic by implementing an extensive support programme and where, notably, the self-employed are typically outsiders to the state insurance system. Combining computational social science methods and a qualitative analysis, the article focuses on the debate about direct subsidies on the social media platform Twitter/X between March 2020 and June 2021. It traces variation in the patterns of claim-making within a rich debate about pandemic state support, finding that this discourse is characterised by the concern that economic relief threatens to blur existing boundaries of worth in society. The reciprocity principle of deservingness theory is pivotal in asserting business identities in times of crisis, yet it also reveals a fundamentally ambiguous relationship with the principle of need. Additionally, the claim of justice-as-redress, as a novel dimension of reciprocity, surfaces as an important theme in this debate.
Several disciplines, such as economics, law, and political science, emphasize the importance of legislative quality, namely well-written legislation. Low-quality legislation cannot be easily implemented because the texts create interpretation problems. To measure the quality of legal texts, we use information from the syntactic and lexical features of their language and apply these measures to a dataset of European Union legislation that contains detailed information on its transposition and decision-making process. We find that syntactic complexity and vagueness are negatively related to member states’ compliance with legislation. The finding on vagueness is robust to controlling for member states’ preferences, administrative resources, length of texts, and discretion. However, the results for syntactic complexity are less robust.
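Crude versions of such measures are easy to compute: average sentence length as a rough proxy for syntactic complexity, and the share of vague terms for vagueness. The term list and proxies below are illustrative simplifications, not the article's actual operationalization:

```python
import re

# Illustrative list of vague legal terms; a real lexicon would be far larger.
VAGUE_TERMS = {"appropriate", "reasonable", "necessary", "adequate"}

def text_measures(text):
    """Return (avg sentence length, vague-term share) for a legal text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    avg_sentence_length = len(words) / len(sentences)  # complexity proxy
    vagueness = sum(w in VAGUE_TERMS for w in words) / len(words)
    return avg_sentence_length, vagueness

sample = "Member states shall take appropriate measures. The deadline is fixed."
length, vague = text_measures(sample)
```

Longer, more deeply nested sentences and a higher density of open-textured terms would both drive these scores up, which is the direction the article associates with worse transposition compliance.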
Used by politicians, journalists, and citizens, Twitter has been the most important social media platform for investigating political phenomena such as hate speech, polarization, or terrorism for over a decade. A high proportion of Twitter studies of emotionally charged or controversial content cannot be fully replicated, because their Twitter-related replication data are incomplete and their datasets cannot be recrawled in their entirety. This paper shows that these Twitter studies and their findings are considerably affected by nonrandom tweet mortality and data access restrictions imposed by the platform. Sensitive datasets suffer a notably higher removal rate than nonsensitive datasets, and attempting to replicate key findings of Kim’s (2023, Political Science Research and Methods 11, 673–695) influential study on the content of violent tweets leads to significantly different results. These results highlight that access to complete replication data is particularly important in light of dynamically changing social media research conditions. The study therefore raises concerns about, and proposes potential solutions to, the broader implications of nonrandom tweet mortality for future social media research on Twitter and similar platforms.
The influence of congressional primary elections on candidate positioning remains disputed and poorly understood. We test whether candidates communicate artificially “extreme” positions during the nomination phase, as revealed by moderation following a primary defeat. We apply a scaling method based on candidates’ language on Twitter to estimate the positions of 988 candidates in contested 2020 US House of Representatives primaries over time, demonstrating validity against NOMINATE (r > 0.93) where possible. Losing Democratic candidates moderated significantly after their primary defeat, indicating strategic position-taking for perceived electoral benefit, whereby the nomination contest induced artificially “extreme” communication. We find no such effect among Republicans. These findings have implications for candidate strategy in two-stage elections and provide further evidence of elite partisan asymmetry.
A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model’s performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
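One common family of algorithmically selected samples is uncertainty sampling from active learning: hand-label the documents the current model is least sure about rather than a simple random draw. A minimal binary version is shown below; the article's actual selection algorithm may differ, and this sketch assumes a model that already emits class probabilities:

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Pick the k documents whose predicted positive-class probabilities
    are closest to 0.5, i.e., where the model is least certain."""
    uncertainty = -np.abs(probs - 0.5)   # higher value = more uncertain
    return np.argsort(uncertainty)[-k:][::-1]

# Predicted probabilities for five unlabeled documents.
probs = np.array([0.95, 0.51, 0.10, 0.48, 0.80])
picked = uncertainty_sample(probs, 2)  # documents 1 and 3 sit nearest 0.5
```

Labeling these boundary cases tends to move the decision surface more per labeled document than labeling cases the model already classifies confidently, which is the mechanism behind the cost savings described above.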
Supreme audit institutions (SAIs) are touted as an integral component of anticorruption efforts in developing nations. SAIs review governmental budgets and report fiscal discrepancies in publicly available audit reports. These documents contain valuable information on budgetary discrepancies and missing resources, and may even report fraud and corruption. Existing research on anticorruption efforts relies on information published by national-level SAIs while mostly ignoring audits from subnational SAIs, because their information is not published in accessible formats. I collect publicly available audit reports published by a subnational SAI in Mexico, the Auditoria Superior del Estado de Sinaloa, and build a pipeline for extracting the monetary value of discrepancies detected in municipal budgets. I systematically convert scanned documents into machine-readable text using optical character recognition, and I then train a classification model to identify paragraphs with relevant information. From the relevant paragraphs, I extract the monetary values of budgetary discrepancies by developing a named-entity recognizer that automates the identification of this information. In this paper, I explain the steps for building the pipeline and detail the procedures for replicating it in different contexts. The resulting dataset contains the official amounts of discrepancies in municipal budgets for the state of Sinaloa. This information is useful to anticorruption policymakers because it quantifies discrepancies in municipal spending, potentially motivating reforms that mitigate misappropriation. Although I focus on a single state in Mexico, this method can be extended to any context where audit reports are publicly available.
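The extraction step can be approximated with a regex baseline, shown here only to make the pipeline concrete; the article itself trains a named-entity recognizer rather than relying on patterns like this:

```python
import re

# Illustrative pattern for peso amounts such as "$1,234,567.89"
# in OCR-converted audit-report text.
AMOUNT = re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)")

def extract_amounts(paragraph):
    """Return the monetary amounts found in a paragraph, as floats."""
    return [float(m.replace(",", "")) for m in AMOUNT.findall(paragraph)]

text = "Se detectó una diferencia de $1,234,567.89 en el presupuesto municipal."
amounts = extract_amounts(text)
```

A trained recognizer outperforms patterns like this on noisy OCR output, where stray characters, inconsistent formatting, and amounts written partly in words defeat fixed regular expressions.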
Despite the ongoing success of populist parties in many parts of the world, we lack comprehensive information about parties’ level of populism over time. A recent contribution to Political Analysis by Di Cocco and Monechi (DCM) suggests that this research gap can be closed by predicting parties’ populism scores from their election manifestos using supervised machine learning. In this paper, we provide a detailed discussion of the suggested approach. Building on recent debates about the validation of machine-learning models, we argue that the validity checks provided in DCM’s paper are insufficient. We conduct a series of additional validity checks and empirically demonstrate that the approach is not suitable for deriving populism scores from texts. We conclude that measuring populism over time and between countries remains an immense challenge for empirical research. More generally, our paper illustrates the importance of more comprehensive validations of supervised machine-learning models.
Previous research on emotional language relied heavily on off-the-shelf sentiment dictionaries that focus on negative and positive tone. These dictionaries are often tailored to nonpolitical domains and use bag-of-words approaches which come with a series of disadvantages. This paper creates, validates, and compares the performance of (1) a novel emotional dictionary specifically for political text, (2) locally trained word embedding models combined with simple neural network classifiers, and (3) transformer-based models which overcome limitations of the dictionary approach. All tools can measure emotional appeals associated with eight discrete emotions. The different approaches are validated on different sets of crowd-coded sentences. Encouragingly, the results highlight the strengths of novel transformer-based models, which come with easily available pretrained language models. Furthermore, all customized approaches outperform widely used off-the-shelf dictionaries in measuring emotional language in German political discourse.
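The bag-of-words dictionary baseline the article compares against amounts to counting category terms per document; the emotion categories and term lists below are invented for illustration:

```python
# Illustrative mini-dictionaries for two discrete emotions; real emotion
# dictionaries contain hundreds of entries per category.
ANGER = {"outrage", "disgraceful", "betrayal"}
HOPE = {"hope", "future", "together"}

def emotion_counts(tokens):
    """Count dictionary hits per emotion category in a tokenized text."""
    return {"anger": sum(t in ANGER for t in tokens),
            "hope": sum(t in HOPE for t in tokens)}

tokens = "this disgraceful betrayal gives us no hope".split()
scores = emotion_counts(tokens)
```

Because the method ignores word order, "no hope" still registers as a hope term here, which is exactly the kind of context-blindness the embedding- and transformer-based approaches above are designed to overcome.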