
A Dynamic Model of Speech for the Social Sciences

Published online by Cambridge University Press:  02 March 2021

DEAN KNOX*
Affiliation:
University of Pennsylvania
CHRISTOPHER LUCAS*
Affiliation:
Washington University in St. Louis
*
Dean Knox, Faculty Fellow of Analytics at Wharton, and Assistant Professor, The Wharton School of the University of Pennsylvania, dcknox@upenn.edu.
Christopher Lucas, Assistant Professor, Department of Political Science, Washington University in St. Louis, christopher.lucas@wustl.edu.

Abstract

Speech and dialogue are the heart of politics: nearly every political institution in the world involves verbal communication. Yet vast literatures on political communication focus almost exclusively on what words were spoken, entirely ignoring how they were delivered—auditory cues that convey emotion, signal positions, and establish reputation. We develop a model that opens this information to principled statistical inquiry: the model of audio and speech structure (MASS). Our approach models political speech as a stochastic process shaped by fixed and time-varying covariates, including the history of the conversation itself. In an application to Supreme Court oral arguments, we demonstrate how vocal tone signals crucial information—skepticism of legal arguments—that is indecipherable to text models. Results show that justices do not use questioning to strategically manipulate their peers but rather engage sincerely with the presented arguments. Our easy-to-use R package, communication, implements the model and many more tools for audio analysis.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press on behalf of the American Political Science Association

Figure 1. Illustration of Selected Audio Features. Note: The left column identifies a class of audio summaries that are used to represent the raw audio data. Subsequent columns contain textual descriptions and graphical depictions of audio sources for which the relevant feature has relatively low and high values. For example, ZCR (zero-crossing rate) has a low value for the vowel /a/ and a high value for the consonant /s/. The ZCR and energy graphs depict pressure waveforms from machine-synthesized speech recordings; louder sounds are larger in amplitude and hissing sounds are higher in ZCR. Spectral graphs represent the Fourier transformation of the synthesized recordings; female voices are concentrated in higher frequency ranges. Pitch is an example of a feature that can be derived from the spectral peaks.
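
The two waveform features named in the note can be illustrated with a minimal Python sketch. It is not taken from the article or from the communication package: the function names are hypothetical, and pure sine waves stand in for the synthesized /a/ and /s/ recordings, which is a strong simplification.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def mean_energy(samples):
    """Average squared amplitude; louder sounds score higher."""
    return sum(x * x for x in samples) / len(samples)


def sine_wave(freq_hz, amplitude, sample_rate=16000, duration_s=0.025):
    """Hypothetical stand-in for a short frame of recorded speech."""
    n = int(sample_rate * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


# Assumed toy inputs: a loud low-frequency tone (vowel-like) and a
# quiet high-frequency tone (hiss-like). Real speech is far richer.
vowel_like = sine_wave(freq_hz=200, amplitude=1.0)
hiss_like = sine_wave(freq_hz=5000, amplitude=0.2)

print("ZCR    vowel-like:", zero_crossing_rate(vowel_like))
print("ZCR    hiss-like: ", zero_crossing_rate(hiss_like))
print("Energy vowel-like:", mean_energy(vowel_like))
print("Energy hiss-like: ", mean_energy(hiss_like))
```

Under these assumptions the hiss-like frame has the higher ZCR and the vowel-like frame the higher energy, matching the low and high columns described in the note.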


Figure 2. Illustration of Generative Model. Note: The directed acyclic graph represents the relationships encoded in Equations 1–4. In utterance $u$, the speaker selects tone $S_u$ based on “static” (i.e., externally given) time-varying covariates $\boldsymbol{W}_u^{\mathrm{stat.}}$ as well as “dynamic” conversational history covariates $\boldsymbol{W}_u^{\mathrm{dyn.}}$. (In the illustration, $\boldsymbol{W}_u^{\mathrm{dyn.}}$ depends only on the prior mode of speech, but more complex dynamic covariates can be constructed.) Based on the selected tone, the speaker composes an utterance by cycling through a sequence of sounds in successive moments, $R_{u,1}, R_{u,2}, \dots$, to form the word “mass.” Each sound generates the audio perceived by a listener according to its unique profile; $\boldsymbol{X}_{u,t}$ is extracted from this audio.
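
The two-level structure described in the note can be made concrete with a toy simulation. Equations 1–4 are not reproduced on this page, so this is a sketch under stated assumptions rather than the authors' specification: a softmax over one static and one dynamic covariate chooses the tone, a tone-specific Markov chain generates the sounds, and each sound emits one Gaussian feature. All coefficients, transition probabilities, and emission parameters are invented for illustration.

```python
import math
import random

TONES = ["neutral", "skeptical"]
SOUNDS = ["vowel", "consonant", "pause"]

# Hypothetical coefficients on [intercept, static covariate, previous tone skeptical].
TONE_COEFS = {"neutral": [0.0, 0.0, 0.0], "skeptical": [-0.5, 1.0, 1.5]}

# Hypothetical tone-specific transition rows over SOUNDS (each row sums to 1).
SOUND_TRANSITIONS = {
    "neutral": {"vowel": [0.6, 0.3, 0.1], "consonant": [0.5, 0.3, 0.2], "pause": [0.4, 0.3, 0.3]},
    "skeptical": {"vowel": [0.5, 0.4, 0.1], "consonant": [0.6, 0.3, 0.1], "pause": [0.5, 0.4, 0.1]},
}

# Hypothetical (mean, sd) of a single audio feature for each tone and sound.
EMISSIONS = {
    ("neutral", "vowel"): (1.0, 0.2), ("neutral", "consonant"): (0.4, 0.2), ("neutral", "pause"): (0.0, 0.1),
    ("skeptical", "vowel"): (1.5, 0.5), ("skeptical", "consonant"): (0.7, 0.4), ("skeptical", "pause"): (0.0, 0.1),
}


def softmax(scores):
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def choose_tone(w_static, prev_tone, rng):
    """Upper level: draw S_u given static and dynamic covariates."""
    w = [1.0, w_static, 1.0 if prev_tone == "skeptical" else 0.0]
    scores = [sum(c * x for c, x in zip(TONE_COEFS[t], w)) for t in TONES]
    return rng.choices(TONES, weights=softmax(scores))[0]


def simulate_utterance(tone, n_moments, rng):
    """Lower level: draw sounds R_{u,t} and one feature X_{u,t} per moment."""
    sound = rng.choice(SOUNDS)
    moments = []
    for _ in range(n_moments):
        mean, sd = EMISSIONS[(tone, sound)]
        moments.append((sound, rng.gauss(mean, sd)))
        sound = rng.choices(SOUNDS, weights=SOUND_TRANSITIONS[tone][sound])[0]
    return moments


def simulate_conversation(static_covariates, n_moments, seed=0):
    rng = random.Random(seed)
    prev_tone = "neutral"  # assumed starting mode of speech
    conversation = []
    for w_static in static_covariates:
        tone = choose_tone(w_static, prev_tone, rng)
        conversation.append((tone, simulate_utterance(tone, n_moments, rng)))
        prev_tone = tone  # the dynamic covariate for the next utterance
    return conversation


for tone, moments in simulate_conversation([0.0, 1.0, 0.0, 1.0], n_moments=5):
    print(tone, [sound for sound, _ in moments])
```

In this sketch the dynamic covariate is only the previous tone, as in the illustration; richer conversational histories would enter through the covariate vector in the same way.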


Table 1. Observed and Unobserved Quantities


Figure 3. An Illustrative Example. Note: Panel A contains excerpts from Alabama Legislative Black Caucus v. Alabama, where Justices Scalia, Kennedy, and Breyer utilize neutral and skeptical tones in questioning. Call-outs highlight successive utterance pairs in which the speaker shifted from one mode to another (B.3), and continued in the same tone of voice (B.1 and B.2). Panels C.1 and C.2 illustrate the use of loudness (text size) and pitch (contours) within utterances: in the neutral mode of speech (C.1), speech varies less in pitch and loudness when compared with skeptical speech (C.2). Based on these and other features, MASS learns to categorize sounds into vowels (dark squares), consonants (light), and pauses (white). Call-outs D.1 and D.2 respectively identify sequential moments in which a “neutral” vowel is sustained (transition from the dark blue sound back to itself, indicating repeat) and the dark red “skeptical” vowel transitions to the light red consonant. Panel E shows the differing auditory characteristics of the “skeptical” vowel and consonant, which are perceived by the listener.


Figure 4. Predicting Justice Votes with Directed Skepticism and Directed Affective Language. Note: Horizontal error bars represent point estimates and 95% confidence intervals from regressions of justice votes on directed pleasant words, directed unpleasant words, and our audio-based directed skepticism. Red circles correspond to a specification with no additional controls; blue triangles report results with speaker fixed effects only, black squares with speaker and case fixed effects.


Figure 5. Simulated Quantities of Interest. Note: Each panel manipulates a single variable from a control value (second element of panel title) to a treatment value (first). Points (error bars) represent changes (95% bootstrap confidence intervals) in predicted skepticism. We average the scenario-specific change in outcome over all other nuisance covariates (e.g., the identity of the next speaker), weighting by the empirical frequencies of these covariates. The top panel shows that justices deploy skepticism more often toward their nonpreferred side. The second panel compares close votes with unanimous decisions, demonstrating that justices express more skepticism in the former. However, justices do not attempt to target a particular side in close votes; rather, they ask more skeptical questions across the board. Finally, the bottom panel shows that justices mirror the tones of their ideological neighbors, who share similar legal philosophies, even when those neighbors are opposed to the justice’s case-specific voting interests.
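
The averaging step described in the note can be sketched as follows. This is an illustrative sketch, not the authors' code: `toy_predict`, its numbers, and the covariate names are hypothetical, and the percentile interval assumes that bootstrap estimates have already been obtained elsewhere (for example, by refitting a model on resampled data), which the sketch does not do.

```python
from collections import Counter


def weighted_average_change(predict, nuisance_values, treatment, control):
    """Average predict(treatment, z) - predict(control, z) over the observed
    nuisance values z, weighting each by its empirical frequency."""
    counts = Counter(nuisance_values)
    total = sum(counts.values())
    return sum(
        (n / total) * (predict(treatment, z) - predict(control, z))
        for z, n in counts.items()
    )


def percentile_interval(draws, level=0.95):
    """Percentile interval from a list of bootstrap estimates."""
    ordered = sorted(draws)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def toy_predict(close_vote, next_speaker):
    """Hypothetical predicted probability of skepticism (invented numbers)."""
    base = {"petitioner": 0.20, "respondent": 0.30}[next_speaker]
    bump = {"petitioner": 0.10, "respondent": 0.04}[next_speaker]
    return base + bump * close_vote


# Assumed observed nuisance covariate: identity of the next speaker.
observed_next_speakers = ["petitioner", "petitioner", "petitioner", "respondent"]

change = weighted_average_change(
    toy_predict, observed_next_speakers, treatment=1, control=0
)
print("Average change in predicted skepticism:", change)
```

Weighting by empirical frequency means that scenarios that occur more often in the data contribute proportionally more to the reported change.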

Supplementary material: PDF

Knox and Lucas supplementary material
