Online Appendix: A Dynamic Model of Speech for the Social Sciences

1 Estimation
1.1 Factorization of the Likelihood
1.2 Estimation of Lower-Level Auditory Parameters
1.2.1 E step
1.2.2 M Step
1.3 Unmodeled Autocorrelation
1.4 Estimation of Upper-Level Conversation Parameters
1.5 Bootstrapping

man speech contains more than M modes-the proposed approach can in fact outperform full maximum likelihood. More generally, semi-supervised approaches that exploit both labeled and unlabeled data often underperform those that only use the former (Masanori and Takeuchi, 2014). Intuitively, this is because unsupervised methods rarely recover the analyst's preferred labels, and semi-supervised techniques are typically dominated by the much larger unlabeled dataset.
Finally, we note that even with moderately sized training sets, the number of moments in X T will already be several orders of magnitude larger than the number of parameters, due to the high-frequency nature of audio data, so that Θ is already reasonably well-estimated from the training utterances alone.

Estimation of Lower-Level Auditory Parameters
To estimate the parameters of the M lower-level models, which each represent the auditory characteristics of a particular speech mode, we employ a non-sequential training set of example utterances that are assumed to be drawn from the same distribution as the primary corpus (conditional on mode). In the main text, the audio features of the training set are denoted X T , and the corresponding tone labels are S T . Here, we drop T for convenience and work exclusively within the training set.
Consider the subset with known mode $S_u = m$.[1] This group of utterances is assumed to be drawn from a single shared Gaussian HMM, the speech model for mode $m$. Below, we describe how lower-level parameters are estimated by standard HMM techniques. Interested readers are referred to Zucchini and MacDonald (2009) for further discussion.

We first write down the likelihood function for parameters of the $m$-th mode. For each utterance, at each moment $t$, the feature vector $X_{u,t}$ could have been generated by any of the $K$ sounds associated with emotion $m$, so there are $K^{T_u}$ possible sequences of unobserved sounds by which the entire feature sequence $X_u$ could have been generated. The $u$-th utterance's contribution to the observed-data likelihood is the joint probability of all observed features, found by summing over every possible sequence of sounds. This yields

$$\mathcal{L}^m = \prod_{u=1}^{U} \Pr\left(X_{u,1} = x_{u,1}, \dots, X_{u,T_u} = x_{u,T_u} \mid \mu_{m,k}, \Sigma_{m,k}, \Gamma_m\right)^{\mathbb{1}(S_u = m)} = \prod_{u=1}^{U} \left( \delta_m P_m(x_{u,1}) \left[ \prod_{t=2}^{T_u} \Gamma_m P_m(x_{u,t}) \right] \mathbf{1} \right)^{\mathbb{1}(S_u = m)} \quad (1)$$

where $\mu_m = (\mu_{m,k})_{k \in \{1,\dots,K\}}$, $\Sigma_m = (\Sigma_{m,k})_{k \in \{1,\dots,K\}}$, $\delta_m$ is a $1 \times K$ vector containing the initial distribution of sounds (assumed to be the stationary distribution, a unit row eigenvector of $\Gamma_m$), the matrices $P_m(x_{u,t}) \equiv \mathrm{diag}\left(\phi_D(x_{u,t} \mid \mu_{m,k}, \Sigma_{m,k})\right)$ are $K \times K$ diagonal matrices in which the $(k,k)$-th element is the ($D$-variate Gaussian) probability of $x_{u,t}$ being generated by sound $k$, and $\mathbf{1}$ is a column vector of ones.

[1] In practice, because the perception of certain speech modes can be subjective (human coders may disagree or be uncertain), training set mode labels $S_u$ may be a stochastic vector of length $M$, $S_u = [\Pr(S_u = 1), \dots, \Pr(S_u = M)]$, rather than an $M$-valued categorical variable. In such cases the contribution of an utterance to the model for emotion $m$ may be weighted by the $m$-th entry, e.g. corresponding to the proportion of human coders who classified the utterance as emotion $m$. After replacing $\mathbb{1}(S_u = m)$ with $\Pr(S_u = m)$, the procedure described in this appendix can be used without further modification.
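The matrix-product form of the likelihood can be evaluated without enumerating every sound sequence. The following is a minimal sketch for a single utterance, assuming NumPy and rescaling the running product at each moment to avoid underflow; the function names (`gaussian_pdf`, `stationary_distribution`, `utterance_log_likelihood`) and the toy parameter values are hypothetical and not part of the original procedure.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a D-variate Gaussian evaluated at a single point x."""
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def stationary_distribution(Gamma):
    """Row vector delta with delta @ Gamma = delta and sum(delta) = 1."""
    K = Gamma.shape[0]
    A = np.vstack([Gamma.T - np.eye(K), np.ones(K)])
    b = np.append(np.zeros(K), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def utterance_log_likelihood(X, means, covs, Gamma):
    """log of delta P(x_1) prod_{t>=2} Gamma P(x_t) 1, rescaled at each step."""
    K = Gamma.shape[0]
    phi = stationary_distribution(Gamma)  # running (rescaled) forward vector
    log_lik = 0.0
    for t, x in enumerate(X):
        emis = np.array([gaussian_pdf(x, means[k], covs[k]) for k in range(K)])
        # Multiplying elementwise by emis is the same as right-multiplying by diag P(x_t).
        phi = (phi if t == 0 else phi @ Gamma) * emis
        scale = phi.sum()
        log_lik += np.log(scale)
        phi = phi / scale
    return log_lik

# Toy example (assumed values): K = 2 sounds, D = 2 features, T = 4 moments.
Gamma = np.array([[0.9, 0.1], [0.3, 0.7]])
means = np.array([[0.0, 0.0], [2.0, -1.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
X = np.random.default_rng(0).normal(size=(4, 2))
log_lik = utterance_log_likelihood(X, means, covs, Gamma)
```

The cost is linear in T_u, whereas direct summation over all sound sequences grows as K^{T_u}.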
In practice, due to the high dimensionality of the audio features, we also regularize the Σ terms to ensure invertibility by adding a small positive value (which may be thought of as a prior) to its diagonal. We recommend setting this regularization parameter, along with the number of sounds, by selecting values that maximize the training set's cross-validated naïve probabilities (i.e., based on mode prevalence and emission probabilities, ignoring context). This procedure asymptotically selects the closest approximation, in terms of the Kullback-Leibler divergence, to the true data-generating process among the candidate models considered (van der Laan, Dudoit, and Keles, 2004).
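As a small illustration of the diagonal regularization, the sketch below adds a constant to the diagonal of a rank-deficient sample covariance. The name `ridge` for the regularization parameter and the toy dimensions are assumptions; choosing the value by cross-validation, as recommended above, is not shown.

```python
import numpy as np

def regularize_covariance(S, ridge):
    """Add a small positive constant to the diagonal of a covariance matrix."""
    return S + ridge * np.eye(S.shape[0])

# Rank-deficient example: 3 observations of a 5-dimensional feature vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
S = np.cov(X, rowvar=False)                 # 5 x 5, but rank at most 2
S_reg = regularize_covariance(S, ridge=1e-3)
```

After regularization every eigenvalue is bounded below by roughly `ridge`, so the Gaussian density is well defined.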
The parameters µ m,k , Σ m,k , and Γ m can in principle be found by directly maximizing this likelihood. However, given the vast number of parameters to optimize over, we estimate using the Baum-Welch algorithm for expectation-maximization with hidden Markov models.
In what follows, we describe this procedure as it relates to the estimation of the lower-level audio parameters. Baum-Welch involves maximizing the complete-data likelihood of equation 2, which differs from equation 1 in that it also incorporates the probability of the unobserved sounds.

E step
This procedure relies heavily on the joint probability of (i ) all feature vectors up until time t and (ii ) the sound at t, given in equation 3. These probabilities are efficiently calculated for all t in a single recursive forward pass through the feature vectors.
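$$\alpha_{u,t,k} = f(X_{u,1} = x_{u,1}, \dots, X_{u,t} = x_{u,t}, R_{u,t} = k), \qquad \alpha_{u,t} = [\alpha_{u,t,1}, \dots, \alpha_{u,t,K}] \quad (3)$$

It also relies on the conditional probability of (i) all feature vectors after t given (ii) the sound at t (equation 4). These are similarly calculated by backward recursion through the utterance.

$$\beta_{u,t,k} = f(X_{u,t+1} = x_{u,t+1}, \dots, X_{u,T_u} = x_{u,T_u} \mid R_{u,t} = k), \qquad \beta_{u,t} = [\beta_{u,t,1}, \dots, \beta_{u,t,K}] \quad (4)$$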
The E step involves substituting (i) the unobserved sound labels, 1(R u,t = k), and (ii) the unobserved sound transitions, 1(R u,t = k′, R u,t−1 = k), with their respective expected values, conditional on the observed training features X u and the current estimates of Θ m = (µ m,k , Σ m,k , Γ m ).
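For (i), combining equations 1, 3 and 4 immediately yields the expected sound label

$$\tilde{E}\left[\mathbb{1}(R_{u,t} = k)\right] = \frac{\alpha_{u,t,k}\, \beta_{u,t,k}}{\tilde{\mathcal{L}}^m_u} \quad (5)$$

where the tilde denotes the current approximation based on parameters from the previous M step; $\alpha_{u,t,k}$ and $\beta_{u,t,k}$ are the $k$-th elements of $\alpha_{u,t}$ and $\beta_{u,t}$ respectively; and $\tilde{\mathcal{L}}^m_u$ is the $u$-th training utterance's contribution to $\tilde{\mathcal{L}}^m$.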
For (ii ), after some manipulation, the expected sound transitions can be expressed as
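Both expectations can be computed from one forward and one backward pass. The sketch below uses the standard textbook Baum-Welch expressions with per-moment rescaling, which we assume correspond to equations 5 and 6; the function name `e_step` and the toy inputs are hypothetical. It takes a precomputed T x K matrix of emission densities rather than raw features.

```python
import numpy as np

def e_step(emis, Gamma, delta):
    """Expected sound labels and transitions for one utterance.

    emis[t, k] is the density of x_t under sound k, Gamma is the K x K
    transition matrix, and delta is the initial distribution of sounds.
    """
    T, K = emis.shape
    alpha = np.zeros((T, K))   # rescaled forward probabilities
    beta = np.ones((T, K))     # rescaled backward probabilities
    scale = np.zeros(T)

    alpha[0] = delta * emis[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Gamma) * emis[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    for t in range(T - 2, -1, -1):
        beta[t] = (Gamma @ (emis[t + 1] * beta[t + 1])) / scale[t + 1]

    # Expected labels: alpha * beta / likelihood (the rescaling absorbs the likelihood).
    labels = alpha * beta
    # Expected transitions, indexed [t, k, k'] for a move from k at t to k' at t + 1.
    trans = (alpha[:-1, :, None] * Gamma[None, :, :]
             * (emis[1:] * beta[1:])[:, None, :]) / scale[1:, None, None]
    return labels, trans, np.log(scale).sum()

# Toy inputs (assumed values): T = 6 moments, K = 3 sounds.
rng = np.random.default_rng(1)
emis = rng.uniform(0.1, 1.0, size=(6, 3))
Gamma = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])
delta = np.array([0.5, 0.3, 0.2])
labels, trans, log_lik = e_step(emis, Gamma, delta)
```

A useful sanity check is that the expected transitions marginalize to the expected labels at adjacent moments.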

M Step
After substituting equations 5 and 6 into the complete-data likelihood (equation 2), the M step involves two straightforward calculations. First, the conditional maximum likelihood update of the transition matrix Γ m follows from equation 6:

Second, the optimal update of the k-th sound distribution parameters is found by fitting a Gaussian distribution to the feature vectors, with the weight of the t-th instant being given by the expected value of its k-th label.
where W m,k
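The two M-step calculations can be sketched as follows for a single utterance, assuming the expected labels and transitions have already been computed. The function name `m_step`, the optional `ridge` argument, and the toy values are hypothetical; with several training utterances, the same sums would be pooled across utterances before normalizing.

```python
import numpy as np

def m_step(X, labels, trans, ridge=0.0):
    """Update Gamma, means, and covariances from expected labels and transitions.

    X: (T, D) features; labels: (T, K) expected sound labels;
    trans: (T-1, K, K) expected transitions, indexed [t, from, to].
    """
    T, D = X.shape
    K = labels.shape[1]
    counts = trans.sum(axis=0)                        # expected transition counts
    Gamma = counts / counts.sum(axis=1, keepdims=True)
    means = np.zeros((K, D))
    covs = np.zeros((K, D, D))
    for k in range(K):
        w = labels[:, k] / labels[:, k].sum()         # normalized weights for sound k
        means[k] = w @ X
        centered = X - means[k]
        covs[k] = (w[:, None] * centered).T @ centered + ridge * np.eye(D)
    return Gamma, means, covs

# Toy example with hard (0/1) expected labels, so the updates are easy to verify.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
trans = np.zeros((3, 2, 2))
trans[0, 0, 0] = 1.0   # sound 0 -> sound 0
trans[1, 0, 1] = 1.0   # sound 0 -> sound 1
trans[2, 1, 1] = 1.0   # sound 1 -> sound 1
Gamma, means, covs = m_step(X, labels, trans)
```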

Unmodeled Autocorrelation
If the Gaussian HMM model of speech described in Equations 3-4 were correctly specified, then the tone of any new utterance could be classified with well-calibrated posterior probabilities based on its auditory characteristics (setting aside conversation context) by the simple application of Bayes' rule,

$$\Pr(S_u = m \mid X_u, \Theta) = \frac{f(x_u \mid S_u = m, \Theta)\, \Pr(S_u = m)}{\sum_{m'=1}^{M} f(x_u \mid S_u = m', \Theta)\, \Pr(S_u = m')}.$$

However, this speech model, like all simplified models of complex human behavior, is misspecified, with implications for its resulting predictions. In particular, our model assumes that the auditory features in successive moments are conditionally independent, given their respective sounds. This can be seen by noting that X u,1 and X u,2 are d-separated by R u,1 and R u,2 in Figure 2. In other words, the expected difference in audio between moment t and t + 1 should be no greater than the difference between t and t + 10, as long as a vowel is being spoken.
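The Bayes' rule step is easiest to carry out on the log scale, because utterance likelihoods are typically far too small to represent directly. A minimal sketch, with the hypothetical function name `mode_posterior` and assumed toy values:

```python
import numpy as np

def mode_posterior(log_liks, prevalence):
    """Pr(S_u = m | X_u) from per-mode log likelihoods and mode prevalence."""
    log_joint = np.asarray(log_liks, dtype=float) + np.log(prevalence)
    log_joint = log_joint - log_joint.max()   # stabilize before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()

# Two modes with equal prevalence and log likelihoods that differ by 3.
posterior = mode_posterior([-1000.0, -1003.0], [0.5, 0.5])
```

Because log likelihoods accumulate over all T_u moments, even modest per-moment differences between modes produce posteriors very close to zero or one, which is the miscalibration discussed in this section.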
This assumption makes the model analytically tractable, much as the bag-of-words assumption facilitates text analysis. Like the bag-of-words assumption, it is also clearly violated by actual human behavior. A speaker's vocal tract is physically incapable of changing much in a few milliseconds, but this autocorrelation in features goes unmodeled. Thus, the model mistakenly perceives the information content of an utterance to be T u data points, when in fact it may be much less. The practical implication is that mode probabilities produced by the aforementioned approach will drift toward zero and one, leading to dramatic miscalibration.
To address this issue, we use a corrective factor, $\rho$, which enters as an exponent on the utterance likelihood:

$$\left( \delta_m P_m(x_{u,1}) \left[ \prod_{t=2}^{T_u} \Gamma_m P_m(x_{u,t}) \right] \mathbf{1} \right)^{\rho}$$

This scales back the utterance's contribution to the log likelihood multiplicatively, reducing the utterance's "effective value" to $\rho T_u$. The corrective factor is estimated from out-of-sample data by maximizing the total log corrected probabilities of the correct class.
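One way to estimate the corrective factor is a one-dimensional search over candidate values on held-out utterances. The sketch below assumes a simple grid search and assumes that only the likelihood, not the mode prevalence, is scaled; the function names and toy data are hypothetical.

```python
import numpy as np

def corrected_posterior(log_liks, prevalence, rho):
    """Mode probabilities after multiplying each log likelihood by rho."""
    log_joint = rho * np.asarray(log_liks, dtype=float) + np.log(prevalence)
    log_joint = log_joint - log_joint.max(axis=-1, keepdims=True)
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=-1, keepdims=True)

def estimate_rho(log_liks, labels, prevalence, grid):
    """Pick the rho that maximizes the summed log corrected probability of the true mode."""
    log_liks = np.asarray(log_liks, dtype=float)
    rows = np.arange(len(labels))
    scores = [np.log(corrected_posterior(log_liks, prevalence, r)[rows, labels]).sum()
              for r in grid]
    return grid[int(np.argmax(scores))]

# Held-out toy data: log likelihoods differ by 100 between modes, and one of the
# four utterances is confidently misclassified by the uncorrected model.
log_liks = np.array([[0.0, -100.0], [0.0, -100.0], [-100.0, 0.0], [0.0, -100.0]])
labels = np.array([0, 0, 1, 1])
grid = np.linspace(0.001, 1.0, 1000)
rho_hat = estimate_rho(log_liks, labels, np.array([0.5, 0.5]), grid)
```

In this toy case three of four utterances are classified correctly, so the selected value shrinks the log-likelihood gap until the implied confidence is close to 75%.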

Estimation of Upper-Level Conversation Parameters
We now describe our procedure for estimating the conversation flow parameters by maximizing the observed-data likelihood of Equation 5 with respect to ζ, which amounts to maximizing f (X C | ζ, Θ, W stat.,C ). This is equivalent to estimating both the unobserved S C and parameters ζ by maximizing the expected complete-data log likelihood. (All analysis in this subsection is of the primary corpus, so we drop the C indicator for compactness.) For complete generality, we also introduce a conversation index v ∈ {1, . . . , V }. The number of utterances in conversation v is denoted U v ; metadata, speech modes and audio features for utterance u in conversation v are respectively W v,u , S v,u and X v,u .
First, the complete-data likelihood of the primary corpus is δ v indicates the initial distribution of speech modes for conversation v.
Because the time-varying transition matrix, ∆ v,u , is a multinomial logistic function of conversation context, W v,u -which is itself a potentially complex function of unobserved prior speech modes-deriving the closed-form expectation of the complete-data likelihood is intractable. We therefore replace this expectation with the following blockwise procedure that sweeps through the unobserved variables sequentially.
1. The metadata W v,u depends on conversation history, but the previous mode is unobserved. Therefore, for each utterance, we create a separate metadata vector for each possible prior mode. (This is computationally infeasible for longer-range summaries of conversation history, e.g., aggregate anger expressed over the course of a debate, so we recommend a mean-field approximation for dynamic metadata based on utterances older than u − 1.) This step produces M possible metadata vectors.

2. Each possible metadata vector implies a vector of probabilities for the next utterance. Stack these into a transition matrix, $\tilde{\Delta}$.

3. Compute the expected speech modes and mode transitions using a forward-backward algorithm that is essentially identical to Equations 5 and 6. We find that the use of the corrected emission probabilities, described in Appendix 1.3, is crucial in this step.
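The first two steps of the procedure can be sketched as follows: one candidate metadata vector per possible prior mode, each mapped through a multinomial logit to a row of next-mode probabilities. The layout of the coefficient matrix `zeta`, the reference-category convention, and the toy covariates are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def transition_matrix(W_by_prior, zeta):
    """Stack M rows of next-mode probabilities, one per candidate prior mode.

    W_by_prior[m] is the metadata vector built as if the previous mode were m;
    zeta is an (M, P) coefficient matrix of a multinomial logit.
    """
    return np.vstack([softmax(zeta @ w) for w in W_by_prior])

# Toy example: M = 2 modes; metadata = (intercept, prior mode is 1, static covariate).
zeta = np.array([[0.0, 0.0, 0.0],      # reference mode
                 [-1.0, 2.0, 0.5]])    # second mode
W_by_prior = [np.array([1.0, 0.0, 0.3]),   # as if the previous mode were 0
              np.array([1.0, 1.0, 0.3])]   # as if the previous mode were 1
Delta = transition_matrix(W_by_prior, zeta)
```

Row m of `Delta` holds the probabilities of each next mode given prior mode m, which is the form needed by a forward-backward pass over utterances.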
Again, tildes indicate the best guess for each variable at the current iteration. The maximization step for ζ then reduces to weighted constrained multinomial logistic regression in which all possible transitions are included, weighted by Ẽ[1(S v,u−1 = m, S v,u = m′)]. A constraint on the mode-specific intercepts ensures that the fitted probabilities agree with the known tone proportions; this is implemented by first computing the relaxed update for ζ in each iteration, then imposing the constraint. The estimated initial mode, δ v , follows directly from the expected value of [1(S v,1 = m)]. All in all, the use of this alternative procedure leads to a smaller improvement of the EM objective function than the full (infeasible) E-step would. Nevertheless, algorithms using such partial E- or M-steps ultimately converge to a local maximum, as does traditional expectation-maximization (Neal and Hinton, 1998).

Bootstrapping
Because each bootstrapped speech-mode model's parameters only enter the upper model through how well or poorly they explain a particular utterance's observed auditory features, the upper model is unaffected by likelihood invariance issues such as the label-switching problem. However, to the extent that some bootstrapped model runs are trapped in local modes and do not attain the global optimum, resulting upper-level confidence intervals will be wider (that is, more conservative), reflecting both true uncertainty and the additional random variation in the selected local mode. This pitfall may be addressed by standard optimization procedures such as simulated-annealing EM or running multiple chains.

Table 1 lists the primary features we calculate for each utterance. In addition, we calculate interactions between and derivatives of these primary features.
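Several of the scalar features in Table 1 can be computed directly from the raw samples in a short frame. The sketch below is illustrative only: averaging within the frame and taking the absolute value inside the Teager logarithm are assumptions, and the function name `frame_features` and the test signals are hypothetical. Pitch, formants, and MFCCs require dedicated signal-processing routines and are not shown.

```python
import numpy as np

def frame_features(x):
    """Energy, zero-crossing rate, lag-1 autocorrelation, and Teager energy
    for one short frame of audio samples x (1-D array)."""
    x = np.asarray(x, dtype=float)
    energy = np.log10(np.mean(x ** 2))
    zcr = np.mean(np.signbit(x[1:]) != np.signbit(x[:-1]))
    autocorr = np.corrcoef(x[1:], x[:-1])[0, 1]
    # Teager energy operator x_t^2 - x_{t-1} x_{t+1}, averaged over the frame.
    teager = np.log10(np.mean(np.abs(x[1:-1] ** 2 - x[:-2] * x[2:])))
    return {"energy": energy, "zcr": zcr, "autocorr": autocorr, "teo": teager}

# A slowly varying sinusoid (vowel-like) versus white noise (fricative-like).
n = np.arange(400)
tone = frame_features(np.sin(0.1 * n))
noise = frame_features(np.random.default_rng(0).normal(size=400))
```

The sinusoid shows high autocorrelation and a low zero-crossing rate, and the noise shows the reverse, which mirrors the voiced and unvoiced contrast described elsewhere in this appendix.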

Feature (#) | Description
energy (1) | sound intensity, in decibels: $\log_{10} x_t^2$
ZCR (1) | zero-crossing rate of audio signal
autocorrelation (1) | $\mathrm{Cor}(x_t, x_{t-1})$
TEO (1) | Teager energy operator: $\log_{10}\left(x_t^2 - x_{t-1} x_{t+1}\right)$
F0 (2) | fundamental, or lowest, dominant frequency of speech signal (closely related to perceived pitch; tracked by two algorithms)
formants (6) | harmonic frequencies of speech signal, determined by shape of vocal tract (lowest three formants and their bandwidths)
MFCC (13) | Mel-frequency cepstral coefficients (characterizing the shape of the frequency spectrum, after transforming and binning the spectrum to approximate human perception of sound intensity)

factor, even when only applied to a subset of districts rather than statewide, constituted illegal racial gerrymandering.
In what follows, we consider legal jockeying in oral arguments over a contentious and highly consequential debate: Whether Section 5 of the Voting Rights Act (VRA), prohibiting retrogression in minorities' "ability to elect their preferred candidates," meant that Alabama had to continually maintain or increase the numerical percentage of black voters in black-dominated districts. If so, the state's consideration of race would be "narrowly tailored" to meeting its VRA obligations, and thus legal.[3] We focus in particular on questioning by Justices Breyer and Scalia, who respectively wrote the majority and dissenting opinions, as well as by Justice Kennedy, who cast the pivotal vote.
Panel 1 in Figure 1 presents a condensed transcript of one instance when this issue arose during arguments by a liberal advocate representing the Alabama Democratic Conference.
Early on, Justice Scalia takes the position that the state was legally bound to maintain or increase black percentages. His stance was far from novel, as it had already been discussed extensively in briefs and lower-court decisions available to all justices. But Justice Scalia repeats the point nonetheless, questioning the liberal advocate not only skeptically, but sarcastically, theatrically drawing out his words and even exclaiming "gee." The ploy appears to be effective. Justice Kennedy follows up on the topic, skeptically wondering why it was legal for Democrats to disperse black voters, but not for Republicans to concentrate them: a "one-way ratchet." Sensing a threat, Justice Breyer attempts to smooth things over

[3] The ruling concluded that the Republican legislature "relied heavily upon a mechanically numerical view as to what counts as forbidden retrogression... And the difference between that view and the more purpose-oriented view reflected in the statute's language can matter. Imagine a majority-minority district with a 70% black population... it would seem highly unlikely that... reduc[ing] the percentage of the black population from, say, 70% to 65% would have a significant impact on the black voters' ability to elect their preferred candidate. And, for that reason, it would be difficult to explain just why a plan that uses racial criteria predominately to maintain the black population at 70% is "narrowly tailored" to achieve a "compelling state interest," namely the interest in preventing Section 5 retrogression."

To demonstrate how the MASS is able to do this, in Figure 1 (duplicated here for convenience) we turn to a close examination of two prototypical utterances by Justice Breyer. We first discuss the sounds of which each utterance is composed, along with their auditory profiles. Consider Justice Breyer's skeptical mode of speech, the tone in which he rhetorically exclaims "Now if that's so, they don't have Section 5 to rely on as a defense!" He communicates through a sequence of sounds that, simplistically, we might categorize into "vowel," "consonant," and "silence."[5] In Panel 1, we show that our generative model of skeptical speech mirrors this structure: Vowels (dark red) are sustained for a few moments (horizontally arrayed cells) before Justice Breyer transitions to consonants (light red strikethrough) and eventually pauses in silence (white) between words.[6] One such transition is depicted in Figure 1.D.2. Just as a human can recognize phonemes from their auditory characteristics, our model automatically learns to distinguish vowels (based on their higher autocorrelation, as encoded in µ skeptical,vowel ) from consonants (high zero-crossing rate), as shown in Figure 1.

[5] We note that sound labels, like topic labels in latent Dirichlet allocation text models, are subjective descriptions of component distributions in unsupervised learning models. However, human speech is highly structured. Across a wide range of applications, we consistently find that HMMs recover states that correspond closely to theoretically motivated phoneme groups.

[6] Because each moment describes just milliseconds of audio, glottal stops and short pauses between words are an observable component of speech.
Figure 1: An illustrative example. Panel A contains an excerpt from Alabama Legislative Black Caucus v. Alabama, where Justices Scalia, Kennedy, and Breyer utilize neutral and skeptical tones in questioning. Call-outs highlight successive utterance-pairs in which the speaker shifted from one mode to another (B.3), and continued in the same tone of voice (B.1 and B.2). Panels C.1 and C.2 illustrate the use of loudness (text size) and pitch (contours) in a single utterance: in the neutral mode of speech (C.1), speech varies less in pitch and loudness when compared to skeptical speech (C.2). On the basis of these and other features, MASS learns to categorize sounds into vowels (dark squares), consonants (light), and pauses (white). Call-outs D.1 and D.2 respectively identify sequential moments in which a "neutral" vowel is sustained (transition from the dark blue sound back to itself, indicating repeat) and the dark red "skeptical" vowel transitions to the light red consonant. Panel E shows the differing auditory characteristics of the "skeptical" vowel and consonant, which are perceived by the listener.
Why does this matter? It is on the basis of these constituent sounds that MASS is able to discern differences between rhetorical styles. As Figure 1.A makes clear, MASS contains a parallel "neutral" model for Justice Breyer's speech alongside the "skeptical" model. While neutral speech also uses vowels and consonants, the auditory profiles of these sounds differ dramatically.

Facial Validity of Predicted Skepticism
Before proceeding to more substantial results, we first demonstrate the face validity of MASS predictions in a qualitative examination of machine-generated utterance labels.

Skeptical Speech | Neutral Speech
And that helps women.
Mr. Frederick? It said the rationale is unconscionable.
And because it's a regulated industry, the regulator in your view is doing one of the worst jobs in history. You think the answer to that is clearly no.
And -and the difference between the monitoring and what happened in the past is memories are fallible, computers aren't. Isn't it arguably in part to protect consumers?
Then if the Polynesian boat is permanently in the museum, there's a lot of objective evidence of that, it would not be a vessel. The reason that they want to appeal is they want to win.
What about the -this as I understand it, came up originally as arbitration under the -wasn't it under the collective bargaining contract? But the lower court said it shouldn't be weighed against the State, period.
You're talking about a very narrow range of cases, because I take it your principal position is it -it would be unusual that the defendant needs to be competent in order for the lawyer effectively to represent him on habeas. Well, that's simply because, as we said in Allegheny Pittsburgh, the basis for considering the equal protection claim is the rights that you're given under State law.
I mean, Justice Ginsburg wrote the majority, and she said the reference to regulatory authority of a State, which is a different reference, I agree, should be read to preserve, not preempt traditional prerogative for the State. It did command a majority of the Court, it is authoritative decision, and there are obviously different views among different judges about the extent to which they are the same or not.
If you start talking about significant effect, without those last words, " deregulatory purpose " I suddenly worry about the following: That every city in the United States depends upon towing to regulate parking within the city. And to go back to Justice Sotomayor's question, as long as it's rational in at least some instances directly to pick out those States, at least one or two of them, then doesn't the statute survive a facial challenge?
Suppose a jurisdiction has the policy of requiring every inmate who is arrested and is going to be held in custody to disrobe and take a shower and apply medication for the prevention of the spread of lice and is observed while this is taking place from some distance by a corrections officer, let's say 10 feet away. You made him give it to him. So what's wrong with his saying, you go give it to somebody? Now, if it's too much trouble, the judge can say he can't make you go to a lot of trouble. If it's giving it to somebody who might really do everything he wants, we'll guard against that.
Well, Buckman -Buckman was arguably a little bit different, in that there's a concern expressed in that case that requiring allowing the State suit to go forward would cause manufacturers to basically inundate the agency with proposals and warning revisions, so that there would be so many things that the agency wouldn't even be able to process them, and they would become meaningless to the consumers.
Results from Section 4.1, which suggest that humans such as the reader (presumably) can validate model-predicted skepticism using utterance text, in extreme cases at the least, indicate that the auditory channel carries emotional information that can be detected by MASS. But they also suggest that skepticism is partially conveyed through textual channels as well.
Could tone be extracted directly from the text without the need for complex audio models? To assess whether the auditory channel in fact conveys new information or is merely duplicative, we attempted to predict expressed skepticism using utterance transcripts. For each utterance, word counts were computed after stemming, stopping, and pruning words that appeared in fewer than ten utterances. A cross-validated elastic net was then applied to the utterance-term matrix, producing a maximum accuracy of 59.8%. Moreover, the textual classifier was only able to achieve this accuracy by predicting the dominant class (neutral speech, 59.4% of labeled utterances) for virtually every observation. Additional measures of classification performance, including for within-speaker classification, are reported in Appendix 4.4.
Next, to rule out the possibility that the roughly 1,600 hand-labeled utterances were too small a training corpus, we analyze the full corpus. To do so, we treat MASS fitted probabilities of skepticism (based on audio features and conversation context) as the outcome.
We then employ a post-LASSO procedure in which a cross-validated LASSO-logistic model is estimated, then an unregularized logistic regression is fit on the selected terms (Belloni, Chernozhukov, and Wei, 2016).
The resulting coefficient estimates, plotted in Figure 2, demonstrate that there are extraordinarily few consistent textual indicators of expressed skepticism: the vast majority are statistically indistinguishable from zero at conventional levels. In Figure 3, we arbitrarily discard speaker-terms with p-values exceeding 0.05, then investigate the remainder more closely.
For Justice Stephen Breyer, an expressive orator who is by far the most frequently speaking justice, fewer than 50 such terms exist. For illustrative purposes, we focus on Breyer's "broad," "indeed," and "marry," the three terms most heavily associated with his predicted skepticism. While these terms are not obviously associated with negative sentiments, a closer examination sheds light on Breyer's usage in his freewheeling and at times theatrical questioning:

Washington D.C. and they happen to move to New York, you are saying that New York doesn't have to recognize that marriage because it doesn't comport with the marriage of New York; is that your point?"

Conversely, Justice Breyer's neutral-leaning terms include technical terms ("prosecutor," "tort," and "argument") as well as the fairly innocuous ("thought" and "imagine"). While this particular justice's textual cues are plausible, however, his colleagues are far more difficult to read using word frequencies alone, perhaps because they signal their position in subtle ways, or perhaps because text is just a poor indicator of expressed emotion. For all other justices, we identify fewer than ten informative words through this procedure; moreover, their cumulative predictive power is virtually nonexistent.

Auditory Characteristics of Expressed Skepticism
The preceding results show that the textual channel is, at best, a noisy, idiosyncratic, or simply weak signal of a justice's expressed skepticism. What, then, distinguishes skeptical questioning from neutral speech? To demonstrate, we interpret MASS results by investigating the auditory characteristics of median justice Anthony Kennedy's speech. For Kennedy, we found that a moderately regularized speech model with K = 3 latent sounds maximized the total cross-validated likelihood of out-of-sample auditory features. Three well-separated sound classes can be consistently observed across model runs. We subjectively characterize these as "voiced speech" such as vowels, in which the vocal cords vibrate (high autocorrelation); "unvoiced speech," such as fricatives and sibilants, in which vocal cords are not used (moderate energy and zero-crossing rate); and "silence" (low energy). Using an alignment procedure described below, we identify the three sounds in each bootstrapped model. For illustrative purposes, we compare the auditory characteristics of voiced skeptical speech to voiced neutral speech. The top panel of Figure 4 shows that when speaking skeptically, Kennedy speaks more loudly and with higher average pitch, a consequence of tensed vocal cords. Moreover, his modulation of pitch, which rises during questions and falls sharply during emphatic statements, is markedly larger in skeptical speech, as indicated by its higher pitch variance. We do not, however, observe similar modulation in energy: Kennedy is simply louder across the board when expressing skepticism. Finally, in the bottom panel, we contrast Justices Kennedy and Sotomayor to demonstrate that these speech dynamics are not entirely unique to individual speakers. While speaker baselines do vary (Sotomayor speaks more softly on average, and her voice is roughly six semitones higher), both communicate their skepticism by elevating pitch and raising their voices, among other auditory cues.
Figure 4: Auditory characteristics of neutral and skeptical speech. In the top panel, each dark red × (light blue •) represents a converged EM run for auditory parameters using a run-specific bootstrap draw of skeptical (neutral) training utterances for Justice Kennedy. Coordinates in a bivariate scatterplot are based on elements of µ skeptical, voiced (µ neutral, voiced ) and the diagonal of Σ skeptical, voiced (Σ neutral, voiced ). For example, the top right panel demonstrates that when speaking skeptically, Justice Kennedy's voice is markedly louder and exhibits more variation in pitch, relative to his neutral speech. The bottom panel compares the same parameters for Justice Sotomayor's skeptical (neutral) voiced speech, depicted with dark red (light blue) markers. While her voice is generally higher and quieter, on average, Sotomayor also communicates skepticism by elevating her pitch and speaking more loudly.
We now describe the technical details of the sound alignment procedure employed above.
To identify sounds that consistently recur across the M speech modes and B trained bootstrap models, we employ an ad-hoc but effective alignment approach consisting of the following steps. First, we take the MBK separate µ vectors, each representing the estimated average value of a sound for a particular bootstrap training set, and cluster these values using the k-means algorithm. The result of this procedure is MK distinct reference points in audio-feature space, which in the main-text example corresponded to the subjective categories "voiced speech/vowel", "unvoiced speech/consonant", and "silence." In each of the MB trained models, we then determine the optimal one-to-one assignment of the K (unlabeled) sounds to the K reference categories such that the cumulative Mahalanobis distance of each sound to its assigned reference point is minimized.
This procedure produces an approximation to the far more difficult task of assigning each sound to a category while minimizing the total within-category Mahalanobis distances under the constraint of no duplicate assignments. The latter task involves optimizing over K M B permutations, whereas the former consists of only M B separate K-to-K matching problems using the procedure of Hansen and Klopfer (2006).
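The per-model matching step can be sketched as follows. This is a brute-force illustration over all K! assignments, which is feasible for small K; it is not the Hansen and Klopfer (2006) procedure cited above. The use of a single shared covariance matrix for the Mahalanobis distance, the function name `align_sounds`, and the toy reference points are assumptions.

```python
import itertools
import numpy as np

def align_sounds(means, references, cov):
    """Return perm such that sound k of a fitted model is assigned to
    reference category perm[k], minimizing total Mahalanobis distance."""
    K = len(references)
    prec = np.linalg.inv(cov)
    diff = means[:, None, :] - references[None, :, :]          # (K, K, D)
    dist = np.sqrt(np.einsum("ijd,de,ije->ij", diff, prec, diff))
    best = min(itertools.permutations(range(K)),
               key=lambda p: sum(dist[k, p[k]] for k in range(K)))
    return np.array(best)

# Toy example: three reference points and one fitted model whose sounds
# are a shuffled, slightly perturbed copy of them.
references = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
shuffle = np.array([2, 0, 1])
noise = np.random.default_rng(0).normal(scale=0.1, size=(3, 2))
means = references[shuffle] + noise
perm = align_sounds(means, references, cov=np.eye(2))
```

For larger K, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the same distance matrix would avoid enumerating permutations.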

Audio, Text, and Human Classification Performance
To validate the out-of-sample performance of the model, we treat the lower-level HMMs as auditory classifiers. (True out-of-sample performance of the full model is difficult to evaluate, because of dependencies introduced when modeling context and conversation flow.) As in the full model, bootstrap aggregation (bagging) is used to improve stability. Out-of-bag (OOB, see e.g. Hastie, Tibshirani, and Friedman, 2001, 15.3.1) performance is computed as follows. First, for labeled utterance u, we take all of the speaker's bootstrap speech models in which the utterance was out-of-bag (i.e., the roughly 1/e of bootstrap resamples in which u was not drawn). For each bootstrap draw, the likelihood of utterance u is computed under the trained neutral and skeptical models, then converted to predicted tone probabilities of u. Predicted probabilities are then averaged over models. Results reflect the performance of a classifier that uses 1 − 1/e ≈ 63% of the full training set. Across all speakers, we find that 68% of utterances are correctly classified (F 1 = 0.554). Speaker-specific results and other measures of performance are reported in Figure 5, along with measures of text classifier performance discussed in Appendix 4.2.
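The out-of-bag averaging can be sketched as follows. The `predict` callback stands in for fitting the neutral and skeptical speech models on a bootstrap draw and scoring the left-out utterances; that callback, the function name `oob_predictions`, and the toy sizes are assumptions rather than the authors' implementation.

```python
import numpy as np

def oob_predictions(n, B, predict, rng):
    """Average predictions over the bootstrap draws in which each unit is out-of-bag.

    predict(train_idx, test_idx) returns predicted probabilities for test_idx
    from a model fit on the resampled units train_idx.
    """
    total = np.zeros(n)
    count = np.zeros(n)
    for _ in range(B):
        train = rng.integers(0, n, size=n)          # resample with replacement
        oob = np.setdiff1d(np.arange(n), train)     # units never drawn this round
        if oob.size:
            total[oob] += predict(train, oob)
            count[oob] += 1
    return total / np.maximum(count, 1), count / B

# Toy check with a placeholder "model" that always predicts 0.7.
rng = np.random.default_rng(0)
avg, oob_share = oob_predictions(
    n=200, B=200, predict=lambda tr, te: np.full(len(te), 0.7), rng=rng)
```

Here `oob_share` records how often each utterance was left out, which should be close to 1/e on average.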
To assess the difficulty of the task, we contrasted the performance of supervised audio and text classifiers with that of non-expert human coders. A total of 40 native English speakers were recruited on a crowdworking site and assigned to one of eight justices (five coders per justice). Coders listened to all training utterances for their assigned justice, attempting to recover ground-truth labels. Figure 5 reports results from this evaluation in two ways. First, non-expert predictions were aggregated by majority vote, producing a set of committee predictions that were 70% correct, on average. We then disaggregated non-expert coders and found that individuals were able to recover the ground-truth label in

when a justice makes no utterances toward a particular side. This procedure is repeated for "extremely unpleasant" words, or words in the bottom "pleasantness" decile of DAL, to form the second textual measure. The most common pleasant and unpleasant words in Supreme Court questioning, defined in this way, are reported in Table 2. Key divergences from Black et al. (2011) are that (1) we use the most recent DAL (Whissell, 2009), rather than the original (Whissell, 1989), and (2)

The voting outcome is regressed on the three directed affect covariates defined above; our expectations are that the directed pleasantness textual measure will correlate positively with the voting outcome, whereas the directed unpleasantness textual measure and the directed skepticism auditory measure will correlate negatively. Figure 6 (duplicated below for ease of reference) reports coefficients on directed-affect covariates from three linear probability model specifications: (1) a "baseline" with no controls; (2) justice fixed effects, which absorb general liberal or conservative leanings; and (3) justice fixed effects and case fixed effects, which additionally absorb deficiencies in one side's legal arguments. We regard (3) as a particularly stringent test.
All results are reported with standard errors clustered on case.
Across all specifications, we consistently find that the "pleasantness" textual measure is not significantly correlated with voting, thus replicating one result from Black et al. (2011).
We also replicate their finding that the "unpleasantness" textual measure is negatively associated with voting, as expected, although it loses statistical significance when including case fixed effects. However, directed skepticism, as measured in the audio, is a far stronger predictor of voting patterns: a one-standard-deviation increase in this measure is associated with a change in voting that is consistently three times larger than the corresponding increase for unpleasantness, and this finding is robust across all specifications considered.
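To make the estimation strategy concrete, the following Python sketch fits a linear probability model with a single regressor and computes a standard error clustered on a grouping variable such as case. It is a simplified illustration under stated assumptions, not the specification behind Figure 6: it omits fixed effects and any finite-sample correction, and the function name `lpm_slope_clustered` is hypothetical.

```python
import math
from collections import defaultdict


def lpm_slope_clustered(x, y, cluster):
    """OLS slope of y on x (with intercept) and a cluster-robust standard error.

    Assumptions of this sketch: one regressor, no fixed effects, and no
    finite-sample correction to the sandwich variance.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Sum the score (demeaned x times residual) within each cluster
    # before squaring, so within-cluster correlation is accounted for.
    score = defaultdict(float)
    for d, u, g in zip(x_dev, resid, cluster):
        score[g] += d * u
    variance = sum(s * s for s in score.values()) / (sxx * sxx)
    return slope, math.sqrt(variance)
```

Standardizing `x` to unit variance before calling the function would put the slope on the one-standard-deviation scale used in the comparison above. Passing a distinct cluster label for every observation reduces the estimator to the usual heteroskedasticity-robust standard error.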
Figure 6: Predicting justice votes with directed skepticism and directed affective language. Points and horizontal error bars represent point estimates and 95% confidence intervals from regressions of justice votes on directed pleasant words, directed unpleasant words, and our audio-based directed skepticism. Red circles correspond to a specification with no additional controls; blue triangles report results from a specification with speaker fixed effects; and black squares are from a specification with speaker and case fixed effects.

Predicted Skepticism by Justice, Issue, and Target
In this section, we present exploratory analyses of how each justice differentially expresses skepticism depending on the legal issue and the target side's ideology. Table 3 presents issue areas in which justices appear to be strongly ideological, based on patterns of directed skepticism.
We first compute average skepticism within groups of utterances, using predicted values obtained from the dynamic specification described in Section . Groups are defined by unique combinations of justice, issue area, and ideology of the target side. The latter measures are based on Supreme Court Database classifications (Spaeth et al., 2014). (Note that the specification does not include issue area or justice-issue interactions. We regard these results as suggestive, and scholars interested in issue-specific speech patterns are encouraged to model this behavior explicitly to avoid inadvertently attenuating estimates.) Within each justice-issue, we then compare the average level of skepticism in utterances directed toward the liberal and conservative sides; results are reported for justice-issues with a substantial difference. In line with their known ideological predispositions, Justices Breyer and Ginsburg consistently express greater skepticism toward the conservative side, and Justice Scalia expresses greater skepticism toward the liberal side. However, the issue areas in which we observe strong ideological disparities vary by justice and appear to track the intensity of justice preferences. For example, Scalia holds strong views on the right to free speech, and this position manifests in an eight-percentage-point higher use of skepticism toward liberal advocates on First Amendment cases, relative to conservative advocates.
Similarly, Justice Ginsburg is seen as a strong defender of civil rights, and uses five percentage points more skepticism toward conservative advocates on this issue. (Table 3 also highlights major differences in justice baselines; the notoriously stone-faced Justice Ginsburg uses relatively little discernible skepticism in general, so that this gap represents a two-thirds increase.) Finally, the table shows that Kennedy, consistent with his position as median voter, can be more skeptical of either the conservative or liberal side, depending on issue area.
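The group comparison described in this section amounts to averaging predicted skepticism within each justice, issue, and target-ideology cell and then differencing the two sides. The Python sketch below illustrates that computation under assumed inputs: the record layout and the function name `skepticism_gaps` are hypothetical, and the sketch does not apply the "substantial difference" filter used to select entries for Table 3.

```python
from collections import defaultdict


def skepticism_gaps(utterances):
    """Conservative-minus-liberal gap in mean predicted skepticism per justice-issue.

    `utterances` is an iterable of hypothetical records
    (justice, issue, target_ideology, predicted_skepticism), where
    target_ideology is "liberal" or "conservative".
    """
    totals = defaultdict(lambda: [0.0, 0])  # running sum and count per cell
    for justice, issue, target, p in utterances:
        cell = totals[(justice, issue, target)]
        cell[0] += p
        cell[1] += 1

    means = {key: s / n for key, (s, n) in totals.items()}
    gaps = {}
    for (justice, issue, target), mean in means.items():
        if target != "conservative":
            continue
        liberal = means.get((justice, issue, "liberal"))
        if liberal is not None:  # a gap requires utterances toward both sides
            gaps[(justice, issue)] = mean - liberal
    return gaps
```

A positive gap indicates more skepticism toward the conservative side and a negative gap indicates more skepticism toward the liberal side. Justice-issues with utterances toward only one side are dropped rather than imputed.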

communication R Package
In this section, we briefly describe our accompanying R package, communication (Duarte et al., 2020). Because the package is continually maintained and continues to be extended, researchers interested in conducting analyses with MASS will be best served by the latest package documentation. Here, we note the high-level features of our accompanying package and describe innovations over existing software.

First and most importantly, communication includes an efficient C++ implementation of MASS, the model that is the primary focus of this paper. To our knowledge, there is no other structural model available for the analysis of speech dynamics.
Second, communication implements a number of preprocessing steps that, while not the focus of this paper, are critical for any applied research using speech data. Among many other utilities, these include input/output functions compatible with common file formats; fast extraction of auditory features that are generally understood to distinguish abstract categories of human communication; objects for corpus and metadata management; and functions for segmentation and human labeling of utterances. Notably, these tools are made available for the first time in R, which is increasingly the lingua franca of computational social scientists.