
Analyzing Political Text at Scale with Online Tensor LDA

Published online by Cambridge University Press:  04 December 2025

Sara Kangaslahti
Affiliation:
Department of Computer Science, Harvard University, USA
Danny Ebanks
Affiliation:
Institute for Quantitative Social Science, NVIDIA Corp., USA
Jean Kossaifi
Affiliation:
Johns Hopkins University, USA
Anqi Liu
Affiliation:
Department of Computer Science, California Institute of Technology, USA
R. Michael Alvarez*
Affiliation:
Division of Humanities and Social Science, California Institute of Technology, USA
Animashree Anandkumar
Affiliation:
Division of Engineering and Applied Sciences
*
Corresponding author: R. Michael Alvarez; Email: rma@caltech.edu

Abstract

This article proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, tensor latent Dirichlet allocation (TLDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds over 3$\times $–4$\times $ those of prior parallelized latent Dirichlet allocation methods) and that it scales linearly to text datasets with over a billion documents; and iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we conduct two new, large-scale, real-world studies of interest to political scientists: the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation, and a detailed study of social media conversations about election fraud in the 2020 presidential election. This method thus gives social scientists the ability to study very large corpora at scale and to answer important, theoretically relevant questions about salient issues in near real time.

Information

Type
Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

Figure 1 Evolution of the most prominent pro- and counter-movement topics in the #MeToo discussion. Note: In each iteration of the dynamic analysis described in Section 7.2, we inspect the topics and manually label them, as well as classify them as pro- or counter-#MeToo. We then display the topic in each category with the highest weight $\alpha _i$ below.


Table 1 Runtime of our TLDA method on GPU for 260 million and 1.04 billion documents using the COVID dataset.


Figure 2 Evolution of the most prominent political topics in the #MeToo discussion. Note: In each iteration of the dynamic analysis detailed in Section 7.2, we inspect the topics, manually label them, and classify them as political or not political. We display the political topic with the highest weight $\alpha _i$ below.


Figure 3 Overview of our approach. Note: As batches of documents arrive incrementally, they are first pre-processed (stemmed, tokenized, and the vocabulary is standardized). We then create a dataset of the counts for each word in each document and compute the average number of times each word appears in a document (the average word occurrence, which is the first moment $M_1$), subtracting $M_1$ from the word-frequency matrix. The resulting document-term matrix is our centered dataset, X (Section 4.1). We then perform a singular value decomposition on the centered data X to recover whitening weights without ever needing to calculate $M_2$ directly, which saves computational overhead while being mathematically equivalent. We then use these whitening weights to transform the centered data X, which can be done incrementally (Section 4.3). Finally, we construct the whitened equivalent of the third-order moment, $M_3$, which is updated directly in this factorized form (Section 4.4). This learned factorization can be directly unwhitened and uncentered to recover the classic solution to TLDA (Section 1) and the topics and their associated word probabilities (Section 4.6).
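The centering and SVD-based whitening steps described in the Figure 3 caption can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the function name `center_and_whiten` and all variable names are assumptions, and the sketch operates on a dense in-memory count matrix rather than the incremental batches the paper's method supports.

```python
import numpy as np

rng = np.random.default_rng(0)

def center_and_whiten(counts, k):
    """Center a document-term count matrix and whiten it to k dimensions.

    counts: (n_docs, vocab) word-count matrix.
    k: number of topics (target dimension of the whitened space).
    Hypothetical sketch of the pipeline in the Figure 3 caption.
    """
    M1 = counts.mean(axis=0)        # first moment: average word occurrence
    X = counts - M1                 # centered dataset X
    # SVD of the scaled, centered data yields whitening weights without
    # ever forming the second moment M2 = X.T @ X / n explicitly.
    U, S, Vt = np.linalg.svd(X / np.sqrt(X.shape[0]), full_matrices=False)
    W = Vt[:k].T / S[:k]            # whitening matrix, shape (vocab, k)
    return X @ W, W                 # whitened data and whitening weights

# Toy data: 500 "documents" over a 30-word vocabulary.
counts = rng.poisson(2.0, size=(500, 30)).astype(float)
Xw, W = center_and_whiten(counts, k=5)
# In the whitened space, the empirical covariance is the identity.
cov = Xw.T @ Xw / Xw.shape[0]
print(np.allclose(cov, np.eye(5)))  # → True
```

The design point this illustrates is the one the caption emphasizes: whitening weights come from an SVD of the centered data itself, so the vocabulary-by-vocabulary matrix $M_2$ never needs to be materialized.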


Table 2 Comparison of topic recovery on synthetic data for various TLDA methods.


Table 3 Comparison of CPU runtime on synthetic data for various TLDA methods.


Figure 4 Tweets per month in the #MeToo data, in millions.


Table 4 TLDA convergence timing comparison on the full #MeToo dataset.


Figure 5 Runtime comparison for TLDA on GPU vs. Gensim for the full #MeToo corpus and varying numbers of topics. Note: This shows that the runtime of our method scales nearly constantly with respect to the number of topics, while Gensim scales more than linearly.


Figure 6 TLDA vs. Gensim fitting time. Note: We compare the time to fit Gensim's LDAMulticore and our online TLDA, not including pre-processing, for 100 topics. We plot the runtime in seconds as a function of the size of the subset from the #MeToo dataset, from 1 million to 7.97 million tweets.


Table 5 Comparison of CPU and GPU runtime on the #MeToo dataset (7.97 million tweets).


Figure 7 Number of tweets per day.


Figure 8 Topical composition over time: In this figure, we report the average share of tweets that belong to one of three main categories of topics with greater than 90% probability.

Supplementary material: File

Kangaslahti et al. supplementary material

Download Kangaslahti et al. supplementary material (File, 1.5 MB)