
Spectral Clustering with Likelihood Refinement for High-Dimensional Latent Class Recovery

Published online by Cambridge University Press:  18 February 2026

Zhongyuan Lyu
Affiliation:
Columbia University, USA
Yuqi Gu*
Affiliation:
Department of Statistics, Columbia University, USA
*Corresponding author: Yuqi Gu; Email: yuqi.gu@columbia.edu

Abstract

Latent class models (LCMs) are widely used for identifying unobserved subgroups from multivariate categorical data in social sciences, with binary data as a particularly popular example. However, accurately recovering individual latent class memberships remains challenging, especially when handling high-dimensional datasets with many items. This work proposes a novel two-stage algorithm for LCMs suited for high-dimensional binary responses. Our method first initializes latent class assignments by an easy-to-implement spectral clustering algorithm, and then refines these assignments with a one-step likelihood-based update. This approach combines the computational efficiency of spectral clustering with the improved statistical accuracy of likelihood-based estimation. We establish theoretical guarantees showing that this method is minimax-optimal for latent class recovery in the statistical decision theory sense. The method also leads to exact clustering of subjects with high probability under mild conditions. As a byproduct, we propose a computationally efficient consistent estimator for the number of latent classes. Extensive experiments on both simulated data and real data validate our theoretical results and demonstrate our method’s superior performance over alternative methods.
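The two-stage idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function names (`simulate_lcm`, `kmeans`, `spectral_init`, `likelihood_refine`, `misclustering_error`) are hypothetical, the clustering step uses a plain Lloyd k-means on the top-$K$ singular embedding, and the refinement reassigns each subject to the class that maximizes a Bernoulli log-likelihood computed from plug-in item parameter estimates. The settings $N=500$, $J=250$, $K=3$ and the $\text{Beta}(5,5)$ item parameters mirror values quoted in the figure captions.

```python
import itertools
import numpy as np


def simulate_lcm(N, J, K, rng):
    """Simulate binary responses from a K-class latent class model."""
    theta = rng.beta(5, 5, size=(J, K))   # theta[j, k] = P(R[i, j] = 1 | class k)
    z = rng.integers(K, size=N)           # true latent class of each subject
    R = (rng.random((N, J)) < theta[:, z].T).astype(float)
    return R, z


def kmeans(X, K, rng, n_init=10, n_iter=100):
    """Plain Lloyd k-means with random restarts; returns the lowest-cost labels."""
    best_cost, best_labels = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(K)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = dist[np.arange(len(X)), labels].sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels


def spectral_init(R, K, rng):
    """Stage 1 (assumed form): k-means on rows of U_K * Sigma_K from the SVD of R."""
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    embedding = U[:, :K] * s[:K]
    return kmeans(embedding, K, rng)


def likelihood_refine(R, labels, K, eps=1e-3):
    """Stage 2 (assumed form): one pass of maximum-likelihood reassignment."""
    theta_hat = np.empty((R.shape[1], K))
    for k in range(K):
        members = R[labels == k]
        # Fall back to the overall item means if a class happens to be empty.
        theta_hat[:, k] = members.mean(axis=0) if len(members) else R.mean(axis=0)
    theta_hat = np.clip(theta_hat, eps, 1 - eps)   # keep the logarithms finite
    loglik = R @ np.log(theta_hat) + (1 - R) @ np.log(1 - theta_hat)
    return loglik.argmax(axis=1)


def misclustering_error(labels, truth, K):
    """Proportion of mislabeled subjects, minimized over label permutations."""
    return min(
        np.mean(np.asarray(perm)[labels] != truth)
        for perm in itertools.permutations(range(K))
    )


rng = np.random.default_rng(0)
R, z = simulate_lcm(N=500, J=250, K=3, rng=rng)
z_init = spectral_init(R, 3, rng)
z_refined = likelihood_refine(R, z_init, 3)
print(misclustering_error(z_init, z, 3), misclustering_error(z_refined, z, 3))
```

The refinement costs a single pass over the $N \times J$ response matrix, which is what keeps the second stage cheap compared with running a full iterative likelihood fit to convergence.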

Information

Type
Theory and Methods
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Psychometric Society
Figure 1 An illustration for spectral clustering: row vectors of $\mathbf{U}\boldsymbol{\Sigma}$ (left) and $\widehat{\mathbf{U}}\widehat{\boldsymbol{\Sigma}}$ (right). Setting: $N=500$, $J=250$, and $K=3$.

Algorithm 1

Algorithm 2

Algorithm 3

Table 1 Clustering error across different methods and settings under $100$ replicates.

Figure 2 Simulation 1: Mis-clustering proportions versus number of items $J$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(5,5)$.

Figure 3 Simulation 2: Mis-clustering proportions versus number of items $J$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 4 Simulation 3: Failure rate versus the number of items $J$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Table 2 Simulation 4: Running time (seconds) of different methods.

Figure 5 Simulation 4: Running time (seconds) of different methods.

Figure 6 Simulation 5-1: Mis-clustering proportions versus number of items $J$ under $N=J$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 7 Simulation 5-2: Mis-clustering proportions versus number of items $J$ under $N=0.5J$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 8 Simulation 5-3: Mis-clustering proportions versus number of items $J$ under $N=100$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 9 Simulation 5-4: Mis-clustering proportions versus sample size $N$ under $J=100$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 10 Simulation 5-5: Mis-clustering proportions versus sample size $N$ under $J=30$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 11 Simulation 5-6: Mis-clustering proportions versus number of items $J$ under imbalanced latent classes with $N=J$. Entries of $\boldsymbol{\Theta}$ are independently generated from $\text{Beta}(1,8)$.

Figure 12 Simulation 5-7: Mis-clustering proportions versus number of items $J$ under $N=J$. The $\boldsymbol{\Theta}$ matrix is set to a fixed design matrix.

Table 3 Proportion of successfully selecting $\widehat{K}$ to be the true $K$ based on $200$ simulation replicates, under different generative mechanisms of the item parameters $\boldsymbol{\Theta}=(\theta_{j,k})_{J\times K}$.

Table 4 Mis-clustering error and running time (seconds) of different methods on Senate Roll Call Votes data.