
Spectral goodness-of-fit tests for complete and partial network data

Published online by Cambridge University Press: 04 September 2025

Shane Lubold
Affiliation:
Department of Statistics, University of Washington, Seattle, WA, USA
Bolun Liu
Affiliation:
Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
Tyler McCormick*
Affiliation:
Department of Statistics, University of Washington, Seattle, WA, USA
*Corresponding author: Tyler McCormick; Email: tylermc@uw.edu

Abstract

Networks describe complex relationships between individual actors. In this work, we address the question of how to determine whether a parametric model, such as a stochastic block model or latent space model, fits a data set well, and will extrapolate to similar data. We use recent results in random matrix theory to derive a general goodness-of-fit (GoF) test for dyadic data. We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters in a number of commonly used network models. For example, we show how to select the dimension of the latent space in latent space models. Unlike other network GoF methods, our general approach does not require simulating from a candidate parametric model, which can be cumbersome with large graphs, and eliminates the need to choose a particular set of statistics on the graph for comparison. It also allows us to perform GoF tests on partial network data, such as Aggregated Relational Data. We show with simulations that our method performs well in many situations of interest. We analyze several empirically relevant networks and show that our method leads to improved community detection algorithms.
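As a rough illustration of the kind of spectral statistic the abstract describes, the sketch below centers and scales an adjacency matrix by a hypothesized edge-probability matrix and compares its largest eigenvalue to 2, the edge of the limiting spectrum. The scaling used here is the form commonly seen in the random matrix literature and is an assumption; the exact statement of Theorem 1 is not reproduced on this page. The function names `sample_graph` and `spectral_statistic` are hypothetical, and the sketch omits the bootstrap correction that the Figure 1 caption says is needed because convergence to the Tracy-Widom limit is slow.

```python
import numpy as np


def sample_graph(P, rng):
    """Sample a symmetric 0/1 adjacency matrix with no self-loops.

    P is a symmetric matrix of edge probabilities (hypothetical input format).
    """
    upper = np.triu(rng.random(P.shape) < P, k=1)  # independent upper-triangular edges
    return (upper | upper.T).astype(float)


def spectral_statistic(A, P):
    """Scaled largest eigenvalue of the centered adjacency matrix.

    Assumes every entry of P lies strictly between 0 and 1.
    Under a correctly specified P, this is expected to be of order 1;
    large positive values suggest lack of fit.
    """
    n = A.shape[0]
    scaled = (A - P) / np.sqrt((n - 1) * P * (1 - P))
    np.fill_diagonal(scaled, 0.0)  # no self-loops, so the diagonal carries no signal
    top = np.linalg.eigvalsh(scaled)[-1]  # eigvalsh returns eigenvalues in ascending order
    return n ** (2 / 3) * (top - 2.0)


rng = np.random.default_rng(1)
n = 200

# Null case: Erdos-Renyi data tested against the true Erdos-Renyi model.
P_er = np.full((n, n), 0.3)
A_er = sample_graph(P_er, rng)
print("ER data, ER model:", spectral_statistic(A_er, P_er))

# Misspecified case: two-block SBM data tested against a fitted Erdos-Renyi model.
labels = np.repeat([0, 1], n // 2)
P_sbm = np.where(labels[:, None] == labels[None, :], 0.5, 0.1)
A_sbm = sample_graph(P_sbm, rng)
p_hat = A_sbm.sum() / (n * (n - 1))  # plug-in estimate of the single edge probability
print("SBM data, ER model:", spectral_statistic(A_sbm, np.full((n, n), p_hat)))
```

In this sketch the statistic needs only one eigenvalue computation per candidate model, which is consistent with the abstract's point that no simulation from the candidate model is required. A formal test would compare the value to a Tracy-Widom quantile or a bootstrap reference distribution, neither of which is implemented here.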

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (https://creativecommons.org/licenses/by-nc-nd/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use and/or adaptation of the article.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Figure 1. Distribution of the statistic in Theorem 1 for $n = 50$ (left) and $n = 1000$ (right), where the solid red curve corresponds to the Tracy-Widom distribution with $\beta = 1$. The difference between the distributions decreases as $n$ increases, but the convergence is slow. This motivates the bootstrap correction given in Algorithm 1.


Figure 2. Correct classification rate for $n = 100, 200$ for the dimension of the latent space in Section 4.1. For a fixed $n$, increasing the dimension makes the problem harder and so the classification rate falls. However, the classification rate improves as we increase $n$.


Figure 3. Power function for the hypothesis in (12). The dotted-dashed curve corresponds to $n = 100$ and the dashed curve corresponds to $n = 200$. The null hypothesis is $\theta _3 = 0$. The horizontal line represents the $\alpha = 0.05$ threshold.


Figure 4. Left: Type I error for the ER model via ARD. Right: Power when ARD from an SBM is fitted to the ER model. When the hypothesized model is correct, we observe a Type I error centered around the testing level $\alpha = 0.05$. When ARD from a more complex model is fitted to a simpler hypothesized model (ER is a special case of SBM with one community), we observe very high power, which grows with the network size $n$.


Figure 5. Type I error and rejection rates for directed network data. The first row corresponds to a directed Erdős-Rényi model. In the top left panel, we plot the average rejection rate over 50 sets of simulations for $n = 25, 50, 100$ using the bootstrap test from Supplementary Material A. In the top middle panel, we plot the average rejection rate using the exponential test statistic in Theorem 4. In the top right panel, we plot the average rejection rate using the Tracy-Widom test statistic in Theorem 3. In the second row, we plot the average rejection rates using a directed stochastic block model (DSBM) with 2 communities and distinct cross-community probabilities. The bootstrap and exponential methods have good Type I error, whereas the Type I error of the Tracy-Widom statistic is somewhat larger. In terms of power against the DSBM, the bootstrap and Tracy-Widom tests obtain good power, but the exponential test does not. Overall, the bootstrap statistic performs better than the other two.


Figure 6. Misclassification rates for the political blog, Simmons College, and Caltech data.

Supplementary material: File

Lubold et al. supplementary material
