A generalized hypothesis test for community structure in networks

Researchers theorize that many real-world networks exhibit community structure where within-community edges are more likely than between-community edges. While numerous methods exist to cluster nodes into different communities, less work has addressed this question: given some network, does it exhibit statistically meaningful community structure? We answer this question in a principled manner by framing it as a statistical hypothesis test in terms of a general and model-agnostic community structure parameter. Leveraging this parameter, we propose a simple and interpretable test statistic used to formulate two separate hypothesis testing frameworks. The first is an asymptotic test against a baseline value of the parameter while the second tests against a baseline model using bootstrap-based thresholds. We prove theoretical properties of these tests and demonstrate how the proposed method yields rich insights into real-world data sets.


Introduction
Networks are everywhere in the modern world. From social media (Kane et al., 2014; Guo et al., 2020) to infrastructure (Mason and Verwoerd, 2007) to epidemiology (Leitch et al., 2019), many fields gather and analyze network data. The growth of the discipline of network science in the past two decades has brought with it an interesting phenomenon: despite being observed in vastly different fields, many networks share similar structural properties (Fotouhi et al., 2019). One of the most common structural properties is community structure, or the occurrence of tightly-knit groups of nodes. Because communities can yield insights into node characteristics shared within these groups, various community detection methods have been proposed to assign nodes to communities, including spectral methods (Ng et al., 2002; Rohe et al., 2011; Jin, 2015; Sengupta and Chen, 2015) and greedy algorithms (Clauset et al., 2004; Blondel et al., 2008). While the community detection problem has received substantial attention, less work has considered whether a given network demonstrates statistically meaningful community structure. In other words, is the near-ubiquity of community structure in observed networks informative about the underlying data-generating mechanisms, or is it simply a spurious artifact of random noise or of some unrelated feature such as degree heterogeneity?
As a motivating example, consider the classical influence maximization (IM) task, where the goal is to select a small number of seed nodes such that the spread of influence over the entire network is maximized (Kempe et al., 2003; Yanchenko et al., 2023). In networks lacking community structure, Osawa and Murata (2015) showed that seeding nodes based on simple, centrality-based heuristics yields seed sets with substantial spread. On the other hand, when networks exhibit community structure, selecting nodes from the same community results in ineffective seed sets because those nodes share many of the same neighbors; in this case, more sophisticated seeding algorithms are needed. Thus, understanding the significance of the community structure in a network is paramount for adequate seed selection in the IM problem.
This work aims to develop a formal statistical hypothesis test for community structure. The framework of statistical hypothesis testing consists of four fundamental components: (1) the model parameter of interest, (2) a test statistic that is typically based on an estimator of the model parameter, (3) a null model that reflects the absence of the property of interest, and (4) a rejection region for the test statistic. While previous literature exists on this problem, minimal emphasis has been placed on ingredients (1) and (3). In particular, none of the existing work identifies an underlying model parameter, and the null model has not been studied thoroughly. Two popular choices of null are the Erdős-Rényi (ER) model (Bickel and Sarkar, 2016; Yuan et al., 2022) and the configuration model (Lancichinetti et al., 2010; Palowitch et al., 2018; Li and Qi, 2020). While these testing methods carry rigorous statistical guarantees, choosing these null models means that they effectively test against the null hypothesis that the network is generated from a specific null model, rather than against the null hypothesis that there is no community structure. This leads to problems in empirical studies because the ER model, for example, is so unrealistic that almost all real-world networks diverge from it, leading to many false positives. On the other hand, a configuration model may itself exhibit a small amount of community structure. Thus, a careful treatment of the null model is essential for a statistical test to be relevant to applied network scientists.
The main contributions of this paper follow the four components of the hypothesis testing framework. First, from first principles, we describe a model-agnostic parameter based on expected differences in edge densities, which forms the basis of our statistical inference framework. Second, we propose an intuitive and interpretable test statistic that is directly connected to the model parameter. We leverage the model parameter and test statistic to formulate two types of hypothesis tests. The first is based on a user-specified threshold value of the parameter, which induces a model-agnostic test. For the second type, instead of specifying a baseline value of the parameter, the user specifies a baseline model or network property to test against. We derive theoretical results for the asymptotic cutoff in the first test and the bootstrap cutoff in the second. Finally, we apply our method to real-world network data sets that are well studied in the community structure literature, and the results yield rich, new insights about the underlying network structure. Source code for this work is available on GitHub: https://github.com/eyanchenko/NetHypTest. The roadmap for the rest of this paper is as follows: in Section 2 we propose the model parameter and corresponding estimator, as well as present the first (asymptotic) hypothesis test. Section 3 discusses the baseline-model test with a bootstrap threshold. We apply the method to synthetic data in Section 4 and real-world data sets in Section 5. We close by discussing the method in Section 6.
2 Model parameter and baseline-value testing framework

Notation
For this work, we only consider simple networks: unweighted and undirected, with no self-loops. Consider a network with n nodes and let A denote the n × n adjacency matrix, where A_ij = 1 if nodes i and j share an edge and A_ij = 0 otherwise. We write A ∼ P as shorthand for A_ij | P_ij ∼ Bernoulli(P_ij) for 1 ≤ i < j ≤ n, and define a community assignment to be a vector c ∈ {1, . . . , K}^n such that c_i = k means node i is assigned to community k ∈ {1, . . . , K}. We also introduce the usual asymptotic order notation for scalar sequences a_n and b_n.

Expected Edge Density Difference (E2D2) parameter and estimator
The first step of the hypothesis test is to identify the model parameter. Since there is no universal metric to quantify community structure, we construct one from first principles that applies to a large class of network models. For an observed network, a natural global measure of the strength of the community structure is the difference between the intra- and inter-community edge densities (see pages 83-84 of Fortunato, 2010). The larger this difference, the more prominent the community structure in the network. This definition makes sense at the sample level for a realized network, but we seek the model parameter at the population level, i.e., a parameter of the mechanism that generates the network. For this, we propose the Expected Edge Density Difference (E2D2) parameter. Consider a network model P and let c be a community assignment. Define

\bar{p}_{in}(c) = \frac{\sum_{i>j} P_{ij}\, 1(c_i = c_j)}{\sum_{i>j} 1(c_i = c_j)}, \qquad \bar{p}_{out}(c) = \frac{\sum_{i>j} P_{ij}\, 1(c_i \neq c_j)}{\sum_{i>j} 1(c_i \neq c_j)},

where 1(·) is the indicator function. Here, p̄_in(c) and p̄_out(c) are the average expected intra- and inter-community edge densities, respectively, hence the name Expected Edge Density Difference. This definition is sensible only when 1 < K < n, for if K = 1 then p̄_out(c) is ill-defined, and the same is true for p̄_in(c) if K = n. Intuitively, a data-generating mechanism with large p̄_in(c) − p̄_out(c) is likely to produce a network with a large difference in observed intra- and inter-community edge density and, therefore, prominent community structure. This difference, however, should be adjusted for the overall sparsity of the network and the number of groups to which each node can be assigned. So, we propose the E2D2 parameter γ(c, P) as

\gamma(c, P) = \frac{\bar{p}_{in}(c) - \bar{p}_{out}(c)}{K \bar{p}},    (2)

where \bar{p} = \sum_{i>j} P_{ij} / \binom{n}{2} is the overall probability of an edge between nodes in the network. We also define

\gamma(P) = \max_c \gamma(c, P)

as the maximum value of the E2D2 parameter, where the maximization is taken over the candidate community assignments c. In the Supplemental Materials, we sketch a proof showing that p̄_in(c) − p̄_out(c) ≤ p̄K. Thus, the K^{-1} term ensures that the E2D2 parameter is at most one. Indeed, γ(c, P) = 1 if and only if p̄_out(c) = 0 and there are K equally-sized communities. By construction, this parameter has a natural connection to the intuitive notion of community structure, since larger values correspond to more prominent levels of community structure in the data-generating process. One of the key advantages of the E2D2 parameter is that it is general and model-agnostic. By model-agnostic, we mean that the definition applies to any network-generating model where edges are conditionally independent given the model parameters P_ij. This encompasses a wide range of models, including the Erdős-Rényi model (Erdős and Rényi, 1959), the Chung-Lu model (Chung and Lu, 2002), stochastic block models (Holland et al., 1983), degree-corrected block models (Karrer and Newman, 2011), latent space models (Hoff et al., 2002), random dot product graphs (Athreya et al., 2017), and more. Some other models, such as the Barabási-Albert (Barabási and Albert, 1999), configuration (Fosdick et al., 2018), and exponential random graph models (Robins et al., 2007), do not fit into this framework. While by no means applicable to all network models, the general definition of the E2D2 parameter, requiring only conditional independence of edges, gives the parameter great flexibility.
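To make the definition concrete, the following is a minimal numerical sketch of γ(c, P) for a given probability matrix P and assignment c, following (2); the function name and numpy-based implementation are illustrative only and are not taken from the repository linked above.

```python
import numpy as np

def e2d2_parameter(P, c, K=None):
    """E2D2 parameter gamma(c, P) = (pbar_in - pbar_out) / (K * pbar); requires 1 < K < n.

    P : (n, n) symmetric matrix of edge probabilities (or an adjacency matrix).
    c : length-n vector of community labels.
    K : number of communities; inferred from c if not supplied.
    """
    c = np.asarray(c)
    n = P.shape[0]
    iu = np.triu_indices(n, k=1)                 # all pairs i < j
    probs = np.asarray(P, dtype=float)[iu]
    same = c[iu[0]] == c[iu[1]]                  # intra-community pairs
    if K is None:
        K = len(np.unique(c))
    p_in = probs[same].mean()                    # average (expected) intra-community density
    p_out = probs[~same].mean()                  # average (expected) inter-community density
    p_bar = probs.mean()                         # overall density
    return (p_in - p_out) / (K * p_bar)
```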
The second ingredient in our hypothesis testing recipe is an estimator of the E2D2 parameter. Using the same notation as above, define the sample analogues

\hat{p}_{in}(c) = \frac{\sum_{i>j} A_{ij}\, 1(c_i = c_j)}{\sum_{i>j} 1(c_i = c_j)}, \qquad \hat{p}_{out}(c) = \frac{\sum_{i>j} A_{ij}\, 1(c_i \neq c_j)}{\sum_{i>j} 1(c_i \neq c_j)}.    (4)

Then we estimate γ(c, P) from (2) as

T(c, A) = \frac{\hat{p}_{in}(c) - \hat{p}_{out}(c)}{K \hat{p}},

where \hat{p} = \sum_{i>j} A_{ij} / \binom{n}{2}, and p̂_in(c), p̂_out(c) and p̂ are the sample versions of p̄_in(c), p̄_out(c), and p̄, respectively. In other words, T(c, A) is the observed edge density difference, the sample version of the E2D2 parameter. Below we find the maximum of this test statistic over all possible community labels c, so we also introduce the notation

T(A) = \max_c T(c, A).    (6)

Since p̂ depends on A but not on c, T(A) maximizes the intra-community edge density over the candidate values of c with a penalty for larger inter-community edge density, akin to the objective function in Mancoridis et al. (1998). This metric has a natural connection to the well-known Newman-Girvan modularity (Newman, 2006). The modularity Q(c, A) of a network partition c is defined as

Q(c, A) = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{d_i d_j}{2m} \right) 1(c_i = c_j),    (7)

where d_i is the degree of node i and m is the total number of edges in the network. Rearranging (7), we see that

Q(c, A) = \frac{1}{2m} \sum_{i,j} A_{ij}\, 1(c_i = c_j) \; - \; \frac{1}{2m} \sum_{i,j} \frac{d_i d_j}{2m}\, 1(c_i = c_j).

The connection is now immediate: the first term is the (scaled) number of intra-community edges and has a one-to-one relationship with p̂_in(c). The second term can be thought of as a penalty based on the expected number of intra-community edges in a random network with the given degree sequence. In light of these similarities, the proposed estimator has a key advantage over modularity. The penalty term of modularity assumes the configuration model as the null model, i.e., it compares the strength of the community structure against a random network with an identical degree sequence. The penalty term of the proposed method, p̂_out(c), is not model-dependent. Hence, any model can be chosen as the null model, giving the proposed estimator far greater flexibility than modularity. The numerator of T(c, A) can also be written in the general modularity formulation of Bickel and Chen (2009).
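Since T(c, A) is simply the plug-in version of γ(c, P) with the observed adjacency matrix in place of P, it can be computed with the same routine as above (again an illustrative sketch; the name e2d2_statistic is introduced here and is not a function from the repository):

```python
def e2d2_statistic(A, c, K=None):
    """Sample E2D2 statistic T(c, A): the plug-in version of gamma(c, P) with A in place of P."""
    return e2d2_parameter(A, c, K)
```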

Algorithm for computing the E2D2 estimator
The E2D2 estimator is also of independent interest as an objective function for community detection. Finding the maximum of T(c, A) is a combinatorial optimization problem with O(K^n) candidate solutions. An exhaustive search is thus infeasible for even moderate n, so we propose a greedy, label-switching algorithm to approximate T(A). We briefly explain the idea here and present the full procedure in Algorithm 1. First, each node is initialized with a community label c_i ∈ {1, . . . , K}. Then, for each node i, its community assignment is tentatively switched to each of its neighboring communities. The new label of node i is whichever switch yielded the largest value of the E2D2 estimator (or node i keeps its original community if none of the swaps increased T(c, A)). This process repeats for all n nodes. The algorithm stops when all nodes have been cycled through and no labels have changed; the current labels c are then returned. The assumption that K is known is rather strong and unrealistic for most real-world networks. We view it as reasonable here, however, since the primary goal of this work is not to propose a new community detection algorithm. In practice, any off-the-shelf method can be used to estimate K before running the proposed algorithm.
Algorithm 1: Greedy label switching
Input: n × n adjacency matrix A, number of communities K
Result: Community labels c
Initialize labels c ∈ {1, . . . , K}^n; set run = 1
while run > 0 do
    run = 0
    Randomly order the nodes
    for i in 1, . . . , n do
        Find the neighboring communities, K_i, of node i
        For each k ∈ K_i, evaluate T(c', A) where c' equals c except c'_i = k
        If the best such swap increases T(c, A), set c_i to that community and run = 1
    end for
end while
return c
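A direct (and deliberately non-optimized) Python sketch of this label-switching search is given below; it re-evaluates the full statistic for every candidate swap, so it is meant to illustrate Algorithm 1 rather than match the efficiency of the repository implementation, and the function name is introduced here for illustration:

```python
def greedy_e2d2(A, K, rng=None):
    """Greedy label-switching approximation of T(A) = max_c T(c, A) (sketch of Algorithm 1)."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A)
    n = A.shape[0]
    c = rng.integers(1, K + 1, size=n)                   # random initial labels
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(n):                     # visit the nodes in random order
            neighbor_comms = set(c[A[i] > 0]) - {c[i]}   # communities of node i's neighbors
            best_label = c[i]
            best_val = e2d2_statistic(A, c, K)
            for k in neighbor_comms:
                c_try = c.copy()
                c_try[i] = k
                val = e2d2_statistic(A, c_try, K)
                if val > best_val:                       # keep the swap only if T increases
                    best_label, best_val = k, val
            if best_label != c[i]:
                c[i] = best_label
                changed = True
    return c, e2d2_statistic(A, c, K)
```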

Baseline-value test
We now leverage the E2D2 parameter and estimator to formulate our first hypothesis test. Because this parameter is interpretable and meaningful as a descriptor of the network-generating process, we consider the scenario where the researcher has a problem-specific benchmark value of the E2D2 parameter that she would like to test against. In other words, the baseline value has domain-relevant meaning as "no community structure." We then must determine whether any assignment exceeds this threshold, so the formal test is

H_0: \gamma(P) \le \gamma_0 \quad \text{versus} \quad H_1: \gamma(P) > \gamma_0,    (9)

for some γ_0 ∈ [0, 1). Naturally, we reject H_0 when T(A) exceeds some cutoff C that depends on the network size n and null value γ_0. To obtain a test with level α, we should set C as the (1 − α) quantile of the null distribution of T(A). This distribution depends on the data-generating matrix P under the null hypothesis, and so implicitly also depends on γ(P) = γ_0. Deriving it is difficult, however, because the test statistic is a maximum taken over O(K^n) possible community assignments and these random variables are highly correlated. We propose to sidestep this difficult theoretical problem with an asymptotic cutoff. We first make the following three assumptions.
A1. For any candidate community assignment, at least two community sizes must grow linearly with n.
A2. The number of communities K_n is known, but is allowed to diverge.
A3. p̄ satisfies a sparsity requirement, depending on the asymptotics of K_n, which ensures that k_n/(K_n p̄) → 0 for k_n = {(log K_n)/n}^{1/2}.
A1 lays down a basic requirement for any legitimate candidate community assignment, since otherwise one community would dominate the entire network, while A2 allows the number of communities K_n to diverge. A3 is a sparsity requirement for p̄ that depends on the asymptotics of K_n. If K_n ≡ K is fixed (e.g., Bickel and Chen, 2009; Sengupta and Chen, 2018), then we require n^{1/2} p̄ → ∞. A typical sparsity assumption is that p̄ = O(n^{-1}), so that the expected number of edges in the network grows linearly with n. Our result requires a stronger condition, which places us in the semi-dense regime. While this is not ideal, proofs in the dense regime (fixed p̄) are common in the literature (e.g., Bickel and Sarkar, 2016); our work, in fact, holds under the less stringent condition n^{1/2} p̄ → ∞, which allows p̄ → 0. This assumption also implies that the test has larger power when the network is denser and/or has more communities, since this leads to a smaller cutoff, as we see in the formal result that now follows.
Theorem 2.1. Let A ∼ P and consider testing H_0 : γ(P) ≤ γ_0 as in (9). Let A1 and A2 hold and consider the cutoff C defined in terms of k_n = {(log K_n)/n}^{1/2} and an arbitrarily small ε > 0 chosen by the user. Then, when the null hypothesis is true (γ(P) ≤ γ_0), the type-I error goes to 0, i.e., for any η > 0, P{T(A) > C} ≤ η for all sufficiently large n. If the alternative hypothesis is true (γ(P) > γ_0), then the power goes to 1, i.e., P{T(A) > C} → 1 as n → ∞. A proof of the theorem, as well as proofs of all subsequent theoretical results, is given in the Supplemental Materials. The proof approximates the cutoff under the null hypothesis using a union bound and then leverages Hoeffding's inequality to show that the probability of failing to reject under the alternative hypothesis goes to 0. The cutoff depends on K and, for theoretical purposes, we assume that K is known. In practice, we run a community detection algorithm on the network (e.g., the Fast Greedy algorithm of Clauset et al., 2004) and then use the number of communities returned by this algorithm, K, to compute T(A) and construct the threshold. Additionally, A3 ensures that the k_n/(K p̄) term converges to 0, so that C → γ_0 as n → ∞. Empirically, consistent with this assumption, the test has larger power for denser networks (large p̄) and for more communities.
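For illustration, a sketch of the resulting decision rule is below. The cutoff formula here is an assumption on our part: we take C = (1 + ε){γ_0 + k_n/(K p̂)}, which is consistent with the k_n/(K p̄) term discussed after the theorem, but the authoritative expression is the one stated in Theorem 2.1.

```python
def baseline_value_test(A, gamma0, K, eps=1e-4):
    """Asymptotic baseline-value test of H0: gamma(P) <= gamma0 (illustrative sketch).

    CAUTION: the cutoff below, C = (1 + eps) * (gamma0 + k_n / (K * p_hat)) with
    k_n = sqrt(log(K) / n), is an assumed form for illustration; the exact cutoff
    is the one given in Theorem 2.1.
    """
    A = np.asarray(A)
    n = A.shape[0]
    p_hat = A[np.triu_indices(n, k=1)].mean()        # observed edge density
    k_n = np.sqrt(np.log(K) / n)
    C = (1 + eps) * (gamma0 + k_n / (K * p_hat))     # assumed cutoff form (see docstring)
    _, T_obs = greedy_e2d2(A, K)
    return T_obs > C, T_obs, C
```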
This test is fundamentally different from existing ones in the literature because the practitioner chooses the value of γ_0 for her particular problem, inducing a model-agnostic test. In other words, the null hypothesis is not a baseline model, but instead a baseline quantity of community structure in the network-generating process. Since the E2D2 parameter has a natural connection to community structure, this test is also more easily interpretable with respect to this feature. Indeed, rejecting the null hypothesis means that the data-generating matrix for the observed network has greater community structure (as measured by the E2D2 parameter) than some baseline value (γ_0).
A natural question is how to choose γ_0. We stress that this choice depends on the domain and question of interest. There are some special cases, however, that yield insight into selecting a meaningful value. For example, assume that the practitioner believes that her network has two roughly equal-sized communities. Then setting γ_0 = (µ − 1)/(µ + 1) is equivalent to testing whether the average intra-community edge density is at least µ > 1 times the average inter-community edge density, i.e., p̄_in ≥ µ p̄_out, as the following sketch shows.
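A short verification of this equivalence (a sketch under the stated assumption of K = 2 communities of roughly equal size, using the definition of γ(c, P) in (2)):

```latex
% With two equal-sized communities, the numbers of intra- and inter-community pairs are
% both approximately n^2/4, so \bar{p} \approx (\bar{p}_{in} + \bar{p}_{out})/2 and
\gamma(c, P) \;=\; \frac{\bar{p}_{in}(c) - \bar{p}_{out}(c)}{2\,\bar{p}}
            \;\approx\; \frac{\bar{p}_{in} - \bar{p}_{out}}{\bar{p}_{in} + \bar{p}_{out}}.
% Hence
\gamma(c, P) \;\ge\; \frac{\mu - 1}{\mu + 1}
\;\Longleftrightarrow\;
(\mu + 1)\,(\bar{p}_{in} - \bar{p}_{out}) \;\ge\; (\mu - 1)\,(\bar{p}_{in} + \bar{p}_{out})
\;\Longleftrightarrow\;
\bar{p}_{in} \;\ge\; \mu\,\bar{p}_{out}.
```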
The special case γ_0 = 0 is also worth further discussion. It is trivial to show that if P is from an Erdős-Rényi (ER) model (Erdős and Rényi, 1959), where P_ij = p for all i, j, then γ(P) = 0. We show in the Supplemental Materials, however, that the converse is also true, i.e., γ(P) = 0 only if P is from an ER model. This means that setting γ_0 = 0 is equivalent to testing against the null hypothesis that the network is generated from an ER model; any non-ER data-generating model has γ(P) > 0 and will therefore (asymptotically) be rejected when γ_0 = 0. But there are many models (e.g., Chung-Lu (Chung and Lu, 2002), small world (Watts and Strogatz, 1998)) which are not ER but also do not intuitively have community structure. This connection between the model-agnostic E2D2 parameter and its behavior under specific model assumptions motivates the test in the following section.

Baseline-model test
In the previous section, we derived a hypothesis test based on a user-defined benchmark value that does not refer to a null model. There may be situations, however, where the practitioner does not have a meaningful way to set the null parameter γ_0. In this case, we set the null hypothesis in reference to a particular null model and/or model property. If P(ϕ) is the true data-generating model for A, defined by parameters ϕ, then instead of testing γ(P(ϕ)) ≤ γ_0, the null hypothesis is now that γ(P(ϕ)) is less than or equal to the largest value of the E2D2 parameter under the null model. Since the specific parameters of the null model are typically unknown, they are estimated from the observed network as φ̂. To fix ideas, we provide the following two examples.
Erdős-Rényi model: The simplest null model to consider is the ER model, where P_ij = p for all i, j, so that the value of p completely defines the model. We estimate p by the observed edge density of the network, i.e., p̂ = Σ_{i<j} A_ij / {n(n − 1)/2}. The test then determines whether the observed value of the E2D2 estimator is greater than what could plausibly arise if the network were generated from an ER model. Recall that this is equivalent to the test in (9) with γ_0 = 0.
Chung-Lu model: Another sensible null model is the Chung-Lu (CL) model (Chung and Lu, 2002), where P_ij = θ_i θ_j for some weight vector θ = (θ_1, . . . , θ_n) which uniquely defines the model. This model is similar to the configuration model except that, instead of preserving the exact degree sequence, it preserves the expected degree sequence. We estimate θ using the rank-1 adjacency spectral embedding (Sussman et al., 2012),

\hat{\theta} = \hat{\lambda}^{1/2} \hat{u},    (12)

where λ̂ is the largest-magnitude eigenvalue of A and û is the corresponding eigenvector. The test is now whether the value of the E2D2 estimator is greater than what would likely have been observed if the CL model had generated the observed network.
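A minimal sketch of this rank-1 ASE step (function name introduced here; the sign flip is a convention we add so that the estimated weights are non-negative, as expected under a CL model):

```python
def chung_lu_ase(A):
    """Rank-1 adjacency spectral embedding: theta_hat = sqrt(|lambda_hat|) * u_hat, cf. (12)."""
    vals, vecs = np.linalg.eigh(np.asarray(A, dtype=float))
    idx = np.argmax(np.abs(vals))            # largest-magnitude eigenvalue
    lam, u = vals[idx], vecs[:, idx]
    if u.sum() < 0:                          # resolve the eigenvector sign ambiguity
        u = -u
    return np.sqrt(np.abs(lam)) * u
```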

Bootstrap test
To carry out this test, we propose a bootstrap procedure. We describe the method for the CL null; full details for both the ER and CL nulls can be found in Algorithm 2, and the bootstrap test can be trivially modified to test against many other null models. The approach directly estimates the 1 − α quantile of the null distribution of T(A) with a parametric bootstrap and then uses this quantity as the testing threshold. In particular, we first compute the test statistic T(A) as in (6). Since the null distribution of T(A) is unknown, we must simulate draws from this distribution in order to have a comparison with our observed test statistic. We therefore estimate θ with the adjacency spectral embedding (ASE) as in (12); then, for b = 1, . . . , B, we draw θ*_b = (θ*_b1, . . . , θ*_bn)^T with replacement from θ̂ = (θ̂_1, . . . , θ̂_n)^T, generate a CL network A*_b from θ*_b, and compute T*_b = max_c{T(A*_b, c)}. The empirical distribution of {T*_b}_{b=1}^B serves as a proxy for the null distribution of T(A), the p-value is the proportion of bootstrap statistics at least as large as T(A), and we reject H_0 if the p-value is less than a pre-specified α. We highlight a nuanced but important distinction between the interpretations of the baseline-value and baseline-model tests. For each bootstrap iteration, we re-estimate the number of groups K*_b, which is then used when computing T*_b = max_c{T(A*_b, c)}. Therefore, it is possible that for a given iteration K*_b ≠ K, where K was estimated from the original network A. Thus, the results of the bootstrap test are unconditional on the value of K. Rejecting against the ER null, for example, means that the observed network has greater community structure than could be expected from a network generated from an ER model, for any K used to compute T(A). Conversely, the baseline-value test results are conditional on K: rejecting that test means that the observed network has stronger community structure than a network model with γ(P) = γ_0 could generate with K groups.
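The following is a compact sketch of the bootstrap test for the CL null, assembled from the description above and the illustrative helpers greedy_e2d2 and chung_lu_ase; unlike the procedure described above, it keeps K fixed across bootstrap replicates for simplicity, and the add-one p-value is a common convention rather than necessarily the one used in Algorithm 2:

```python
def bootstrap_cl_test(A, K, B=200, rng=None):
    """Parametric-bootstrap p-value against the Chung-Lu null (illustrative sketch of Algorithm 2)."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A)
    n = A.shape[0]
    _, T_obs = greedy_e2d2(A, K, rng)                        # observed statistic T(A)
    theta = chung_lu_ase(A)                                  # estimated CL weights
    T_boot = np.empty(B)
    for b in range(B):
        theta_b = rng.choice(theta, size=n, replace=True)    # resample the weights
        P_b = np.clip(np.outer(theta_b, theta_b), 0.0, 1.0)  # CL edge probabilities
        A_b = (rng.random((n, n)) < P_b).astype(int)
        A_b = np.triu(A_b, k=1)
        A_b = A_b + A_b.T                                    # symmetric, no self-loops
        K_b = K   # the paper re-estimates K on each replicate; fixed here for simplicity
        _, T_boot[b] = greedy_e2d2(A_b, K_b, rng)
    p_value = (1 + np.sum(T_boot >= T_obs)) / (B + 1)        # add-one bootstrap p-value
    return p_value, T_obs, T_boot
```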
The bootstrap test naturally fits into the existing community detection testing literature, as it considers a specific null model. Indeed, to the authors' knowledge, this is the first statistical hypothesis test for the significance of communities in a degree-corrected block model (Karrer and Newman, 2011), as the CL model serves as the null model of "no communities." Nevertheless, the bootstrap method's strength is its flexibility, owing in part to the general definition of the E2D2 metric, in that it can easily accommodate any null model. Additionally, it allows for more general inference, as multiple realistic null distributions can be tested against, leading to a rich understanding of the network's community structure. The bootstrap test also yields an insightful visual tool in which the observed value of the test statistic is plotted next to the bootstrap histogram. This tool helps visualize how the strength of community structure observed in the network compares to benchmark networks with the same density, the same degree distribution, and so on. We study these plots further in Section 5. Lastly, a fundamental challenge of bootstrapping networks is that, in general, we only observe a single network. If we knew the model parameters (e.g., p or θ), it would be trivial to generate bootstrap replicates. Since the true model parameters are unknown, we must first estimate them and then generate networks using the estimated parameters. Thus, the quality of the bootstrap procedure depends on the quality of these estimates. In the following subsection, we formally prove certain properties of this procedure.

Bootstrap theory
We now turn our attention to theoretical properties of the bootstrap. We want to show that if A, H ∼ P(ϕ) and Â* ∼ P(φ̂), where φ̂ estimates ϕ using A, then T(Â*) converges to T(H). To have any hope of showing this result, Â* must be similar to A. We consider the Wasserstein p-distance and adopt the notation of Levin and Levina (2019). Let p ≥ 1, let A_1 and A_2 be adjacency matrices on n nodes, and let Γ(A_1, A_2) be the set of all couplings of A_1 and A_2. The Wasserstein p-distance between A_1 and A_2 is then the infimum, over couplings in Γ(A_1, A_2), of the expected value of the matching distance min_{Π ∈ Π_n} ∥A_1 − Π A_2 Π^T∥_1 (suitably normalized and raised to the p-th power), where Π_n is the set of all n × n permutation matrices and ∥A∥_1 = Σ_{i,j} |A_ij|. The following results show that Â* converges in distribution to A in the Wasserstein p-distance sense for both the ER and CL nulls. For all results in this section, we assume that the model parameters do not depend on n, i.e., p_n = p ̸→ 0.
Lemma 3.1. Let A, H ∼ ER(p) and Â* ∼ ER(p̂), where p̂ = Σ_{i,j} A_ij / {n(n − 1)}. Then Â* converges to H in Wasserstein p-distance.
Lemma 3.2. Let A, H ∼ CL(θ) and Â* ∼ CL(θ̂), where θ̂ is found using (12). Then Â* converges to H in Wasserstein p-distance.
Both proofs are based on Theorem 5 in Levin and Levina (2019). These results show that, for large n, the networks generated from the bootstrap model are similar to the networks generated from the original model. Since there is only one parameter to estimate in the ER model but n parameters in the CL model, it is sensible that the rate of convergence is much faster for the former.
Next, we show that, for a fixed c, the distribution of the bootstrapped test statistic converges to the same distribution as that of the original test statistic. We show this result only for the ER null. First, we introduce some useful notation. Let t(A, c) denote the numerator of the E2D2 estimator, i.e.,

t(A, c) = \frac{1}{m_{in}} \sum_{i>j} A_{ij}\, 1(c_i = c_j) - \frac{1}{m_{out}} \sum_{i>j} A_{ij}\, 1(c_i \neq c_j),

where m_in and m_out are the total possible numbers of intra- and inter-community edges, respectively. Additionally, let K be the number of groups to which a node can be assigned, which is assumed to be fixed. Then we have the following result (Lemma 3.3), stated for A, H ∼ ER(p) and Â* ∼ ER(p̂) with a fixed community assignment c.
The proof is a simple application of the central limit theorem for independent but non-identically distributed variables and iterated expectations. This lemma implies that the E2D2 estimator T(A, c) consistently estimates the E2D2 model parameter γ(c, P) for a particular c. Additionally, the distribution of the test statistic converges to the same normal distribution whether the network was generated from the original model or from the bootstrap model. This result is more difficult to show for the CL null model. We cannot use the ideas from the proof of Lemma 3.3 because θ̂ has a more complicated form and the bootstrap step is more involved than for the ER null; nor can we use the results in Levin and Levina (2019), because the E2D2 estimator cannot be written as a U-statistic.
Ideally, we would like to show that this result also holds when using the community assignment that maximizes the E2D2 estimator. Unfortunately, showing this convergence for arbitrary statistics is difficult (e.g., Levin and Levina, 2019), and it is challenging in our particular case for several reasons. First, the E2D2 estimator T(A) is the maximum of O(e^n) statistics T(A, c_i), meaning the maximum is taken over a set of random variables whose size goes to infinity. Additionally, these variables are non-trivially correlated since they all depend on the same adjacency matrix. Another way to view the difficulty is that, for c* = arg max_c{T(A, c)}, the inclusion of a pair (i, j) in the intra- or inter-community sum now depends on A_ij itself, so even computing the mean and variance of this estimator becomes difficult. Bootstrap theoretical results for the maximum of the test statistic are an important avenue for future work.

Settings
We now study the performance of the proposed method on synthetic data. Our primary metric of interest is the rejection rate of the test under different settings. We consider two settings for the baseline-value test as well as settings for the baseline-model test with both the ER and CL nulls. In each case, we first fix γ(P) and increase the number of nodes n; then we fix n and increase γ(P). In both cases, we expect an increasing rejection rate. We run 100 Monte Carlo simulations and compute the fraction of rejections. Our bootstrap method uses B = 200 bootstrap samples and we fix the level of the test at α = 0.05. We chose the two Spectral methods proposed in Bickel and Sarkar (2016) as benchmarks since these are leading and well-established methods with formal guarantees. Even though the authors suggest using only the adjusted method, we compare against both since, similar to our proposed framework, the authors propose a version of the test with an asymptotic threshold and an adjusted version with a bootstrap correction.

Test against baseline value
First, we consider the baseline-value test using Theorem 2.1, i.e., we reject H_0 if T(A) exceeds the cutoff of Theorem 2.1 computed with k_n = {(log K)/n}^{1/2} and ε = 0.0001. We let n = 1000, 1200, . . . , 3000 and generate networks with K communities. When n = 1000, we let K = 2 with 60% and 40% of the nodes in the two communities. When 1000 < n ≤ 2000, K = 4 with 40%, 20%, 20%, and 20% of the nodes in each community. For n > 2000, K = 6 with 25%, 15%, . . . , 15% of nodes in each community. The edge probabilities P_ij follow a block-structured distribution, depending on the true community labels c*, with parameters (p_l, p_u) = (0.075, 0.095), (0.100, 0.120), and (0.125, 0.145) when K = 2, 4, and 6, respectively, in order to fix E{γ(P)} ≈ 0.20 for all parameter combinations. This data-generating model has a block structure but with heterogeneous edge probabilities, so P is not an SBM. We test against the null hypothesis H_0: γ(P) ≤ γ_0 = 0.10. The results are in Figure 1(a). Since E{γ(P)} > γ_0, the test should reject. The cutoff is asymptotic, however, so the test has low power for small n. When n > 2000, the test has high power, as n is large enough for the asymptotic results to apply.
For the next setting, we fix n = 5000 and let K = 4 with 40%, 20%, 20%, and 20% of the nodes in each community. The edge probabilities P_ij again follow a block-structured distribution, now governed by a parameter a = 0.10, 0.09, . . . , 0.01. We find that E{γ(P)} = 0.08, 0.09, . . . , 0.23 across these values of a, and we test against the null hypothesis H_0: γ(P) ≤ γ_0 = 0.10. The results are in Figure 1(b). When a = 0.10, γ(P) = 0.08 < γ_0 = 0.10, so we expect the test to fail to reject, which it does. The test should have a high rejection rate when a = 0.07, since γ(P) = 0.12 > γ_0; the power of the test, however, does not reach one until γ(P) = 0.15. This small discrepancy is due to the fact that the cutoff is asymptotic while we are generating networks with a finite number of nodes. When γ(P) ≥ 0.15, the test consistently rejects, as expected.

Test against ER null
Next, we study the baseline-model test using the bootstrap procedure described in Algorithm 2. First, we test against the null hypothesis that the network was generated from the ER model. We showed that this test is equivalent to the asymptotic test with γ_0 = 0, so we can also compare with the rejection threshold from Theorem 2.1. A natural alternative to the ER model is the stochastic block model (SBM) (Holland et al., 1983), where P_ij = B_{c_i, c_j} and c corresponds to the true community labels. We let n = 250, 500, 1000, 1500, 2000 and generate networks from an SBM with K communities. For n < 1000, we set K = 2 with 60% and 40% of the nodes in each community; K = 4 with a 40%, 20%, . . . , 20% split for 1000 ≤ n < 2000; and K = 6 for n = 2000 with 25%, 10%, . . . , 10% of nodes in each community. We fix the inter-community edge probability B_ij = 0.025 for i ≠ j and set B_ii = 0.05, 0.075, 0.10 when K = 2, 4, 6, respectively, for i = 1, . . . , K. This ensures a fixed γ(P) ≈ 0.33. The results are in Figure 1(c). Since the network is generated from an SBM and we are comparing with an ER null, we expect the test to yield a large rejection rate. Both Spectral methods and the proposed bootstrap approach have rejection rates that increase with n, with the Spectral methods having larger power. The asymptotic method has a large rejection rate for n = 250, drops to zero at n = 500, and increases again from n = 1000. The reason for this is that, for n = 250, the number of communities K is overestimated, which inflates the test statistic more than the cutoff and leads to a large rejection rate. For n ≥ 500, K is more accurately estimated, so we see the expected trends.
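For reference, the SBM networks in this scenario can be generated along the following lines (an illustrative sketch, not the script used for the figures; simulate_sbm is a name introduced here):

```python
def simulate_sbm(sizes, B_mat, rng=None):
    """Draw an SBM adjacency matrix with P_ij = B[c_i, c_j] for block labels c."""
    rng = np.random.default_rng(rng)
    c = np.repeat(np.arange(len(sizes)), sizes)    # block labels 0, ..., K-1
    P = B_mat[np.ix_(c, c)]
    A = (rng.random(P.shape) < P).astype(int)
    A = np.triu(A, k=1)
    A = A + A.T                                    # undirected, no self-loops
    return A, c + 1

# e.g. the n = 500, K = 2 setting above: 60%/40% split, B_ii = 0.05, B_ij = 0.025
# A, c = simulate_sbm([300, 200], np.array([[0.050, 0.025], [0.025, 0.050]]), rng=0)
```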
In the second scenario, we fix n = 1000. The intra-community edge probability is now B_11 = B_22 = 0.05 and the inter-community edge probability is B_12 = 0.05, 0.04, . . . , 0.01, which gives γ(P) = 0, 0.11, 0.25, 0.42, 0.65. The results are in Figure 1(d). When γ(P) = 0, the network is generated from an ER model, meaning the null hypothesis is true, so we expect a low rejection rate. The unadjusted Spectral method has a large Type I error and the bootstrap method rejects slightly more often than the α level of the test. When γ(P) > 0, the null hypothesis is false, so we expect a large rejection rate. Both Spectral methods reach a power of one by γ(P) = 0.25, while the bootstrap method does not reach this power until γ(P) = 0.42. The asymptotic test, as expected, is the most conservative. While the Spectral methods outperform the bootstrap test in both scenarios, this is to be expected: these methods were designed for exactly this scenario and this null. The proposed test is more general and can be applied to many more settings, but it is unlikely to beat a method designed for a specific scenario.

Test against CL null
Lastly, we study the baseline-model test with the CL model as the null, again using Algorithm 2. We drop both Spectral methods from these simulations since the ER null is hard-coded into them, making them inapplicable for testing against the CL null. We also drop the asymptotic test from the comparison because it is unclear how to set γ_0 for this null model; in fact, finding a closed-form expression for the rejection threshold under the CL null is an interesting avenue for future work. For the alternative model, we consider the degree-corrected block model (DCBM) (Karrer and Newman, 2011), where P_ij = θ_i θ_j B_{c_i, c_j} and the θ_i are node-specific degree parameters. We generate networks from a DCBM with n = 250, 500, . . . , 1000. For n ≤ 500, we let K = 2 with 60% and 40% of the nodes in each community. For n > 500, K = 3 with 40%, 30%, and 30% of the nodes in each community. The degree parameters are generated as θ_i ~iid Uniform(0.2, 0.3), and B_ii = 1 for i = 1, . . . , K. In order to preserve γ(P) ≈ 0.33, we set B_ij = 0.5, 0.4 when K = 2, 3, respectively, for i ≠ j. The results are in Figure 1(e). Since the networks are generated from a DCBM, we expect a large rejection rate, and indeed the rejection rate increases monotonically with n, reaching a power of one by n = 1000.
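The DCBM networks used here differ from the SBM sketch above only through the degree parameters; a matching sketch (again illustrative, with assumed function names):

```python
def simulate_dcbm(sizes, B_mat, theta, rng=None):
    """Draw a degree-corrected block model with P_ij = theta_i * theta_j * B[c_i, c_j]."""
    rng = np.random.default_rng(rng)
    c = np.repeat(np.arange(len(sizes)), sizes)
    P = np.outer(theta, theta) * B_mat[np.ix_(c, c)]
    A = (rng.random(P.shape) < np.clip(P, 0.0, 1.0)).astype(int)
    A = np.triu(A, k=1)
    A = A + A.T
    return A, c + 1

# e.g. the n = 500, K = 2 setting above: theta_i ~ Uniform(0.2, 0.3), B_ii = 1, B_12 = 0.5
# rng = np.random.default_rng(0)
# theta = rng.uniform(0.2, 0.3, size=500)
# A, c = simulate_dcbm([300, 200], np.array([[1.0, 0.5], [0.5, 1.0]]), theta, rng)
```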
For the second scenario, we fix K = 4 with 40%, 20%, . . . , 20% of the nodes in each community.

Table 1: The number of nodes n and edges m for the real-world networks, the observed value T(A) of the E2D2 statistic, and the largest null value γ_0 such that the baseline-value test would still be rejected. We also report the p-values for the adjusted Spectral method and the bootstrap method against different null hypotheses (ER = Erdős-Rényi, CL = Chung-Lu).

Real data analysis
We now study the proposed method on two networks: DBLP and hospital interactions.
DBLP is a computer science bibliography website, and this network was extracted by Gao et al. (2009) and Ji et al. (2010). Each node represents an author and an edge signifies that the two authors attended the same conference. We only consider authors from two research areas, databases and information retrieval. The hospital network (Vanhems et al., 2013) captures the interactions between patients and healthcare providers at a hospital in France. Each person is represented by a node and an edge signifies that the two people were in close proximity.
For each data set, we consider several metrics. First, we compute T(A) and find the largest value of γ_0 such that the test in (9) is still rejected. We also compute the p-value and the bootstrap histogram of the test statistic for the ER and CL null models (with B = 1,000). Considering both null hypotheses together allows us to gain a richer understanding of the network. We compare the proposed method to the p-value from the adjusted Spectral method (Bickel and Sarkar, 2016). See Table 1 for numerical results and Figure 2 for histograms from the bootstrap method. While Table 1 provides a succinct summary, the plots in Figure 2 provide more detail: the observed test statistic computed from the data set is plotted as a vertical line along with histograms representing the bootstrap distributions from the various benchmark models. These simple but informative plots give practitioners a reference for how the observed community structure compares to the range of community structure in various benchmarks.
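Plots in the style of Figure 2 can be produced directly from the bootstrap draws; a small sketch (matplotlib-based, illustrative only, with bootstrap_er_test denoting a hypothetical ER analogue of the CL routine sketched earlier):

```python
import matplotlib.pyplot as plt

def bootstrap_histogram(T_obs, boot_draws, labels):
    """Plot bootstrap null distributions of T(A) next to the observed statistic (cf. Figure 2)."""
    fig, ax = plt.subplots()
    for draws, lab in zip(boot_draws, labels):
        ax.hist(draws, bins=30, alpha=0.5, density=True, label=lab)
    ax.axvline(T_obs, color="black", label="observed T(A)")   # observed test statistic
    ax.set_xlabel("E2D2 statistic")
    ax.legend()
    return fig

# usage sketch:
# p_er, T_obs, boot_er = bootstrap_er_test(A, K, B=1000)   # hypothetical ER analogue
# p_cl, _, boot_cl = bootstrap_cl_test(A, K, B=1000)
# bootstrap_histogram(T_obs, [boot_er, boot_cl], ["ER null", "CL null"])
```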
For the DBLP network, since we only selected two research areas, we set K = 2 to find T(A). The observed value T(A) = 0.75 is quite large and, given the network's size and density, the baseline-value test rejects for all γ_0 up to 0.73. This means that if P generated this network, then we can assert that γ(P) ≥ 0.73. Additionally, all p-values for the model-based tests are effectively zero and the observed test statistic lies in the far right tail of both bootstrap null distributions. Since both tests are rejected, it is very unlikely that the network was generated from an ER or CL model, which implies that there is highly significant community structure in this network. This finding accords with the existing literature (e.g., Sengupta and Chen, 2018).
For the hospital network, the test of H_0: γ_0 = 0 is rejected, as are both tests with the ER null. This is a sensible result since we showed that γ_0 = 0 is equivalent to testing against the ER null. The p-value for the CL null, however, is not significant at the α = 0.05 level, meaning that this network does not have more community structure than we would expect to arise from a CL network by chance. So while there is strong evidence that the network diverges from an ER model, these findings indicate that the small ER-null p-values may be due to degree heterogeneity rather than community structure. Indeed, the histogram shows that the test statistic is very unlikely to have been drawn from the ER distribution, but is reasonably likely to have come from the CL distribution, which has a greater mean and variance. Using an ER null alone would have led to the conclusion that there is community structure in this network. By using multiple nulls together with the proposed method, however, we gain a fuller understanding of the network and conclude that degree heterogeneity may be masquerading as community structure.

Discussion
In this work, we proposed two methods to test for community structure in networks. These tests are rooted in a formal and general definition of the E2D2 parameter. This metric is simple, flexible, and well-connected to the conceptual notion of community structure, which we argue makes it a more principled approach. In fact, the test statistic can even be used as a descriptive statistic to quantify the strength of community structure in a network. Existing methods are based on specific random graph models, such as the ER model, which are implicitly presumed to be the only models without community structure. While our second testing approach fits into this framework as well, the general nature of the E2D2 parameter means that we can test against nearly any null model (subject to conditional edge independence) to obtain a richer set of practical insights than existing methods provide. Given a network, we recommend that practitioners first carry out the test against the ER null. If this test is rejected, further tests should be carried out to check whether the rejection could be due to some other network feature, like degree heterogeneity. Thus, the method not only helps decide whether the network appears to exhibit community structure but also helps identify the source of this ostensible community structure.
There are several interesting directions for future research. First, the proposed E2D2 parameter and bootstrap testing framework could be adapted for sequential testing. In Ghosh and Barnett (2023), the authors propose a general framework for sequentially testing H_0: K communities vs. H_1: K + 1 communities in the network. Using the proposed parameter, we would first find T(A) with K = 2 and generate bootstrap samples with K = 1 (the ER model). If this test is rejected, we would find T(A) with K = 3 and generate bootstrap samples from an SBM with K = 2 communities; note that scaling the E2D2 parameter by K ensures a fair comparison of the metric across different values of K. This process continues until we fail to reject the null hypothesis. Since our proposed bootstrap procedure yields a valid p-value, we can directly apply the results in Ghosh and Barnett (2023) to ensure a pre-specified error tolerance.
Additionally, the proposed E2D2 parameter is currently limited to quantifying assortative community structure; extending the method to handle disassortative and/or bipartite networks would be an interesting contribution. Next, while the asymptotic test scales readily to large networks, the bootstrap test is limited to networks of up to (roughly) n = 10,000 nodes for computational feasibility. There are also open theoretical questions, including a more precise asymptotic cutoff that accounts for the correlation between the random variables, as well as bootstrap theory for the maximum test statistic. Moreover, the ideas from this work could be extended to test for other network properties, such as core-periphery structure (Borgatti and Everett, 2000).

Supplemental Materials (excerpts)

From the proof of the upper bound on the E2D2 parameter, we have that m_in,i = O(n²) and m_out,i = O(n²). Under the null hypothesis, E(U_i) ≤ ξ_0, which yields a bound on P(U* > ξ_0 + k) for a suitable k; we also have that P(S < p̄/(1 + ε)) vanishes, since n^{1/2} p̄ → ∞. Combining these two results, P(T* > c) ≤ P(U* > ξ_0 + k) + P(S < p̄/(1 + ε)) ≤ η, as we hoped to show.
Under H_1, let γ_1 = ξ_1/p̄ and let T_oracle = T(c_γ, A) = U_oracle/S, where c_γ = arg max_c{γ(c, P)}, i.e., c_γ is the community assignment that maximizes the E2D2 parameter. This is reasonable because we assume that the algorithm finds the global maximum T(A), so T_oracle ≤ T(A). We then use an approach similar to the proof under H_0.

For the first part, it suffices to show that if γ(c, P) < 0 for some c, then there exists some c′ such that γ(c′, P) > 0. If there exists some c such that γ(c, P) < 0, then there is some P_ij such that P_ij ≥ P_kl for all pairs (k, l) with k ≠ i or l ≠ j, with strict inequality for at least one pair. Thus, if we consider the community assignment c′ in which nodes i and j form one community and all other nodes form the other, then p̄_in(c′) > p̄_out(c′) and thus γ(c′, P) > 0.
We prove the second part by induction. Let n = 3 and suppose γ(c, P) = 0 for all c. Writing out the probability matrix and the three possible two-community assignments gives P_12 = (P_13 + P_23)/2, P_13 = (P_12 + P_23)/2, and P_23 = (P_12 + P_13)/2. Plugging the first equation into the second, we find 2 P_13 = (1/2)(P_13 + P_23) + P_23, which implies P_13 = P_23.
Plugging this into the first equation, we have P_12 = P_13 = P_23 := p, which means that this must be an ER model. Now assume that the claim holds for n − 1 and show that it holds for n. For convenience, assume n is even; the proof extends easily to odd n. Consider a network with n nodes such that γ(c, P) = 0 for all c. Remove an arbitrary node so that we have a network with n − 1 nodes, and apply the induction hypothesis, i.e., P_ij = p for all i, j. We now add the removed node back to the network, where it has probability P_{i,n} of an edge with node i for i = 1, . . . , n − 1. Since γ(c, P) = 0 for all c, we want to show that P_{i,n} = p for i = 1, . . . , n − 1. Assume for contradiction that P is not ER; we will show that γ(c, P) ≠ 0 for some c. Without loss of generality, let {P_{1n}, . . . , P_{n/2,n}} be the smaller values of the last column and {P_{n/2+1,n}, . . . , P_{n−1,n}} be the larger values, and consider the community assignment in which nodes {1, . . . , n/2} form one community and nodes {n/2 + 1, . . . , n} form the other. Comparing the resulting intra- and inter-community densities and applying Jensen's inequality then yields γ(c, P) ≠ 0, a contradiction.

Figure 2: Histograms of bootstrap samples from the proposed method for the two real data sets. The orange histogram is for the Erdős-Rényi null and the blue histogram is for the Chung-Lu null. The vertical black line indicates the value of the observed test statistic.