
Descriptive vs. Inferential Community Detection in Networks

Pitfalls, Myths and Half-Truths

Published online by Cambridge University Press:  04 July 2023

Tiago P. Peixoto
Affiliation:
Central European University, Vienna

Summary

Community detection is one of the most important methodological fields of network science, and one that has attracted a significant amount of attention over the past decades. This area deals with the automated division of a network into fundamental building blocks, with the objective of providing a summary of its large-scale structure. Despite its importance and widespread adoption, there is a noticeable gap between what is arguably the state of the art and the methods that are actually used in practice in a variety of fields. This Element attempts to address this discrepancy by dividing existing methods according to whether they have a 'descriptive' or an 'inferential' goal. While descriptive methods find patterns in networks based on context-dependent notions of community structure, inferential methods articulate a precise generative model and attempt to fit it to data. In this way, they are able to provide insights into formation mechanisms and to separate structure from noise. This title is also available as open access on Cambridge Core.


Figure 1 Difference between descriptive and inferential approaches to data analysis. As an analogy, panels (a) and (b) show two representations of the Cydonia Mensae region on Mars. Panel (a) is a descriptive account of what we see in the picture, namely a face. Panel (b) is an inferential representation of what lies behind it, namely a mountain (a more recent, higher-resolution image of the same region, standing in for an inferential interpretation of the image in panel (a)). More concretely for the problem of community detection, panels (c) and (d) show two representations of the same network. Panel (c) shows a descriptive division into 13 assortative communities. Panel (d) shows an inferential representation as a degree-constrained random network, with no communities, since this is a more likely model of how this network was formed (see Fig. 2).


Figure 2 Descriptive community detection finds a partition of the network according to an arbitrary criterion that in general bears no relation to the rules that were used to generate it. In (a) is shown the generative model we consider, where first a degree sequence is given to the nodes (forming “stubs”, or “half-edges”), which are then paired uniformly at random, forming a graph. In (b) is shown a realization of this model. The node colors show the partition found with virtually any descriptive community detection method. In (c) is shown another network sampled from the same model, together with the same partition found in (b), which is completely uncorrelated with the new apparent communities, since these are a mere byproduct of the random placement of the edges. An inferential approach would find only a single community in both (b) and (c), since no partition of the nodes is relevant to the underlying generative model.
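The stub-matching model of panel (a) can be sketched in a few lines of stdlib-only Python (a minimal illustration of the sampling step, not the exact code used to produce the figure):

```python
import random

def configuration_model(degrees, seed=0):
    """Sample a (multi)graph with the given degree sequence by pairing
    "stubs" (half-edges) uniformly at random, as in Fig. 2(a)."""
    rng = random.Random(seed)
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]  # one entry per half-edge
    rng.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))  # consecutive stubs become edges

edges = configuration_model([5] * 100, seed=42)
assert len(edges) == sum([5] * 100) // 2  # 250 edges
```

Any apparent communities in such a sample are a byproduct of the random stub pairing, which is precisely why a descriptive method will still report them.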


Figure 3 Inferential community detection considers a generative process (a), where the unobserved model parameters are sampled from prior distributions. In the case of the DC-SBM, these are the priors for the partition, P(b), the number of edges between groups, P(e|b), and the node degrees, P(k|e,b). Finally, the network itself is sampled from the model, P(A|k,e,b). The inference procedure (b) consists of inverting the generative process given an observed network A, corresponding to a posterior distribution P(b|A), which can then be summarized by the marginal probability that a node belongs to a given group (represented as pie charts on the nodes).
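The forward (generative) direction P(A|b) can be illustrated with a bare-bones, non-degree-corrected SBM sampler; the two probabilities `p_in` and `p_out` below are a simplifying assumption standing in for the full prior hierarchy of the DC-SBM:

```python
import itertools
import random

def sample_sbm(b, p_in, p_out, seed=0):
    """Sample a graph where nodes i, j are connected with probability
    p_in if b[i] == b[j], and p_out otherwise."""
    rng = random.Random(seed)
    edges = []
    for i, j in itertools.combinations(range(len(b)), 2):
        if rng.random() < (p_in if b[i] == b[j] else p_out):
            edges.append((i, j))
    return edges

b = [0] * 20 + [1] * 20                          # planted partition, two groups
edges = sample_sbm(b, p_in=0.5, p_out=0.02, seed=1)
internal = sum(b[i] == b[j] for i, j in edges)   # edges inside groups dominate
```

Inference runs this process in reverse: given only `edges`, it seeks a posterior distribution over the hidden labels `b`.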


Figure 4 Inferential community detection aims to find a partition of the network according to the fit of a generative model that can explain its structure. In (a) is shown a network sampled from a stochastic block model (SBM) with 6 groups, where the group assignments were hidden from view. The node colors show the groups found via Bayesian inference of the SBM. In (b) is shown another network sampled from the same SBM, together with the same partition found in (a), showing that it carries substantial explanatory power — very differently from the example in Fig. 2(c).


Figure 5 Compression points towards the true model. (a) Protein-protein interaction network for the organism Meleagris gallopavo [22]. The node colors indicate the best partition found with the DC-SBM/TC [23] (there are more groups than colors, so some colors are repeated), and the edge colors indicate whether they are attributed to triadic closure (red) or the DC-SBM (black). (b) Description length values according to different models. The unknown true model must yield a description length value smaller than the DC-SBM/TC, and no other model should be able to provide a superior compression that is statistically significant.


Figure 6 Modularity maximization systematically overfits, finding spurious structures even in its own null model. In this example we consider a random network model with N = 10³ nodes, with every node having degree 5. (a) Distribution of modularity values for a partition into 15 groups chosen at random, and for the optimized value of modularity, for 5000 networks sampled from the same model. (b) Adjacency matrix of a sample from the model, with the nodes ordered according to a random partition. (c) Same as (b), but with the nodes ordered according to the partition that maximizes modularity.


Figure 7 The resolution limit of modularity maximization prevents small communities from being identified, even if there is sufficient statistical evidence to support them. Panel (a) shows a network with B=30 communities sampled from an assortative SBM parametrization. The colors indicate the 18 communities found with modularity maximization, where several pairs of true communities are merged together. Panel (b) shows the inference result of an assortative SBM [24], recovering the true communities with perfect accuracy. Panels (c) and (d) show the results for a similar model where a larger community has been introduced. In (c) we see the results of modularity maximization, which not only merges the smaller communities together, but also splits the larger community into several spurious ones — thus both underfitting and overfitting different parts of the network at the same time. In (d) we see the result obtained by inferring the SBM, which once again finds the correct answer.


Figure 8 Modularity maximization incorporates a substantial amount of noise into its results. (a) Number of groups found using modularity maximization for 263 empirical networks as a function of the number of edges. The dashed line corresponds to the √(2E) upper bound due to the resolution limit. (b) The same as in (a), but with randomized versions of each network. (c) Correspondence between the number of groups of the original and randomized network. The dashed line shows the diagonal.


Figure 9 Using modularity maximization is equivalent to performing a maximum likelihood estimate of the DC-PP model with very specific parameter choices, which depend on the number of edges E in the network and the number of communities B. In (a) we show the valid choices of ω_in and ω_out, obtained when the solid and dashed lines cross, corresponding respectively to Eqs. 26 and 28, where we can see that for B=2 no solution is possible where the expected modularity is positive. In (b) we show the two possible values of the expected modularity that are consistent with the implicit model assumptions, as a function of the number of groups.


Figure 10 Generalized modularity and the DC-PP model are only equivalent if the symmetry of Eq. 35 is preserved. Here we consider an instance of the DC-PP model with ω_in = 2Ec/N, ω_out = 2E(1−c)/∑_{r≠s} n_r n_s, and θ_i = 1/n_{b_i}, where n_r is the number of nodes in group r. The parameter c ∈ [0,1] controls the degree of assortativity. For non-uniform group sizes, the symmetry of Eq. 35 is not preserved with this choice of parameters. We use the parametrization n_r = Nα^(r−1)(1−α)/(1−α^B), where α > 0 controls the group size heterogeneity. When employing generalized modularity, we choose the closest possible parameter choice, with ω_in = 2Ec/(∑_r e_r²/2E) and ω_out = 2E(1−c)/(2E − ∑_r e_r²/2E), where e_r = ∑_i k_i δ_{b_i,r}. In (a) we show the inference results for the uniform case with α → 1, where both approaches are identical, performing equally well all the way down to the detectability threshold [6] (vertical line). In (b) we show the result with α = 2, which leads to unequal group sizes, causing the behavior of the two approaches to diverge. In all cases we consider averages over 5 networks with N = 10⁴ nodes, average degree 2E/N = 3, and B = 10 groups.
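The group-size parametrization above is a geometric sequence whose normalization guarantees the sizes sum to N; a quick check, using the caption's values as assumed inputs:

```python
def group_sizes(N, B, alpha):
    """n_r = N * alpha**(r-1) * (1 - alpha) / (1 - alpha**B), for r = 1..B."""
    return [N * alpha ** (r - 1) * (1 - alpha) / (1 - alpha ** B)
            for r in range(1, B + 1)]

sizes = group_sizes(N=10**4, B=10, alpha=2.0)
assert abs(sum(sizes) - 10**4) < 1e-6   # geometric series sums exactly to N
assert sizes[-1] > sizes[0]             # alpha > 1 concentrates nodes in later groups
```

With α → 1 the sizes become uniform (n_r → N/B), recovering the case of panel (a); α = 2 produces the unequal group sizes of panel (b).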


Figure 11 Consensus clustering of a maximally random network, sampled from the Erdős–Rényi model, combining 10⁵ solutions of the maximum modularity method. On each node there is a pie chart describing the frequencies with which it was observed in a given community, obtained using the approach described in Ref. [26]. Despite the lack of latent communities, there is substantial agreement between the different answers.
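The agreement being summarized is pairwise co-occurrence across the sampled partitions; a simplified version of this aggregation (a sketch, not the full alignment method of Ref. [26]) might look like:

```python
from itertools import combinations

def cooccurrence(partitions):
    """Fraction of partitions in which each pair of nodes shares a group."""
    N = len(partitions[0])
    M = len(partitions)
    return {(i, j): sum(b[i] == b[j] for b in partitions) / M
            for i, j in combinations(range(N), 2)}

# three noisy partitions of four nodes (illustrative inputs)
parts = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
freq = cooccurrence(parts)
# freq[(0, 1)] == 2/3 and freq[(2, 3)] == 2/3
```

For a maximally random network these frequencies can nonetheless be far from uniform, which is the spurious consensus the figure illustrates.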


Figure 12 The statistical significance of the maximum modularity value is not informative of the significance of the community structure. In (a) we show the distribution of optimized values of modularity for networks sampled from the Erdős-Rényi (ER) model with the same number of nodes and edges as the network shown in (b) and (c). The vertical line shows the value obtained for the partition shown in (b), indicating that the network is very unlikely to have been sampled from the ER model (P=0.002). However, what sets this network apart from typical samples is the existence of a small clique of six nodes that would not occur in the ER model. The remaining communities found in (b) are entirely meaningless. In (c) we show the result of inferring the SBM on this network, which perfectly identifies the planted clique without overfitting the rest of the network.


Figure 13 Modularity maximization imposes characteristic community sizes in a manner that hides heterogeneity. Panel (a) shows the overlap between the true and obtained partition for the network described in Fig. 7, as a function of the resolution parameter γ. Panels (b) to (e) show the partitions found for different values of γ, where we see that as smaller groups are uncovered, bigger ones are spuriously split. The result is that no value of γ allows the true communities to be uncovered.


Figure 14 The NFL theorem involves predominantly instances of the community detection problem that are strictly incompressible, i.e. the true partitions cannot be used to explain the network. In (a) we show a typical sample of the uniform problem space given by Eq. 47, for N=100 nodes, which yields a dense maximally random network, randomly divided into B=72 groups. It is asymptotically impossible to use this partition to compress this network into fewer than Σ_min(A|b) = (N choose 2) = 4950 bits, and therefore the partition is not learnable from the network alone with any inferential algorithm. We also show the description length of the SBM conditioned on the true partition, Σ_SBM(A|b), as a reference. In (b) we show an example of a community detection problem that is solvable, at least in principle, since Σ_SBM(A|b) < Σ_min(A|b). In this case, the partition can be used to inform the network structure, and potentially vice versa. This class of problem instance has a negligible contribution to the sum in the NFL theorem in Eq. 45, since it occurs only with an extremely small probability when sampled from the uniform model of Eq. 47. It is therefore more reasonable to state that the network in example (b) has an actual community structure, while the one in (a) does not.
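The incompressibility bound quoted above amounts to one bit per node pair; for N = 100 it evaluates as follows (a trivial check of the figure in the caption):

```python
from math import comb

N = 100
# One bit per possible edge position in a dense maximally random graph
sigma_min = comb(N, 2)
assert sigma_min == 4950
```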


Figure 15 A common interpretation of the NFL theorem for community detection is that it reveals a necessary trade-off between algorithms: since they all have the same average performance, if one algorithm does better than another on one set of instances, it must do worse on an equal number of different instances, as depicted in panel (a). However, in the actual setting considered by the NFL theorem there is no meaningful trade-off: asymptotically, all algorithms perform maximally poorly for the vast majority of instances, as depicted in panel (b), since in these cases the network structure is uninformative of the partition. If we constrain ourselves to informative problem instances (which compose only an infinitesimal fraction of all instances), the NFL theorem is no longer applicable.


Figure 16 Inferential algorithms show competitive performance with descriptive ones. In panel (a) is shown the run-time of the Leiden algorithm [117] and the agglomerative MCMC [118] for modularity, and three SBM parametrizations: planted partition (PP), degree-corrected SBM, and nested degree-corrected SBM (NSBM), for 38 empirical networks [65]. All experiments were done on a laptop with an i9-9980HK Intel CPU, and averaged over at least 10 realizations. The dashed line shows an O(E log² E) scaling. In (b) are shown the same run times, but relative to the Leiden algorithm. The horizontal dashed lines show the median values.


Figure 17 Comparison of run times between MCMC and BP on a laptop with an i9-9980HK Intel CPU, for a network of flights between airports, with N=3188 nodes and E=18833 edges. We used the agglomerative algorithm of Ref. [118], and initialized BP with the model parameters found with MCMC. The dashed line shows a B² slope.


Figure 18 Comparison of run times between MCMC and spectral clustering using the Laplacian matrix, on a laptop with an i9-9980HK Intel CPU, for the Anybeat social network [126], with N=12645 vertices and E=49132 edges. We used the agglomerative algorithm of Ref. [118] and the ARPACK eigenvector solver [127].

Online ISBN: 9781009118897