Hostname: page-component-77f85d65b8-8wtlm Total loading time: 0 Render date: 2026-04-17T19:01:36.044Z Has data issue: false hasContentIssue false

Consensus embedding for multiple networks: Computation and applications

Published online by Cambridge University Press:  30 May 2022

Mengzhen Li*
Affiliation:
Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
Mustafa Coşkun
Affiliation:
Department of Computer Engineering, Abdullah Gül University, Kayseri, Turkey
Mehmet Koyutürk
Affiliation:
Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
*
Corresponding author: Mengzhen Li, email: mxl994@case.edu
Rights & Permissions [Opens in a new window]

Abstract

Machine learning applications on large-scale network-structured data commonly encode network information in the form of node embeddings. Network embedding algorithms map the nodes into a low-dimensional space such that the nodes that are “similar” with respect to network topology are also close to each other in the embedding space. Real-world networks often have multiple versions or can be “multiplex” with multiple types of edges with different semantics. For such networks, computation of Consensus Embeddings based on the node embeddings of individual versions can be useful for various reasons, including privacy, efficiency, and effectiveness of analyses. Here, we systematically investigate the performance of three dimensionality reduction methods in computing consensus embeddings on networks with multiple versions: singular value decomposition, variational auto-encoders, and canonical correlation analysis (CCA). Our results show that (i) CCA outperforms other dimensionality reduction methods in computing concensus embeddings, (ii) in the context of link prediction, consensus embeddings can be used to make predictions with accuracy close to that provided by embeddings of integrated networks, and (iii) consensus embeddings can be used to improve the efficiency of combinatorial link prediction queries on multiple networks by multiple orders of magnitude.

Information

Type
Research Article
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press
Figure 0

Figure 1. The framework for the computation of consensus embeddings. The k network versions represent networks with the same node set but different sets of edges. Our framework for the computation of consensus embeddings assumes that embeddings for each network were computed separately, possibly using embedding spaces with different number of dimensions. It then computes a d-dimensional consensus embedding that represents the superposition of the k versions, to be used for downstream analysis tasks.

Figure 1

Figure 2. Illustration of dimensionality reduction methods used to compute consensus embeddings.

Figure 2

Figure 3. The computation of consensus embeddings using Canonical Correlation Analysis (CCA). (a, b) Red and blue networks with identical sets of nodes labeled from 1 to 10. (c) The integrated network obtained by superposing the two networks. (d) 2D embeddings of the individual networks. Note that the embeddings are in different spaces and we arbitrarily match dimensions of embeddings to plot them in the same space. The colors and the numbers show the identity of the nodes and the network each node belongs to. (e) 2D embedding of the integrated network. (f) The embeddings of the two networks after space transformation via CCA. (g) The consensus embedding computed by taking the mean of the two transformed embeddings.

Figure 3

Figure 4. The correspondence between the dimensions of embeddings computed on two or multiple different networks. (a) The correlation matrix of the 16 embedding dimensions for the embeddings computed on $G_1$ and $G_2$ (Table 1). (b) The correlation matrix of the 16 embedding dimensions after transformation via cannonical correlation analysis (CCA). (c) The pairwise correlation matrices of the 16 embedding dimensions for the embeddings $G_1$, $G_2$, and $G_3$ before (top) and after (bottom) transformation via Generalized Canonical Correlation Analysis (GCCA).

Figure 4

Table 1. The description and size of the human protein–protein interaction (PPI) networks used in our experiments

Figure 5

Table 2. The description and size of the yeast protein–protein interaction (PPI) networks used in our experiments

Figure 6

Figure 5. Link prediction accuracy of consensus embeddings on pairs of networks with equal number of dimensions for individual node embeddings. All embeddings are computed using node2vec and results are displayed for (a) 32-dimensional and (b) 64-dimensional embeddings. Each matrix shows the link prediction accuracy, assessed in terms of area under ROC curve (AUC), of the respective embeddings for all $\binom{8}{2}=28$ pairs of networks. Integrated: The embedding of the network obtained by superposing the two networks. Baseline: The consensus embedding obtained by averaging the embeddings of the two networks. SVD, Autoencoder, CCA: The consensus embedding computed using each of the three dimensionality reduction methods we consider. Darker shade of blue indicates more accurate prediction performance.

Figure 7

Figure 6. Link prediction accuracy of consensus embeddings of pairs on networks with different number of dimensions in individual embeddings. Each beeswarm plots shows the distribution of area under ROC curve (AUC) for the link prediction performance of consensus embeddings computed using SVD (blue), Autoencoder (orange), CCA (green) across $8\times7=56$ (ordered) pairs of networks. Each panel shows a different number of dimensions for the individual embeddings of each network: (a) 16 vs. 32, (b) 16 vs. 64, (c) 16 vs. 128, (d) 32 vs. 64, (e) 32 vs. 128, (f) 64 vs. 128.

Figure 8

Figure 7. Accuracy of consensus embeddings in link prediction as a function of the number of networks that are integrated. The dataset of the experiments are the human PPI networks. Results are shown for Node2vec and Role2vec. For each point k on the x axis, each point in the plot shows the area under ROC curve (AUC) of link prediction for a specific combination of k network versions for $1\leq k \leq 8$. The lines show the average AUC across all combinations. The blue, yellow, red, and green points/lines, respectively, show the accuracy provided by the embeddings computed directly on the integrated network, consensus embeddings computed using SVD, consensus embeddings computed using variational autoencoder, consensus embeddings computed using GCCA, and the baseline consensus embedding (average of individual embeddings).

Figure 9

Figure 8. The link prediction accuracies as a function of dimensions for the human PPI data (table 1). Results are shown for Node2vec and Role2vec. The plot shows the link prediction AUC of the consensus embeddings of all the versions when the numbers of dimensions are 16, 32, 64, and 128.

Figure 10

Figure 9. The link prediction accuracies as a function of dimensions for the yeast PPI dataset (table 2). Results are shown for Node2vec and Role2vec. The plot shows the link prediction AUC of the consensus embeddings of all the versions when the numbers of dimensions are 16, 32, 64, and 128.

Figure 11

Figure 10. Runtime of combinatorial link prediction queries using consensus embeddings. The blue dots (for each combination)/curves (average of all combinations with the respective number of versions) show the query time corresponding to the “Network Integration at Query Time” approach described in Section 3.2, while the red and yellow dots/curves show the query time corresponding to the “Consensus Embedding at Query Time”. Results are shown for Node2vec and Role2vec, using four methods for computing consensus embeddings: SVD (yellow), Autoencoder (red), GCCA (green), and baseline concensus embedding (orange). The number of dimensions of the plots are 64.