Almost exact recovery in noisy semi-supervised learning

Published online by Cambridge University Press:  11 November 2024

Konstantin Avrachenkov
Affiliation:
Inria Sophia Antipolis, 2004 Rte des Lucioles, Valbonne, France
Maximilien Dreveton*
Affiliation:
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
*Corresponding author: Maximilien Dreveton; Email: maximilien.dreveton@epfl.ch

Abstract

Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the maximum a posteriori (MAP) estimator for clustering a degree-corrected stochastic block model when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP estimator, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.
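
To make the setting concrete, the following minimal sketch simulates the kind of data the abstract describes: a two-community degree-corrected stochastic block model together with a noisy oracle. It is not the authors' code and does not implement their algorithm. The function names `sample_dcsbm` and `noisy_oracle`, the encoding of communities as ±1 and of unlabeled nodes as 0, and the choice of a fixed number of labeled nodes are all assumptions made for illustration.

```python
import numpy as np


def sample_dcsbm(n, p_in, p_out, theta=None, rng=None):
    """Sample a two-community DC-SBM (illustrative sketch, not the paper's code).

    Edge (i, j) is present with probability theta_i * theta_j * p, clipped to
    [0, 1], where p = p_in inside a community and p = p_out across communities.
    With theta = None all weights equal 1, which gives a plain SBM.
    """
    rng = np.random.default_rng(rng)
    z = rng.choice([-1, 1], size=n)  # hidden community labels (assumed +-1)
    theta = np.ones(n) if theta is None else np.asarray(theta, dtype=float)
    same = z[:, None] == z[None, :]
    probs = np.clip(np.where(same, p_in, p_out) * np.outer(theta, theta), 0.0, 1.0)
    # Draw the strict upper triangle only, then symmetrize: no self-loops.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adjacency = (upper | upper.T).astype(int)
    return adjacency, z


def noisy_oracle(z, n_labeled, noise, rng=None):
    """Reveal n_labeled labels, each flipped independently with probability noise.

    Returns a vector with entries in {-1, 0, 1}; 0 marks an unlabeled node
    (an encoding assumed here for illustration).
    """
    rng = np.random.default_rng(rng)
    s = np.zeros(len(z), dtype=int)
    idx = rng.choice(len(z), size=n_labeled, replace=False)
    flip = rng.random(n_labeled) < noise
    s[idx] = np.where(flip, -z[idx], z[idx])
    return s


if __name__ == "__main__":
    # Graph parameters borrowed from the synthetic experiments; the 10% noise
    # level here is only an example value.
    A, z = sample_dcsbm(n=2000, p_in=0.04, p_out=0.02, rng=0)
    s = noisy_oracle(z, n_labeled=40, noise=0.1, rng=1)
    labeled = s != 0
    print("edges:", A.sum() // 2)
    print("oracle errors:", int((s[labeled] != z[labeled]).sum()))
```

A semi-supervised method would then receive the adjacency matrix and the oracle vector as input, and its accuracy would be measured against the hidden labels on the nodes the oracle left unlabeled.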

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press.

Algorithm 1. Semi-supervised learning with regularized adjacency matrix.

Figure 1. Cost in Algorithm 1 with the standard and normalized versions of the constraint, on 50 realizations of SBM with $n = 500, p_{\rm out} = 0.03$ and 50 labeled nodes with $10\%$ noise.

Figure 2. Average accuracy obtained by different semi-supervised clustering methods on DC-SBM graphs, with $n = 2000,\ p_{\rm in} = 0.04$, and $ p_{\rm out} = 0.02$ with different distributions for θ. The number of labeled nodes is equal to 40. Accuracies are computed on the unlabeled nodes, and are averaged over 100 realizations; the error bars show the standard error.

Figure 3. Average accuracy obtained on a subset of the MNIST data set by different semi-supervised algorithms as a function of the oracle-misclassification ratio, when the number of labeled nodes is equal to 10. Accuracy is averaged over 100 random realizations, and the error bars show the standard error.

Figure 4. Average accuracy obtained on the unlabeled nodes, on the nodes correctly labeled by the oracle, and on the nodes wrongly labeled by the oracle. Simulations are done on the 1,000 digits (2,4). The noisy oracle correctly classifies 24 nodes and misclassifies 16 nodes, and the boxplots show 100 realizations.

Figure 5. Average accuracy obtained on real networks by different semi-supervised algorithms as a function of the oracle-misclassification ratio. The number of labeled nodes is 30 for Political Blogs and LiveJournal, and 100 for DBLP. Accuracy is averaged over 50 random realizations, and the error bars show the standard error.

Table 1. Parameters of the real data sets. $n_1$ (resp., $n_2$) corresponds to the size of the first (resp., second) cluster, and $|E|$ is the number of edges of the network.