The PageRank algorithm analysed by EJAM

In this post, Lorenz Kuger analyses the PageRank algorithm, following research published in the European Journal of Applied Mathematics (EJAM).

Google’s PageRank algorithm is one of the most crucial factors in determining a website’s search engine ranking. The algorithm was developed by Google co-founder Larry Page, and first introduced in 1998. It is based on the idea that a website’s importance can be measured by the number of other websites that link to it. The more a site is linked to, the more important it is deemed to be, and therefore the higher its ranking will be in search results.

The purpose of the PageRank algorithm

The PageRank algorithm has been used not only by Google, but also by other search engines such as Bing and Yahoo!. In addition, many other websites and applications make use of this algorithm to determine rankings or scores for their own purposes. For example, Wikipedia uses PageRank algorithms to determine which articles are most relevant for certain keyword searches.

A mathematical analysis

Mathematically, the algorithm computes the PageRank vector, which can be interpreted as the invariant distribution of a Markov chain on the different sites. This Markov chain can be described by a random surfer model. If the random surfer is currently at, say, site A, and site A links to sites B and C, then in the next step the surfer follows one of the links to B or C with a certain probability, and is otherwise teleported to a site drawn from a teleportation distribution. Computing the invariant distribution reduces to applying the power method to the Markov chain’s transition matrix. Both the convergence speed and the invariant distribution depend heavily on the teleportation probability and on the teleportation distribution.
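The random surfer model and the power method can be sketched in a few lines of Python. This is a minimal illustration rather than Google's implementation: the function name `pagerank_power`, the link-following probability `alpha = 0.85`, the uniform teleportation distribution `v`, and the treatment of sites without outgoing links are all assumptions made for the example.

```python
import numpy as np

def pagerank_power(adj, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    """Power iteration for the PageRank vector (illustrative sketch).

    adj[i, j] = 1 if site i links to site j.
    alpha is the probability of following a link; 1 - alpha is the
    teleportation probability. v is the teleportation distribution
    (uniform by default). All of these names are hypothetical.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    v = np.full(n, 1.0 / n) if v is None else np.asarray(v, dtype=float)

    out_deg = adj.sum(axis=1)
    dangling = out_deg == 0
    # Row-stochastic transition matrix of the link-following step.
    # Assumption: a site with no outgoing links jumps according to v.
    safe_deg = np.where(dangling, 1.0, out_deg)
    P = np.where(dangling[:, None], v[None, :], adj / safe_deg[:, None])

    r = v.copy()
    for _ in range(max_iter):
        # One step of the surfer: follow a link with probability alpha,
        # otherwise teleport according to v.
        r_next = alpha * (P.T @ r) + (1.0 - alpha) * v
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Sites A, B, C (indices 0, 1, 2): A links to B and C, B links to C, C links to A.
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]])
ranks = pagerank_power(adj)
print(ranks)
```

In this toy graph, site C receives links from both A and B and ends up with the largest score. The error of the iteration contracts by roughly a factor `alpha` per step, which is one way to see why the teleportation probability governs the convergence speed.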

A graph-based method

The PageRank algorithm is a graph-based method, since it is natural to model the different websites (nodes) and the links (edges) between them as a directed graph. As such, it can be analysed using established tools from other graph models. In the last few years, there have been considerable advances in the understanding of a variety of graph-based methods due to their numerous applications in unsupervised and semi-supervised machine learning tasks. One typical example is Laplacian regularization in semi-supervised learning, a method that propagates labels from a few labelled data points on a graph to many unlabelled ones by minimizing a Dirichlet energy. The method gets its name from the unnormalized graph Laplacian, since solving the energy minimization task is equivalent to solving an operator equation featuring the graph Laplacian.
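As a minimal sketch of Laplacian regularization on an undirected graph, the following Python snippet fixes the values at the labelled nodes and solves the linear system with the unnormalized graph Laplacian for the remaining ones. The function name `laplace_learning` and the small path graph are assumptions chosen for illustration only.

```python
import numpy as np

def laplace_learning(W, labelled_idx, labels):
    """Propagate labels by minimizing the Dirichlet energy
    0.5 * sum_ij W[i, j] * (u[i] - u[j])**2 with u fixed on labelled nodes.

    W is a symmetric weight matrix of an undirected, connected graph.
    The function name and interface are hypothetical.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian D - W

    labelled_idx = np.asarray(labelled_idx)
    unlabelled_idx = np.setdiff1d(np.arange(n), labelled_idx)

    u = np.zeros(n)
    u[labelled_idx] = labels

    # The minimizer satisfies (L u)_i = 0 at every unlabelled node i,
    # i.e. L_uu u_u = -L_ul u_l.
    L_uu = L[np.ix_(unlabelled_idx, unlabelled_idx)]
    L_ul = L[np.ix_(unlabelled_idx, labelled_idx)]
    u[unlabelled_idx] = np.linalg.solve(L_uu, -L_ul @ u[labelled_idx])
    return u

# Path graph 0 - 1 - 2 - 3 - 4 with labels 0 and 1 at the two end points.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0

u = laplace_learning(W, labelled_idx=[0, 4], labels=[0.0, 1.0])
print(u)
```

On the path graph the solution interpolates linearly between the two labelled end points, which is the discrete analogue of a harmonic function on an interval.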

Theoretical interest

The difficulty with the PageRank algorithm is that the corresponding graph is directed rather than undirected: the fact that site A links to site B does not necessarily mean that site B also links to site A. However, much of the theory on graph-based models in the literature only works on undirected graphs, and the behaviour of methods on directed graphs is less well understood. A central point of theoretical interest is the asymptotic setting in which the number of nodes grows to infinity, i.e., the discrete to continuum limit. This is the gap that a new article by A. Yuan, J. Calder and B. Osting addresses. Their manuscript “A continuum limit for the PageRank algorithm” was recently published in the European Journal of Applied Mathematics [1]. The main contributions of their work are a model for directed graphs, which allows analysis in the asymptotic regime, and discrete to continuum consistency results.

“A continuum limit for the PageRank algorithm”

“A continuum limit for the PageRank algorithm” can be read for free in EJAM

The entry point to the authors’ analysis in “A continuum limit for the PageRank algorithm” is the observation that a little reformulation of the PageRank problem gives an operator equation involving something that resembles a graph Laplacian. Indeed, this ‘PageRank operator’ turns out to be a generalization of the random walk graph Laplacian, a re-weighted version of the unnormalized graph Laplacian.
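The kind of reformulation meant here can be illustrated in standard PageRank notation. The symbols are assumptions made for this sketch and need not match the paper's: $P$ is the row-stochastic transition matrix of the link-following step, $\alpha$ the probability of following a link, $v$ the teleportation distribution and $r$ the PageRank vector.

$$
r = \alpha P^\top r + (1-\alpha)\, v
\quad\Longleftrightarrow\quad
\alpha\,(I - P^\top)\, r + (1-\alpha)\, r = (1-\alpha)\, v .
$$

If $P = D^{-1} W$ for a weight matrix $W$ with degree matrix $D$, then $I - P$ is the random walk graph Laplacian. In this notation, the operator acting on $r$ is therefore its transpose plus a zeroth-order term coming from teleportation.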

Due to their ubiquity in graph-based methods, the continuum limits of graph Laplacians are well understood: as the number of nodes approaches infinity, they converge towards weighted Laplace-Beltrami operators. This suggests that, with a suitable model for the directed graphs, the continuum limit could be something similar, i.e., a PDE defined by an operator which coincides with the weighted Laplace-Beltrami operator in the special case of undirected graphs. This is precisely the main result of the paper.

The results of the paper

The authors first introduce a model for random geometric graphs that allows directional edges and coincides with existing models for undirected graphs. They obtain a PDE which is pointwise consistent with the discrete PageRank operator in the limit. This second-order elliptic PDE mirrors the behaviour that is expected from the discrete setting and the corresponding random surfer model. An advection term, which would be zero for undirected graphs, steers the transport along the directional edges in the graph.

The random walk to linked nodes generates the second-order diffusion. Depending on the relation between the teleportation parameter, the connectivity of the graph and a parameter controlling the amount of directionality in the graph, the PDE is governed by the second-order diffusion term, by the first-order advection term, or by the reaction term due to teleportation alone, the last case making the PDE trivial. In the case of vanishing connectivity, the authors also prove consistency with the first-order PDE that includes only the advection term due to the directionality. An interesting consequence of the consistency result is the spatial Lipschitz regularity of the PageRank vector, implying that the ranking of data points that are close to each other does not vary rapidly.

In the context of the random surfer model, the authors also construct the corresponding time-dependent PDE. It describes the evolution of the distribution of the random surfer, starting from a sufficiently smooth initial distribution, in the spatially and temporally continuous setting. It amounts to a reaction-advection-diffusion PDE featuring the same terms as the elliptic PDE describing the invariant PageRank vector. The main analytic work is then devoted to proving not only pointwise, but also $L^\infty$ consistency results for each of the derived PDEs. The results are non-asymptotic and based on standard maximum principle arguments which have been used before in discrete to continuum results for graph Laplacians. Finally, the study is completed by numerical tests validating the proven convergence rates in the asymptotic regime.

You can read the article “A continuum limit for the PageRank algorithm” by A. Yuan, J. Calder and B. Osting for free through open access.

