A continuum limit for the PageRank algorithm

Semi-supervised and unsupervised machine learning methods often rely on graphs to model data, prompting research on how theoretical properties of operators on graphs are leveraged in learning problems. While most of the existing literature focuses on undirected graphs, directed graphs are very important in practice, giving models for physical, biological, or transportation networks, among many other applications. In this paper, we propose a new framework for rigorously studying continuum limits of learning algorithms on directed graphs. We use the new framework to study the PageRank algorithm, and show how it can be interpreted as a numerical scheme on a directed graph involving a type of normalized graph Laplacian. We show that the corresponding continuum limit problem, which is taken as the number of webpages grows to infinity, is a second-order, possibly degenerate, elliptic equation that contains reaction, diffusion, and advection terms. We prove that the numerical scheme is consistent and stable and compute explicit rates of convergence of the discrete solution to the solution of the continuum limit PDE. We give applications to proving stability and asymptotic regularity of the PageRank vector.


Introduction
Due to its versatility in modeling data, graphs are frequently leveraged for applications in machine learning and data science. A graph structure encodes interdependencies among constituents, such as social media users, images or videos in a collection, or physical or biological agents, and provides a convenient representation for high dimensional data. For example, in a graph representing research collaborations, we can represent each author as a node in the graph, and co-authorship is represented by edges between nodes, with edge weights depending on the frequency of co-authorship. The resulting graph is undirected as the edges are bi-directional. On the other hand, transportation and biological networks often result in directed graphs because the relationship between two nodes is ordered, such as the direction of a train route or a predator-prey relationship.
Graph-based methods are particularly prominent in unsupervised and semi-supervised machine learning tasks that seek to reveal structures and patterns in unlabeled data. For example, in semi-supervised classification, one has labels for a subset of the nodes in the graph, and the problem is to propagate the labels to the rest of the graph in a meaningful way. A widely used and very successful algorithm for semi-supervised classification is Laplacian semi-supervised learning, originally proposed in [59], which finds the unique graph harmonic function that extends the labels. There are many extensions and modifications of Laplacian regularization (see, e.g., [1,3,47,54,56,57]), with more recent methods drawing inspiration from partial differential equations (PDEs) [6,23]. For classification problems at very low labeling rates, p-Laplacian regularization has recently been introduced [20,22]. In unsupervised learning, graph-based algorithms are used in spectral clustering [39,41], Laplacian eigenmaps [2], diffusion maps [17], manifold ranking [32,33,[51][52][53]58], minimal surface graph partitioning [60], and PageRank [30].
Various types of graph Laplacians appear in nearly all graph-based learning algorithms, due to the ability of the graph Laplacian to uncover geometric structure in datasets. Graph Laplacians that are commonly used in practice include the unnormalized Laplacian Lu(x) = y∈X ω xy (u(y) − u(x)) the random walk Laplacian L rw u(x) = 1 d x y∈X ω xy (u(y) − u(x)), and the normalized Laplacian where X denotes the set of nodes in the graph, u : X → R, ω xy is the (undirected) edge weight between x and y, and d x = y∈X ω xy is the degree of node x. The unnormalized graph Laplacian appears naturally as the gradient of the Dirichlet energy The random walk Laplacian is exactly the generator for a random walk on X with probability d −1 x ω xy of stepping from x to y, and the normalized graph Laplacian is a convenient way to obtain a symmetric normalization of the graph Laplacian. While these normalizations are most frequently used in practice, many other choices are possible. For example, see [36] for an analysis of how the choice of normalization affects spectral clustering. We note that the random walk interpretation allows us to view methods like those studied in [59] as performing classification by randomly walking on the graph until hitting a labeled node. Intuitively, the random walk will naturally learn the structure of the unlabeled data by remaining within clusters of high density for long enough to hit a labeled point, before moving to a different cluster. While many classification algorithms seek graph harmonic functions, the spectrum of graph Laplacians is widely used to construct low dimensional embeddings of graphs.
The algorithms discussed above are mainly designed for symmetric graphs, where ω xy = ω yx . Perhaps one of the most widely known algorithms for directed graphs is the PageRank algorithm, which is used to evaluate the importance of nodes in a graph based on their link structure. While the algorithm is most famous for sorting Google search results up until the mid-2000s, variants of PageRank are used by other tech companies (for example, Twitter uses a reversed PageRank to identify influential, topicspecific accounts), and have been adapted to solve problems in neuroscience, genetics, recommender systems, etc.. [30]. The PageRank algorithm uses a random surfer model with teleportation probability α ∈ [0, 1] to rank pages. To describe the model, when the random surfer is at webpage x, she will with probability α teleport to a random webpage, and with probability 1 − α click on an outgoing link to another webpage. When the surfer clicks on an outgoing link, the link is selected at random and we denote by p xy the probability of clicking a link to website y from website x. When she randomly teleports, the next website is chosen at random from a teleportation probability distribution v. The inclusion of the teleportation step ensures the random surfer does not get stuck in disconnected components of the graph.
The PageRank vector is the invariant distribution of the resulting Markov chain, which measures the amount of time the random surfer spends on each webpage. Webpages that are visited more often by the surfer are ranked more highly, while websites that are rarely visited are ranked lower. Mathematically, the PageRank vector r is the (normalized) solution of the eigenvector problem where P = (p yx ) x,y∈X is the probability transition matrix described above, v is the teleportation probability distribution, and 1 is the column vector of all 1's. We note that by the Perron-Frobenius Theorem [30], the PageRank vector r can be chosen to have real-valued strictly positive entries. If we choose the normalization 1 T r = 1, so that r is a probability distribution, then the eigenvector problem (1.1) is equivalent to the linear system This formulation is more convenient, since the left hand side can be interpreted as a type of graph Laplacian.
The teleportation probability distribution v can be uniform over all webpages, or can be nonuniform. Indeed, by setting v(x) = δ x 0 (x) for a specific website x 0 leads to a localized PageRank algorithm that ranks sites nearby x 0 [30]. Computationally, the PageRank vector is obtained via the power method on (1.1), which converges at a rate of |λ 2 /λ 1 |, i.e., a ratio of the second eigenvalue to the leading eigenvalue of the matrix. In the case of PageRank, Haveliwala and Kamvar [31] show that λ 1 = 1 and λ 2 = 1 − α, so the convergence rate depends heavily on the choice of the teleportation parameter. Google takes 1 − α to be .85 [38]. There are also adaptations of semi-supervised learning to directed graphs (see [55]).
Due to the ubiquity of graph Laplacians in graph-based learning problems, much work has been devoted to understanding how the graph Laplacian is able to uncover geometric and distributional structure from unlabeled data. To do this, one usually assumes the graph is a random geometric graph with n points and length scale h > 0, and considers the limit as n → ∞ and h → 0. This means the nodes in the graph are an i.i.d. sample of size n from a density ρ supported on a d-dimensional manifold M embedded in R D , and the weights ω xy are defined by where Φ : [0, ∞) → [0, ∞) is nonincreasing and usually compactly supported. The first results to appear in the literature were pointwise consistency results, showing that a graph Laplacian L applied to a smooth test function ϕ ∈ C 3 (M) converges, as n → ∞ and h → 0 to a weighted version of the Laplace-Beltrami operator for various values of a, b, c that depend on the choice of normalization of the graph Laplacian. For example, for the unnormalized graph Laplacian, a = −1, b = 2, c = 0, and for the random walk Laplacian a = −2, b = 2, c = 0. If h → 0 and n → ∞ simultaneously, then the condition nh d+2 ≫ log n is required for pointwise consistency, which ensures there are enough neighbors of each data point to apply appropriate concentration of measure results. To obtain O(h) pointwise consistency rates, it is required that nh d+4 ≫ 1. We contrast this with the condition nh d ≫ log n required for graph connectivity. For pointwise consistency results of this flavor, see [4,7,34,35,37,45]. Pointwise consistency was extended to k-nearest neighbor graph constructions in [48], which includes some mildly directed graphs due to antisymmetries in the k-nearest neighbor relation.
While pointwise consistency results are informative, they do not prove that the solutions of graph-based problems converge to solutions of their counterparts as n → ∞ and h → 0. This question is more subtle and requires further analysis. The problem of spectral convergence of the graph Laplacian spectrum to that of the Laplace-Beltrami operator has been well-studied. Belkin and Niyogi [5] established L 2 spectral convergence (convergence of eigenvalues and L 2 convergence of eigenvectors) when ρ is the uniform distribution and this was extended to non-uniform distributions with partial convergence rates in [50]. Shi [42] proved convergence rates and extended the analysis to include manifolds with boundary. The L 2 convergence rate was improved recently in [25] using variational methods, and then further improved in [13] to agree with the pointwise consistency rate O(h), which is the sharpest known L 2 spectral convergence rate. The variational parts of the spectral convergence arguments in [13,25] were heavily influenced by earlier work in a non-probabilistic setting [8]. We also mention that very recent work has established the first L ∞ eigenvector convergence rates [19].
For problems in clustering and semi-supervised learning, recent work has begun to address convergence in the continuum using tools from the calculus of variations and viscosity solutions of PDEs. Trillos and Slepčev [49] developed a Gamma-convergence framework for proving discrete to continuum convergence of graph-based problems, and the framework has been applied to prove discrete to continuum consistency in many problems (see, e.g., [24,27,28,40,46,49] and the references therein). Discrete to continuum convergence can also be established with the maximum principle and the viscosity solution framework, as was established in [9,11] for the game-theoretic graph p-Laplacian and Lipschitz learning. The maximum principle can also be used to prove asymptotic Hölder regularity of solutions to graph-based learning problems, as was done in [9,14]. For the linear 2-graph Laplacian, [26] used the maximum principle to establish discrete to continuum convergence rates for regression problems, and [15] used the maximum principle in combination with random walk arguments to establish convergence rates for semi-supervised learning at low labeling rates. We also mention that [43] uses the maximum principle to prove convergence rates for a reweighted version of the graph Laplacian in low label rate semi-supervised learning context.
Despite the flurry of recent work on discrete to continuum consistency results, almost none of the results apply to problems on directed graphs, which are important and widely used in practice. The only results we are aware of for directed graphs are for k-nearest neighbor graphs [24,48], which are directed only due to the asymmetry of the k-nearest neighbor relation. Discrete to continuum results are important for providing insights and further understanding of algorithms. Furthermore, continuum limits allow us to prove stability of graph-based algorithms, showing that they are insensitive to the particular realization of the data, and often can lead to new formulations of learning problems founded on stronger theoretical principles. This paper aims to start filling this void by studying consistency results for problems on directed graphs. We propose the random directed geometric graph model, which extends the random geometric graph in a natural way by adding directionality in the weights. For concreteness, we study the PageRank problem, and prove that the PageRank vector converges in the large sample size limit to the solution of a continuum, possibly degenerate, elliptic PDE. Depending on the strength of the directionality in the weights, the continuum PDE can be a first order equation, which is a new type of result for consistency of graph Laplacians. Our main results are finite sample size error estimates with high probability, which imply convergence in the continuum, but are stronger in that they hold in the non-asymptotic regime. We use these results to prove stability of the PageRank problem, and we also study the time-dependent version of the problem, which examines the continuum limit of the distribution of the random surfer. Our proofs use pointwise consistency and the maximum principle, with appropriate adaptations to directed graphs.

Setup and main results
We now describe our setup and main results. Section 2.1 introduces our random directed geometric graph model, and Section 2.2 formulates the PageRank problem in a new way and gives our main results.
2.1. Random directed geometric graph. In order to study continuum limits for problems on directed graphs, we propose a new model for a random directed graph that we call a random directed geometric graph. Let x 1 , x 2 , . . . , x n be an i.i.d. sample of size n on the Torus T d = R d /Z d with a density function ρ : T d → [0, ∞). We define the weight ω xy from x to y by The parameter h > 0 is the bandwidth of the kernel, and ε > 0 is the strength of the directionality. The kernel function Φ is assumed to be nonnegative with compact support. When B = I and b = 0 or ε = 0, the weights are the same as those of a random geometric graph, which is symmetric. For other choices of B and b, the corresponding graph is directed, with directional influence along the vector field b. The matrix B can be viewed as changing the metric locally. Assume for the moment that Φ has compact support in [0, 2]. We observe that for fixed x, the support of the weight y → ω yx is the ellipse shaped region which depends on x. Figure 1 gives us a sense of E x in two dimensions. In the random walk (or random surfer) interpretation, the random walker moves from x to a point in the set E x , which contains a drift term εb(x) and an anisotropic diffusion governed by B.
In the following remarks, we provide an in-depth motivation for the weights in the context of ranking players in a sports tournament and modeling systematic distortion in the data [44].
Remark 2.1 (Motivation via ranking). Assume each team or player is represented by a feature vector x ∈ R d that adequately describes each player. When player x and player y play against each other, we write x ≻ y if x wins the game, and y ≻ x if y wins. We assume each time x and y play, x wins with probability P(x ≻ y) and y wins with probability P(y ≻ x) = 1 − P(x ≻ y).
Suppose that x and y play n games, and we assign an edge in our graph from x to y if y wins more than half of the games, and assign the edge from y to x if x wins more than half the games. That is, the edge is directed toward the "better" player. The weight on the edge is the excess number of wins for the winning player. In expectation, if P(x ≻ y) ≥ 1 2 , then we have an edge from y to x with weight If P(y ≻ x) ≥ 1 2 then we have an edge from x to y with weight Now, we make a modeling assumption on P(x ≻ y). We assume there is some underlying (unknown) ranking function leading to the weight when ϕ(x) ≥ ϕ(y). Using a Taylor expansion, we have an edge from y to x if If ϕ is convex, the set of x satisfying the inequality above lie in an ellipse. This gives some motivation for the directional preference b = ∇ϕ and for the elliptical shape governed by B = (∇ 2 ϕ) 1/2 as they occur in the definition of our weights. We restrict the weights locally to some ball B(y, 2h) based on the assumption that teams play against similarly ranked teams in a tournament.
Remark 2.2. The work in [44] constructs two operators that can identify the common structures and the differences, respectively, between two diffeomorphic Riemannian manifolds. An application to identifying signals from fetal electrocardiogram (EGC) data via observed maternal ECG data is considered. Their algorithm handles cases where there is a systematic diffusion in the observed vs. target data; our problem is "adjacent" in the sense that we model the diffusion and directional preferences via B(y) and b(y) in our setup.

2.2.
Main results. We now present our setup and main results. We take the random directed geometric graph model from Section 2.1 with B(x) ≡ I (though see Section A for a discussion of how the results change when B is not the identity). That is, let x 1 , x 2 , . . . , x n be an i.i.d. sample of size n on the torus T d with probability density ρ : T d → [ρ min , ∞), where ρ min > 0, and set X n = {x 1 , x 2 , . . . , x n }. We define the weight from x to y by and the degree of x by d n (x) = y∈Xn ω n (x, y). We assume the kernel Φ is smooth, nonnegative, nonincreasing, satisfies Φ(0) > 0 and As constructed, for example in [47], the probability of a random walk on the graph transitioning from x to y is p xy = d n (x) −1 ω n (x, y). Plugging this into (1.2) and denoting the PageRank vector by r n : X n → R, we find that r n satisfies the linear system where α ∈ (0, 1] is the teleportation probability and v(x) is the teleportation probability distribution. To simplify the problem, we consider the normalized PageRank vector The degree d n (x) is the most basic measure of the importance of a node in a graph, and the normalized PageRank vector factors out the direct dependence on the degree to give an understanding of the additional geometric structure uncovered by PageRank.
We easily see that the normalized PageRank vector u n : X n → R satisfies the equation We note that (2.7) is considerably simpler to analyze than (2.5) since the degree term d n (y) does not appear inside the summation. We rewrite this equation by defining the PageRank Operator Then (2.7) can be written as where γ = (1−α)/α. We note that when the graph is symmetric, the PageRank Operator is exactly the random walk graph Laplacian. The corresponding problem in the continuum is the, possibly degenerate, elliptic PDE We also denote Our first main result is the following continuum limit.
and η < 1. Let u n be the solution to the PageRank problem (2.9) and let u ∈ C 3 (T d ) be the solution to the PDE (2.10). Then there exists Remark 2.4. We remark that when γ h > 0 and ηγ ε < 1, it is a standard result in elliptic PDEs that (2.10) has a unique solution u ∈ C 3,α (T d ). We refer the reader to [21,29] for more details.
When γ h = 0 or γ h > 0 is small, the continuum PDE (2.10) is dominated by the first order terms, and is better approximated by the first order equation We state the first order continuum limit as a separate result.
. Let u n be the solution to the PageRank problem (2.9) and let u ∈ C 0,1 (T d ) be the viscosity solution of the PDE (2.14). Then there exists C 1 , C 2 , c 1 > 0 such that Remark 2.6. When ηγ ε < 1, it is a standard result in viscosity solution theory that (2.14) has a unique viscosity solution u ∈ C(T d ). We prove in Lemma 4.3 that when We refer the reader to [10,18] for more details on viscosity solutions. Remark 2.7. We can analyze the characteristics of the first order equation 2.14 to understand how the PageRank algorithm propagates information on the directed graph. The characteristic ODEs [21] are where x(s) is the projected characteristic curve, z(s) = u(x(s)), and p(s) = ∇u(x(s)). Hence, information is propagated along the integral curves of the vector field b, which represents the directional influence in the random directed geometric graph.
Remark 2.8. We note that the continuum PDE (2.10) has reaction, advection, and diffusion terms. The two reaction terms, u and ρ −1 v, are due to the teleportation step in PageRank. The term div (ρ 2 bu) is an advection term, which describes the advection of the quantity ρ 2 u along the vector field b, and is due to the directional preference in the definition of the weights in a random directed geometric graph. Finally, the weighted diffusion term div (ρ 2 ∇u) represents diffusion from the random walk step of PageRank. Remark 2.9. Theorems 2.3 and 2.5 can easily be rewritten in terms of the true PageRank vector r n (x). Due to Lemma 3.3 we have in the context of Theorem 2.3, and in the context of Theorem 2.5.
Remark 2.10. Theorems 2.3 and 2.5 are stated as finite sample size results, where n, ε, h, α, and λ are fixed. If we consider the continuum limit as n → ∞ and ε n , h n , α n , λ n → 0, then Theorems 2.3 and 2.5 inform us about how to relatively scale the parameters. We always assume ε n ≤ α n and h 2 n ≤ α n , so that γ εn , γ hn ≤ 1. Thus, provided that we may apply the Borel-Cantelli Lemma to conclude that the rates hold almost surely as n → ∞. The scaling in (2.17) is a standard scaling for pointwise consistency of graph Laplacians. In this case, we have convergence rates of O(λ n + ε n + h n ) in Theorem 2.3 and O( √ λ n + ε n + h n ) in Theorem 2.5 with probability one. Theorem 2.3 shows that if we scale ε n ∼ α n and h n ∼ √ α n , then the directional preference along the vector field b is balanced with the diffusion terms, and the limiting PDE (2.10) is second order. If ε n ≪ α n , then the directional preference is negligible in the limit, and the first order terms in (2.10) disappear in the limit. Theorem 2.5 shows that if h n ≪ √ α n , then the directional preference term dominates and the diffusion term is negligible, and the continuum PDE reduces to a first order equation (2.14).
Remark 2.11. The PageRank vector r n satisfies as can be easily checked by summing both sides of (2.5). This forces r n to be a probability distribution provided v is as well. The normalized PageRank vector u n satisfies In the continuum, the solution u of (2.10) or (2.14) satisfies the analogous continuum version which can be verified by multiplying both sides of (2.10) or (2.14) by ρ 2 and integrating by parts.
Remark 2.12. While the original PageRank problem is an eigenvector problem, the PageRank vector is an eigenvector of a probability transition matrix and not an eigenvector of a graph Laplacian. Thus, we cannot use the spectral properties of the graph Laplacian proven, for example, in [13,25,42] to address the eigenvalue problem (1.1). In fact, since the probability transition matrix becomes localized as h, ε → 0, we lose the interpretation of PageRank as an eigenvector problem in the continuum. The localization of the probability transition matrix is exactly what leads to a equation with a Laplacian in the continuum. A very simple analogue is the averaging operator A function u satisfying T ε u = u is an eigenfunction of T ε with eigenvalue λ = 1. In PDE-theory, the equation T ε u = u is called the mean-value property, and is satisfied by any harmonic function u. One can easily check that for any smooth function u, where C depends only on d. Hence, as ε → 0, eigenfunctions of T ε are expected to converge to harmonic functions, which are solutions of Laplace's equation ∆u = 0. The operator T ε localizes and becomes trivial as ε → 0, since it reduces to pointwise evaluation T 0 u(x) = u(x). Thus, there is no meaningful way to think of harmonic functions as eigenfunctions of T 0 . An analogous, but more complicated, phenomenon occurs with the PageRank problem, as it also becomes localized in the continuum limit.
As an immediate application of Theorems 2.3 and 2.5 we prove asymptotic Lipschitz regularity of the PageRank vector, which shows that the ranking does not vary rapidly in feature space.
for all x, y ∈ X n .
We estimate the first and third term with Theorem 2.3, while the second is estimated by Lipschitzness of u.
Remark 2.14. Corollary 2.13 proves that u n is approximately Lipschitz continuous, with jumps of size no larger than O(λ + ε + h). We note that an analogous result to Corollary 2.13 can be stated under the assumptions of Theorem 2.5 as well.
We conclude this section by presenting an analagous continuum limit result for the evolution of the probability distribution of the random surfer r k . Similar to Eq. (1.1), the probability distribution r k of the random surfer satisfies the evolution equation Since r k is a probability distribution, so 1 T r k = 1, we can also write the equation as As before, we denote by r n (x, k) the x-component of r k ; that is, r n (x, k) is the probability of finding the random surfer at vertex x after k steps on the random directed geometric graph of size n. Plugging this into (2.20) we find that r n (x, k) satisfies r n (x, k + 1) = (1 − α) y∈Xn ω n (y, x) d n (y) r n (y, k) + αv(x) for all x ∈ X n , (2. 21) where v(x) is the teleportation probability distribution. As before, we simplify the problem by defining the normalized distribution u n (x, k) := nh d dn(x) r n (x, k) and find that u n (x, k) satisfies For the initial condition we take u n (x, 0) = g(x) for some given smooth function g. We can think of (2.22) as a discrete heat equation on the graph, describing the evolution of the normalized distribution u n of the random surfer. The stationary point of the evolution, as k → ∞, is clearly the solution of the PageRank problem (2.9). The continuum version of (2.22) is the reaction-advection-diffusion equation This is verified by the following continuum limit result.
, and g ∈ C 3 (T d ) for any 0 < α < 1. Assume that γ ε ≤ 1, 0 < γ h ≤ 1, and η < 1. Let u n (x, k) be the solution of (2.22) satisfying u n (x, 0) = g(x), and let u ∈ C 3 (T d ) be the solution to the PDE (2.23). Then there exists C 1 , C 2 , c 1 , c 2 > 0 with C 1 depending on γ h > 0, such that when ε + h ≤ c 1 (1 − ηγ ε ) and 0 < λ ≤ 1, the event that holds for all k ≥ 0 has probability at least Theorem 2.15 shows that the distribution of the random surfer can be approximated by the continuum PDE (2.23). The error estimates depend on λ, ε and h in a similar way as in Theorem 2.3. The main difference is the appearance of the term kα, which corresponds to the time parameter in the continuum PDE (2.23), and is due to the accumulation of pointwise consistency errors over k steps. Remark 2.17. We mention that another interesting perspective is the inverse problem of using graph-based numerical schemes, like the PageRank scheme, to numerically solve the continuum reaction-advection-diffusion equations. Modulo technical details, all of the results in this paper can be extended to the manifold setting, where T d is replaced by a smooth compact and connected manifold M of dimension d embedded in R D where d < D. Since graph-based numerical schemes learn the geometry of the manifold automatically, they may provide convient numerical methods for solving PDEs on manifolds. We mention that ideas along these lines were mentioned in [8] for approximating the spectrum of the Laplace-Beltrami operator on a manifold of dimension higher than 2 or 3, where finite-element methods become cumbersome. This is an interesesting direction to explore in future research.

2.3.
Outline. The paper will be organized as follows. In Section 3 we prove pointwise consistency with high probability for the PageRank operator L n on a random directed geometric graph. This includes pointwise consistency to both first and second order continuum operators. In Section 4 we prove our main results, Theorems 2.3, 2.5, and 2.15.

Consistency for L n
In this section, we prove pointwise consistency for the operator L n with both first and second order continuum operators. Throughout this section and the rest of the paper, we write ω xy = ω n (x, y) and d x = d n (x) for simplicity.
3.1. Concentration of measure and change of variables. We first recall a concentration inequality proved in [9] as an application of Bernstein's inequality.
When we apply the lemma to the degree d y , we need to compute the expected value of d y , which is the integral For this computation, and others, we require asymptotic expansions in the change of variables formulas, which is provided by the following result.
Lemma 3.2 (Change of Variables). Let g : R d → R be continuous. Then we have for sufficiently small ε > 0.
Proof. The proof is straightforward when we integrate over x. In the case where we integrate over y, observe that Assuming that ε is small enough such that εDb(y) has eigenvalues with magnitude strictly less than 1, Therefore, For dy, the change of variables theorem tells us where | · | represents the determinant function.
Finally, choosing ε small enough for |εdivb(y) + O(ε 2 )| < 1 2 and taking the Taylor expansion of b(y) at x yields which we plug into the expression for dy to complete the proof.

Pointwise consistency.
We now turn to the main pointwise consistency results. Our first is a standard result for the degree.
with probability at least 1 − 2 exp(−cλ 2 nh d ), where c is a constant independent of n.
Proof. We use part (i) in Proposition 3.2 to compute By Lemma 3.1, we see that for any 0 ≤ λ ≤ 1. Combining the observations above completes the proof.
We now state and prove the consistency result.
Theorem 3.4 (Consistency for the 2 nd order PageRank Operator). There exists constants C, c > 0 such that the event that holds for all ϕ ∈ C 2,1 (T d ) and x ∈ X n has probability at least 1 − Cn exp(−cnh d λ 2 ).
Proof. Fix x ∈ T d , take ϕ ∈ C 2,1 (R d ) to be a test function, and let p = Dϕ(x) and a ij = ϕ x i x j (x). We apply the operator L n to ϕ at x and take a second-order Taylor expansion at x of the ϕ(y) inside the summation, which gives us where β = ϕ C 2,1 (T d ) , and the ε 3 + h 3 in the remainder term comes from the scaling of |y − x|. Observe that since ϕ(x) and its derivatives are factored out from the summation, the result will hold uniformly for all smooth test functions. Noting that |ω xy − ω yx | ≤ Ch −1 ε, we have by Lemma 3.1 that each of hold with probability at least 1 − 2 exp −cnh d λ 2 for any 0 < λ ≤ 1. Combining the observations above we have with probability at least 1 − C exp −cnh d λ 2 that We now compute asymptotic expansions for all the terms in (3.5). By Lemma 3.2 we We now compute By Lemma 3.2 we have By Lemma 3.2 again we have Finally, another application of Lemma 3.2 yields where δ ij = 1 if i = j and δ ij = 0 otherwise. Combining (3.6), (3.7), and (3.8) with (3.5) we have We now divide both sides of (3.9) by ρ(x) and use the identities to establish (3.4) for a fixed x ∈ T d . To complete the proof, condition on x ∈ X n and union bound over X n .
We also have a corresponding consistency result when the continuum PDE is first order.
Theorem 3.5 (Consistency for the 1 st order PageRank Operator). There exists constants C, c > 0 such that the event that holds for all ϕ ∈ C 1,1 (T d ) and x ∈ X n has probability at least 1 − Cn exp(−cnh d λ 2 ).
Proof. The proof follows closely to that of Theorem 3.4, so we sketch it here. Fix x ∈ T d , take ϕ ∈ C 1,1 (R d ) to be a test function, and let p = Dϕ(x). We have where β = ϕ C 1,1 (T d ) . Thus, with probability at least 1 − C exp −cnh d λ 2 we have that By (3.6) and (3.7) we have Divide both sides by ρ(x) and use the identity to complete the proof.
The proof of the other direction is immediate by replacing u with −u and v with −v.
Given the stability estimate from Lemma 4.1, we can now prove Theorem 2.3.
Proof of Theorem 2.3. Given the assumptions on ρ, b, and v, the solution u of the continuum PDE (2.10) belongs to C 3 (T d ), and u C 3 (T d ) depends on the ellipticity constant of the equation σ Φ γ h [21,29]. Thus, applying Lemma 3.3, and Theorem 3.
for all x ∈ X n with probability at least 1 − Cn exp −cnh d+2 δ 2 for any 0 < δ ≤ h −1 , where we used that γ ε , γ h ≤ 1. Therefore for all x ∈ X n with probability at least 1 − Cn exp −cnh d+2 δ 2 ) . Consider now the difference w n (x) = u(x) − u n (x), where u n solves the PageRank problem (2.9). Then w n satisfies for all x ∈ X n with probability at least 1 − Cn exp −cnh d+2 δ 2 . Applying Lemma 4.1 completes the proof.
We now give the proof of Theorem 2.15.
Proof of Theorem 2.15. The proof is similar to the proof of Theorem 2.3, so we sketch the outline, omitting some details. First, note that Proceeding now as in the proof of Theorem 2.3, we use Theorem 3.4, Lemma 3.3, and the observation above to deduce that with probability at least 1 − Cn exp(−cnh d+2 λ 2 ) for 0 < λ ≤ 1, where we also used that γ ε ≤ 1. Let ϕ(x) = 1 for all x ∈ X n . As in the proof of Lemma 4.1 we have that with probability at least 1 − Cn exp(−cnh d+2 δ 2 ), where 0 < δ ≤ 1 will be chosen later.
For the rest of the proof we assume the events above hold true. Define w n (x, k) = u(x, αk) − u n (x, k). We claim that The proof of the other direction is similar, and this will complete the proof. To prove (4.6), we use a comparision principle argument that proceeds by induction. The base case k = 0 is trivial, since w n (x, 0) = 0. Assume that (4.6) is true for some k ≥ 0. Then subtracting (4.4) and (2.22) we find that w n satisfies for all x ∈ X n and k ≥ 0. Rearranging this we obtain w n (x, k + 1) ≤ (1 − α) (w n (x, k) + L n w n (x, k)) + Cα(λ + ε + h) since w n (x, k) ≤ M k for all x ∈ X n . Applying (4.5) we have Hence, choosing δ = (1 − ηγ ε )/(4C) and restricting h + ε ≤ (1 − ηγ ε )/(4C), we have The claim (4.6) is thus established by induction, and this completes the proof.
We now turn our attention to proving the first-order rate, which hinges on the following observation that our scheme is monotone.
Proof. The proof is immediate, since In order to prove a convergence rate for the first order continuum limit, we require a Lipschitz estimate on the viscosity solution u of (2.14). The result follows a standard maximum principle argument, which we include for completeness.
We now prove that the Lipschitz constant of u δ is controlled independently of δ > 0. The argument is standard in elliptic PDEs, and we include it for completeness. We write c(x) = 1 + ρ −2 div (ρ 2 b) for convenience, and note we have c(x) ≥ 1 − ηγ ε . By the maximum principle we have To bound the gradient, we differentiate both sides of (4.7) in x i , then multiply both sides by u x i , and sum over i to obtain We use the identity on the Torus T d . Now, let x 0 ∈ T d be a point where |∇u δ | 2 attains its maximum value on T d . Then we have ∇|∇u δ (x 0 )| 2 = 0 and ∆|∇u δ (x 0 )| 2 ≤ 0, which yields where C depends on ρ, c, and v. Since Db L ∞ (T d ) ≤ (1 − ηγ ε )/2, we can apply Cauchy's inequality with ε to obtain which completes the proof.
We now give the proof of Theorem 2.5. The proof follows the method of doubling the variables, which is used in viscosity solution theory for proving the comparison principle and establishing error estimates [10,18]. For the reader's convenience, we review the definition of viscosity solution in Section B.
Proof of Theorem 2.5. First, by Lemma 4.1 we have that u n is uniformly bounded, independent of n, with probability at least 1 − Cn exp −cnh d+2 (1 − ηγ ε ) 2 provided ε + h is sufficiently small. We assume these conditions throughout the rest of the proof.
Define the doubling-of-variables function which has a maximum at (x n , y n ) ∈ X n × T d . Since Φ(x n , y n ) ≥ Φ(x n , x n ), we have By Lemma 4.3, u is Lipschitz and so θ 2 |x n − y n | 2 ≤ u(x n ) − u(y n ) ≤ C|x n − y n |.
Hence we have the bound |x n − y n | ≤ C θ .
Define ψ(x) = θ 2 |x − y n | 2 and ξ n = u n (x n ) − θ 2 |x n − y n | 2 . By the definition of (x n , y n ), u n − ψ has a maximum at x n relative to X n . Since we have u n (x n ) = ψ(x n ) + ξ n we see that u n ≤ ψ + ξ n on X n and so by Proposition 4.2 we have L n u n (x n ) ≤ L n (ψ + ξ n )(x n ).
The proof of the bound in the other direction on u − u n is analogous, except we use the auxiliary function Appendix A. A more general operator In this section, we consider the more general case where the matrix B in the definition of the weights for a random directed geometric graph (see Section 2.1 for definitions) is not the identity matrix. This case turns out to be difficult to interpret, and so we omit the details of extending our main results to this setting, but include the discussion below for completeness.
Assume that B(x) ∈ R d×d has bounded, Lipschitz continuous entries and has full rank for all x ∈ T d . Then, following the arguments of the previous subsection, the continuum limit operator is where the first order term is − γ h Φ(|z|) (∇u(x) · f (x, z)) (∇ log ρ(x) · f (x, z)) dz To clarify the notation, we use DB(x) to represent the tensor where the ijth term is ∇B ij (x). For instance, the ijth component of the term DB(x)B(x) −1 z is the dot product of the gradient of the ijth component of B with the vector B(x) −1 z, i.e., ∇B ij (x) · B(x) −1 z.
In the zeroth order term, we use H Φ to represent the integral Due to the presence of DB(x), the terms in F , f , and H Φ will vanish when B is a constant matrix, such as the identity.

Appendix B. Definition of viscosity solution
Viscosity solutions are a notion of weak solution for PDEs based on the maximum principle. Viscosity solutions enjoy very strong stability and uniqueness properties, and the theory is especially useful for passing from discrete to continuum limits (see, e.g., [10,12,16]). We review here the basic definitions of viscosity solutions of the first order equation where Ω ⊂ R d is open. For viscosity solution on the Torus T d , we take Ω = R n and treat functions on T d as Z d -periodic functions on R d for defining viscosity solutions.
Definition B.1. We say u ∈ C(Ω) is a viscosity subsolution of (B.1) if for all x ∈ Ω and every ϕ ∈ C ∞ (R n ) such that u − ϕ has a local maximum at x we have H(x, u(x), ∇ϕ(x)) ≤ 0.
Likewise, we say that u ∈ C(U ) is a viscosity supersolution of (B.1) if for all x ∈ Ω and every ϕ ∈ C ∞ (R n ) such that u − ϕ has a local minimum at x we have H(x, u(x), ∇ϕ(x)) ≥ 0.
Finally, we say that u is a viscosity solution of (B.1) if u is both a viscosity sub-and supersolution.