ON THE CONVERGENCE RATE OF POTENTIALS OF BRENIER MAPS

The theory of optimal transportation has experienced a sharp increase in interest in many areas of economic research such as optimal matching theory and econometric identification. A particularly valuable tool, due to its convenient representation as the gradient of a convex function, has been the Brenier map: the matching obtained as the optimizer of the Monge–Kantorovich optimal transportation problem with the Euclidean distance as the cost function. Despite its popularity, the statistical properties of the Brenier map have yet to be fully established, which impedes its practical use for estimation and inference. This article takes a first step in this direction by deriving a convergence rate for the simple plug-in estimator of the potential of the Brenier map via the semi-dual Monge–Kantorovich problem. Relying on classical results for the convergence of smoothed empirical processes, it is shown that this plug-in estimator converges in standard deviation to its population counterpart at the minimax rate of convergence of kernel density estimators if one of the probability measures satisfies the Poincaré inequality. Under a normalization of the potential, the result extends to convergence in the $L^2$ norm, while the Poincaré inequality is automatically satisfied. The main mathematical contribution of this article is an analysis of the second variation of the semi-dual Monge–Kantorovich problem, which is of independent interest.

Despite its many uses and applications, the large sample properties of the classical Monge-Kantorovich optimal transport problem have yet to be fully established in the statistical literature. Recently, Sommerfeld and Munk (2018) proved asymptotic normality of its value function in the case where the respective probability measures have finite support; in a seminal contribution, del Barrio and Loubes (2019) proved central limit theorems for the value function for general probability measures under a weak moment condition using the Efron-Stein inequality. These results are important, as the value functions of the Monge-Kantorovich problem under $L^p$-cost functions are the Wasserstein distances on the space of signed Borel measures, which metrize weak convergence plus convergence in $p$th mean (Santambrogio, 2015, Thm. 5.11). However, especially in economic applications, one is often interested in the optimizer of these problems rather than in the value function: the optimizer is the induced "optimal matching" between the two probability measures for the respective cost function. This optimal matching induces copulas which can potentially be useful in many multivariate settings, for instance, in the characterization of higher-order Markov processes (Ibragimov, 2009). In this respect, the Brenier map, the minimizer of the Monge-Kantorovich problem with the Euclidean distance as cost function, has received the most attention, because it possesses many attractive properties as an optimal matching, such as monotonicity. In fact, Galichon (2016, p. 64) states that establishing the large sample properties of (the potential function of) the Brenier map is an important open problem.
This paper takes a first step to analyze this problem: by deriving and analyzing the second variation of the semi-dual Monge-Kantorovich problem, we derive a convergence rate of the simple plug-in estimator of the potential function of the Brenier map for smoothed empirical measures (i.e., integrated kernel density estimators) in the sense of Yukich (1992) and van der Vaart (1994). Brenier (1991) showed that the Brenier map in this setting takes the form of the gradient $D\varphi$ of a convex function $\varphi$. We consider the Kantorovich potential function $\varphi$ of the Brenier map as an M-estimator and show that a natural sample counterpart $\hat\varphi_n$ converges to $\varphi$ in standard deviation on every compact subset of the interior of its support under appropriate smoothness assumptions on the densities. Without a normalization on $\varphi$, at least one of the two probability measures between which the Brenier map acts has to satisfy Poincaré's inequality for this result to hold. However, under a normalization of $\varphi$ that fixes its average value to zero (Lee, 2018), the convergence holds in the $L^2$ norm without requiring Poincaré's inequality. This complements the consistency proof for (the potential function of) the Brenier map established in Chernozhukov et al. (2017).
Since the initial submission of this article, Hütter and Rigollet (2019) have derived the minimax rate of convergence for the Brenier map by constructing a theoretical wavelet estimator. We find that the rate of convergence we obtain for the potential function corresponds to the minimax rate of the kernel density estimators for the respective measures. This rate is slightly suboptimal compared to the minimax rate for the Brenier map derived in Hütter and Rigollet (2019). The reason lies not in our results on the semi-dual Monge-Kantorovich problem but in the fact that we rely on classical results for the rate of convergence of smoothed empirical processes in Giné and Nickl (2008) and Radulović and Wegkamp (2000). These results derive the stochastic equicontinuity of the M-estimator in question via an approximation of the smoothed empirical process by the standard empirical process. It turns out that this imposes overly strong restrictions on the bandwidth $h_n$ for our problem and in fact ignores some of the additional regularity of the Brenier map, which makes our rate of convergence suboptimal.
Despite this, the current article shows that the simple plug-in estimator of the semi-dual Monge-Kantorovich problem, which is easier to implement and conceptually simpler than the theoretical wavelet estimator from Hütter and Rigollet (2019), performs well and possesses enough regularity to potentially reach the minimax rate of convergence. In fact, as the main mathematical result of this article, we show that the second variation of the semi-dual Monge-Kantorovich problem takes the form of a Dirichlet energy functional weighted by the density function of the source measure. This problem has the same regularity properties in multiple dimensions as the infinitesimal generators of ergodic diffusions, which has been shown to exceed the regularity of classical smoothed empirical processes by the seminal result of Dalalyan and Reiß (2007, Prop. 1); see the analysis in Rohde and Strauch (2010). This additional regularity also exactly coincides with the additional regularity found in Hütter and Rigollet (2019). This strongly suggests that an application of these results in place of the classical results for smoothed empirical processes can lead to fewer restrictions on the admissible bandwidth, which would imply the minimax rate of convergence found in Hütter and Rigollet (2019) for the simple plug-in estimator, without changing the estimator or the mathematical results of this paper.
Our focus is to examine the rate of convergence of the potential function in general dimension $d \ge 2$, i.e., when the probability measures between which the Brenier map acts are supported in $\mathbb{R}^d$. The one-dimensional case is simpler and has already been solved, showing that the Brenier map converges at the parametric rate in this case; in the econometrics literature, this was done, without mentioning the connection to optimal transport theory, in Athey and Imbens (2006), for instance. The reason why the one-dimensional case is easier to handle lies in the fact that the Brenier map between the distributions $F_X$ and $F_Y$ has a closed-form expression as the monotone rearrangement $D\varphi(x) := F_Y^{-1}(F_X(x))$; this is not true in higher dimensions, which is why one has to resort to more general tools in order to tackle this question. In particular, our main mathematical result is Lemma 1, which provides the regularity conditions of the first and second variation of the semi-dual problem of the Monge-Kantorovich problem in our setting.
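In one dimension, the monotone rearrangement $F_Y^{-1}(F_X(x))$ can be estimated directly from empirical distribution and quantile functions; the following is a minimal sketch (function names are ours, not from the article):

```python
import numpy as np

def monotone_rearrangement(x_sample, y_sample):
    """Return the map x -> F_Y^{-1}(F_X(x)) estimated from samples.

    In one dimension, the Brenier map between F_X and F_Y is the
    monotone rearrangement D phi(x) = F_Y^{-1}(F_X(x)).
    """
    x_sorted = np.sort(x_sample)
    y_sorted = np.sort(y_sample)
    n_x, n_y = len(x_sorted), len(y_sorted)

    def transport(x):
        # Empirical CDF of X evaluated at x, a value in [0, 1].
        u = np.searchsorted(x_sorted, x, side="right") / n_x
        # Empirical quantile function of Y evaluated at level u.
        idx = np.clip(np.ceil(u * n_y).astype(int) - 1, 0, n_y - 1)
        return y_sorted[idx]

    return transport

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
y = rng.exponential(1.0, size=5000)
T = monotone_rearrangement(x, y)
# The pushforward of the X-sample under T should resemble the Y-sample.
pushed = T(x)
```

Because the map is a composition of a nondecreasing CDF and a nondecreasing quantile function, the estimated map is automatically monotone, mirroring the convexity of the potential.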
The structure of this article is as follows: Section 2 contains all theoretical results of this article. We state the lemma about the analysis of the second variation of the semi-dual Monge-Kantorovich problem in Section 2.1 and our assumptions in Section 2.2. Section 2.3 introduces the results for the rate of convergence of the kernel density estimator of the semi-dual of the optimal transport problem with and without normalization. We also state a simple suboptimal rate of convergence for the Brenier map based on the bandwidth used for the potential function. Section 2.4 contains a brief discussion about computational issues. We conclude in Section 3. Appendixes A and B contain a brief review of the Monge-Kantorovich problem and all proofs, including additional lemmas.

CONVERGENCE RATES
In this section, we state the rate of convergence of the natural plug-in estimators of $\varphi$ and $D\varphi$ by considering the semi-dual problem to the Monge-Kantorovich problem, which is the dual to the optimal transport problem with the cost function $c(x,y) := -\langle x,y\rangle$. Here, $C(\mathcal{X})$ is the space of all bounded continuous functions on the support $\mathcal{X}$ of $P_X$, endowed with the standard supremum norm $\|f\|_\infty := \sup_{x\in\mathcal{X}}|f(x)|$, and $\langle x,y\rangle$ denotes the standard scalar product of vectors $x,y\in\mathbb{R}^d$. Throughout, we assume that $P_X$ and $P_Y$ are absolutely continuous with respect to Lebesgue measure with densities $f_X$ and $f_Y$, respectively. Theorem 1.3 in Villani (2003) shows that there is no duality gap for the Monge-Kantorovich problem in this setting, so that we can in fact use the dual problem (1) to derive the rate of convergence of the estimator of $\varphi$. Note that the optimal solution of (1) always consists of conjugate duals, i.e., $\psi = \varphi^*$, where the convex conjugate $\varphi^*$ of $\varphi$ is defined as $\varphi^*(y) := \sup_{x\in\mathcal{X}}\big[\langle x,y\rangle - \varphi(x)\big]$; see the first step in the proof of Proposition 3.1 in Brenier (1991). Therefore, the problem reduces to simply estimating $\varphi$, and we can write (1) in the semi-dual form (2). In practice, there are many ways to estimate the Brenier map; see, for instance, Benamou and Brenier (2000), Benamou, Froese, and Oberman (2014), and Chartrand et al. (2009), among others. The latter approach introduces an infinite-dimensional gradient descent method to calculate the Brenier map, using ideas from the calculus of variations. We take a very similar route in this article.
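For reference, the dual problem (1) and its semi-dual reduction (2) take the following standard form (a reconstruction from the surrounding discussion and Villani (2003); the constraint set in (1) is the usual one for the cost $-\langle x,y\rangle$):

```latex
% (1): dual of the optimal transport problem with cost c(x,y) = -<x,y>
\inf_{\substack{\varphi \in C(\mathcal{X}),\, \psi \in C(\mathcal{Y}) \\ \varphi(x) + \psi(y) \,\ge\, \langle x, y \rangle}}
  \int_{\mathcal{X}} \varphi \, dP_X + \int_{\mathcal{Y}} \psi \, dP_Y
  \tag{1}

% (2): semi-dual form, using that the optimum satisfies psi = phi^*
\min_{\varphi \in C(\mathcal{X})} M(\varphi), \qquad
  M(\varphi) := \int_{\mathcal{X}} \varphi \, dP_X
  + \int_{\mathcal{Y}} \varphi^{*} \, dP_Y
  \tag{2}
```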
We use ideas from the calculus of variations in conjunction with the fact that $\varphi$ is an M-estimator to derive a rate of convergence for its natural plug-in estimator, i.e., its smoothed sample analog $\hat\varphi_n$, which is the solution to
$$\hat\varphi_n \in \operatorname*{arg\,min}_{\varphi\in C(\mathcal{X})} \int_{\mathcal{X}} \varphi(x)\,\big(P^X_n * K_{h_n}\big)(x)\,dx + \int_{\mathcal{Y}} \varphi^*(y)\,\big(P^Y_n * K_{h_n}\big)(y)\,dy, \tag{3}$$
where we use the notation from Giné and Nickl (2008) and define the standard kernel density estimator by
$$\big(P^X_n * K_{h_n}\big)(x) := \frac{1}{n h_n^d} \sum_{i=1}^n \prod_{j=1}^d K\!\left(\frac{X_i^{(j)} - x^{(j)}}{h_n}\right).$$
Here, $K$ denotes a product smoothing kernel and $h_n$ the bandwidth, which for ease of notation we assume to be the same for all dimensions; $f * g$ denotes the convolution of two functions $f$ and $g$. The same notation holds for $(P^Y_n * K_{h_n})(y)$. Throughout, $n$ denotes the number of observations in the sample. For this rate-of-convergence result to be applicable in practice, we must assume that the computational implementation of (3) delivers a strictly convex solution $\hat\varphi_n$. This can be achieved by introducing a strict-convexity penalty term that enforces convexity of $\hat\varphi_n$ at every iteration of the algorithm. The penalty term can usually be chosen so that it gradually vanishes with the number of iterations and is not binding at the solution. We found that this type of regularization prevents algorithms like the one in Chartrand et al. (2009) from overfitting and hence makes the program more robust, which is another reason to use it in implementations; see, for instance, Gunsilius and Schennach (2019).
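The smoothed empirical measures entering the plug-in problem are built from kernel density estimates; a minimal sketch of a nonnegative second-order (Gaussian) product kernel density estimator with a common bandwidth across dimensions, as in the text (all names are ours):

```python
import numpy as np

def product_kde(sample, h):
    """Kernel density estimator with a Gaussian product kernel.

    sample : (n, d) array of observations
    h      : bandwidth, shared across all d dimensions
    Returns a function evaluating the density estimate at (m, d) points.
    """
    n, d = sample.shape

    def density(x):
        x = np.atleast_2d(x)
        # (m, n, d) array of scaled coordinate differences.
        u = (x[:, None, :] - sample[None, :, :]) / h
        # Product kernel: multiply univariate Gaussian kernels over dims.
        k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        return k.prod(axis=2).sum(axis=1) / (n * h**d)

    return density

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
f_hat = product_kde(data, h=0.35)
vals = f_hat(np.array([[0.0, 0.0], [3.0, 3.0]]))
```

The Gaussian kernel is both nonnegative and of second order, so it is compatible with Assumption 2 below.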

Main Lemma
In order to derive the rate of convergence of $\hat\varphi_n$, we need to make regularity assumptions on the densities $f_X$ and $f_Y$ as well as on the kernel $K\!\left(\frac{X_i - x}{h_n}\right)$ and the bandwidth $h_n$. In addition, we make use of Caffarelli's regularity theory for optimal transport maps, in particular the a priori interior estimates established in Caffarelli (1990), for which we have to make further regularity assumptions. Throughout, we will be working in Hölder spaces. Let $k := (k_1,\ldots,k_d)$ be a multi-index of nonnegative integers $k_1,\ldots,k_d$. Set $|k| = \sum_{i=1}^d k_i$ and write $D^k f := \frac{\partial^{|k|} f}{\partial x_1^{k_1}\cdots\partial x_d^{k_d}}$. We define the Hölder norm $\|f_X\|_{C^{s,\alpha}}$ for $s\in\mathbb{N}_0$ and $\alpha\in(0,1]$ of the Hölder space $C^{s,\alpha}(\mathcal{X})$ as
$$\|f_X\|_{C^{s,\alpha}} := \sum_{|k|\le s}\sup_{x\in\mathcal{X}}\big|D^k f_X(x)\big| + \sum_{|k|=s}\sup_{x\ne y}\frac{\big|D^k f_X(x)-D^k f_X(y)\big|}{\|x-y\|^{\alpha}},$$
where $\|\cdot\|$ again stands for the Euclidean norm in $\mathbb{R}^d$. In this sense, a function $f_X \in C^{s,\alpha}(\mathcal{X})$ if there exists a constant $c<+\infty$ such that $\|f_X\|_{C^{s,\alpha}}\le c$. Based on this, we say that a function $f_X$ is locally Hölder continuous on $\mathcal{X}$ if it is Hölder continuous on every compact subset of the interior of $\mathcal{X}$, which we denote by $f\in C^{s,\alpha}_{loc}(\mathcal{X}^\circ)$. The main mathematical result of this article, upon which all statistical derivations rely, is the following lemma, which provides the regularity conditions of the first and second variation of (2). It might be of independent interest.

LEMMA 1. Let $P_X$ and $P_Y$ be measures that are absolutely continuous with respect to Lebesgue measure with densities $f_X$ and $f_Y$ on convex and compact supports $\mathcal{X}$ and $\mathcal{Y}$. Then, the functional $M$ is convex, Lipschitz-continuous, and Hadamard differentiable on $C(\mathcal{X})$. The first variation $\delta M_\varphi(v)$ of $M(\varphi)$ on $C(\mathcal{X})$ is
$$\delta M_\varphi(v) = \int_{\mathcal{X}} v(x)\,dP_X(x) - \int_{\mathcal{Y}} v\big(D\varphi^*(y)\big)\,dP_Y(y).$$
Based on this, the Hadamard differential takes the form
$$M'(\varphi)(v) = \int_{\mathcal{X}} v(x)\Big[f_X(x) - f_Y\big(D\varphi^{**}(x)\big)\det\!\big(D^2\varphi^{**}(x)\big)\Big]dx.$$
If we restrict $M(\cdot)$ to $C^2(\mathcal{X}^\circ)$, then there exists a neighborhood $N_{\varphi_0}\subset C^2(K)$ around the strictly convex $\varphi_0$ such that $M'(\varphi)$ coincides with the Fréchet derivative on $N_{\varphi_0}$, where $K\subset\mathcal{X}^\circ$ is an arbitrary compact set. Moreover, the second variation $\delta^2 M_\varphi(v,u)$ is continuous in $\varphi$ on $N_{\varphi_0}$ and a bounded bilinear functional in $u,v\in C^2(\mathcal{X})$.
In particular, for $u \equiv v$, the second variation reduces to a Dirichlet energy functional weighted by the source density, and, for every $\varepsilon > 0$ and convex compact $K\subset\mathcal{X}^\circ$, there exists an $\eta(\varepsilon) > 0$ guaranteeing a corresponding coercivity bound. Lemma 1 is needed for a Taylor-expansion argument. The apparently novel result is the derivation of the formula for the second variation in all directions $C^1(\mathcal{X})$, along with its properties.

Assumptions
We can now state the required assumptions for the statistical result.
Assumption 1 (Regularity of the densities). The densities $f_X$ and $f_Y$ have the following properties: (i) the supports $\mathcal{X},\mathcal{Y}\subset\mathbb{R}^d$ of $f_X$ and $f_Y$ are compact and convex.
The assumptions on the support of the densities are required for Caffarelli's interior regularity theory (Caffarelli, 1990); see Lemma 2 in Appendix B.2. The boundedness assumption on the density makes the Monge-Ampère operator associated with the Monge-Kantorovich problem (2) an elliptic operator. Ellipticity is important, as it allows us to prove convergence of $\hat\varphi_n$ to $\varphi_0$ in Hölder norm using standard Schauder estimates for elliptic second-order partial differential equations (PDEs; Gilbarg and Trudinger, 1998, Chap. 6). See Lemma 3 in Appendix B.2.
Assumption 2 (Order of kernels and bandwidths). The kernel $K$ is nonnegative and of second order. The bandwidths $h_n, h_n' > 0$ satisfy $h_n^2 n^{1/2} \to 0$ and $(h_n')^2 n^{1/2} \to 0$ as $n \to +\infty$.
Assumption 2 is a standard assumption on the bandwidths and the kernel, taken directly from Theorem 6 in Giné and Nickl (2008), which provides the asymptotic distribution result for smoothed empirical processes indexed by univariate Hölder functions. In particular, the upper bound on the bandwidth makes the bias of the smoothed empirical process asymptotically negligible, as proved in Theorem 6 of Giné and Nickl (2008). We adapt their result to the multivariate setting in a straightforward way (Giné and Nickl, 2008, p. 369). We are more restrictive than this result in requiring nonnegative kernels, as they allow us to relate bracketing numbers of the empirical processes to their smoothed counterparts. It is possible to allow for higher-order kernels like the ones in Giné and Nickl (2008) if one replaces our bracketing-entropy approach by uniform entropy, at the cost of some formal complications. Note also that we do not require a lower bound on the amount of smoothing. This is because the potential $\varphi$ will always lie in a Donsker class, so that the empirical process converges for this class even without smoothing (Giné and Nickl, 2008, p. 344).
Without a normalization on the potential $\varphi$, we need to uphold another assumption on $P_X$. This assumption makes the objective function of the dual of the Kantorovich problem well-separated at the optimal $\varphi_0$ in an appropriately chosen semi-metric. In fact, we require that $P_X$ satisfy the Poincaré inequality
$$\operatorname{Var}_{P_X}(v) \le c \int_{\mathcal{X}} \|Dv(x)\|^2\,dP_X(x) \tag{4}$$
for any weakly differentiable $v:\mathcal{X}\to\mathbb{R}$, where $\operatorname{Var}_{P_X}(v) := \int v^2\,dP_X - \big(\int v\,dP_X\big)^2$ is the variance of $v$ with respect to the measure $P_X$, $\|Dv(x)\|^2$ is the squared Euclidean norm of the gradient of $v(x)$, and $c > 0$ is some constant.

Assumption 3
The probability measure P X satisfies the Poincaré inequality (4).

Convergence Rates for the Potential
The following is our main result.
THEOREM 1 (Rate of convergence of $\varphi$). Under Assumptions 1-3, it holds that for Here, $\varphi_0$ is the minimizer of (2) and $\hat\varphi_n$ is the minimizer of (3). Theorem 1 implies a curse of dimensionality of the same form as the one for the square root of the asymptotic integrated mean-squared error of kernel density estimators. This is not surprising, since kernel density estimators are used to estimate the probability measures. Note that, for $s+\alpha > d/2$, this is the standard minimax rate of estimation for kernel density estimators (Tsybakov, 2008, Thm. 1.1).
As mentioned in the Introduction, the rate for the case $s+\alpha > d/2$ is slightly worse than the minimax rate for estimators of the Brenier map in Hütter and Rigollet (2019); indeed, a rate of $n^{-\frac{s+\alpha}{2(s+\alpha)+d}}$ for the potential would in the best possible case correspond to a rate of convergence of $n^{-\frac{s+\alpha}{2(s+\alpha)+2+d}}$, because one would sacrifice "one degree of smoothness" when estimating the Brenier map directly, i.e., the gradient of the potential function. Hütter and Rigollet (2019) derive a minimax rate of $n^{-\frac{s+\alpha}{2(s+\alpha)-2+d}}$ for the Brenier map, which is slightly faster than the rate achievable by the standard plug-in estimator for our bandwidth $h_n$. The reason is that we rely on classical asymptotic results for smoothed empirical processes, which do not take into account the additional regularity of the Brenier map and hence impose the strong upper bound $h_n^4 n \to 0$ on the bandwidth. As for Assumption 3, the requirement that a probability measure satisfy Poincaré's inequality is a high-level condition, and establishing when it holds is an active area of research linked to the concentration-of-measure phenomenon. In particular, if $P_X$ satisfies the Poincaré inequality, then it satisfies an exponential concentration inequality (Ledoux, 2001, Cor. 3.2). One result establishes the reverse connection between the Poincaré inequality and a dimension-free concentration inequality (Gozlan, Roberto, and Samson, 2015, Thm. 1.2). Many important probability measures satisfy Poincaré's inequality, such as the exponential and the uniform distribution, and most notably the set of log-concave probability measures, i.e., measures possessing density functions $e^{-V(x)}$ for some convex and twice continuously differentiable $V(x)$; but there are many more classes (e.g., Bakry et al. (2008)).
Intuitively, Poincaré's inequality allows us to work with the standard deviation as a semi-norm on the Hölder space defined above. This semi-norm creates an equivalence class of functions of the form $u \sim v$ if and only if $u = v + c$, where $c$ is a constant function on $\mathcal{X}$. Under this equivalence class, the solution to the dual problem of the Monge-Kantorovich problem is unique and well-separated, which is one of the two main requirements for our derivation of the rate of convergence of the plug-in estimator. In general, the potential function of the Brenier map is only identified up to an additive constant, which would make the optimum not well-separated under a different (semi-)metric. (Here, $P^*$ denotes outer probability, which we use to circumvent measurability issues in the nonseparable space $\ell^\infty\big(C^{s,\alpha}_{loc}(\mathcal{X}^\circ)\big)$ of all bounded functionals on the Hölder space $C^{s,\alpha}_{loc}(\mathcal{X}^\circ)$.)
Since the main quantity of interest is generally the Brenier map itself, a usual approach in the literature on optimal transport theory is to pin down the potential function by fixing the coordinate system; the standard normalization is $\varphi(0) = 0$. However, Lee (2018), who relies on some of the results in the present article to derive a sieve estimation procedure for optimal transport maps, introduces the following helpful and natural normalization for the potential function:
$$\int_{\mathcal{X}} \varphi(x)\,dP_X(x) = 0, \tag{7}$$
i.e., fixing the mean to be zero. This normalization is rather helpful, because it makes the standard deviation coincide with the $L^2$ norm, so that all convergence arguments in the proof go through in the $L^2$ norm. Furthermore, under this normalization, we can replace the generalized Poincaré inequality with respect to the general probability measure $P_X$ by the classical Poincaré inequality with respect to Lebesgue measure. In other words, if the potential function $\varphi$ is normalized to have zero mean and is defined on a compact support, then it automatically satisfies the Poincaré inequality with respect to $P_X$. Indeed, the classical Poincaré inequality states that
$$\int_{\mathcal{X}} \big(v(x)-\bar v\big)^2\,dx \le c \int_{\mathcal{X}} \|Dv(x)\|^2\,dx$$
for some constant $c < +\infty$, where $\bar v$ denotes the Lebesgue mean of $v$ on $\mathcal{X}$. Now, since the normalization (7) implies $\int v^2\,dP_X = \operatorname{Var}_{P_X}(v)$, since the variance minimizes the expected squared deviation over all constants, and since by Hölder's inequality and Assumption 1 the density $f_X$ is bounded above and below by $\gamma < +\infty$ and $\gamma^{-1}$ on the compact $\mathcal{X}$, it follows that
$$\int_{\mathcal{X}} v^2\,dP_X = \operatorname{Var}_{P_X}(v) \le \int_{\mathcal{X}} \big(v-\bar v\big)^2\,dP_X \le \gamma \int_{\mathcal{X}} \big(v-\bar v\big)^2\,dx \le \gamma\,c \int_{\mathcal{X}} \|Dv\|^2\,dx \le \gamma^2 c \int_{\mathcal{X}} \|Dv\|^2\,dP_X.$$
This implies that the optimal $\varphi_0$ is well-separated with respect to $\|\cdot\|_{L^2(P_X)}$, and we immediately obtain the following.
COROLLARY 1 (Rate of convergence under normalization). Under Assumptions 1 and 2, it holds that for Here, $\varphi_0$ is the minimizer of (2), $\hat\varphi_n$ is the minimizer of (3), and the normalization (7) is upheld.
Based on the results established in the previous section, we can derive a rate of convergence for the Brenier map $D\varphi$ by subtracting one degree of smoothness from our rate of convergence for the potential function in the case $s+\alpha > d/2$. This is a classical result in the setting where $D\varphi$ has enough smoothness, but it requires changing the bandwidth $h_n$ used for estimating the potential function. Moreover, in the other two regimes, it is less clear how the rate of convergence behaves. The following proposition therefore provides a crude upper bound for the rate of convergence of the Brenier map based on the convergence rate of its potential, obtained by bounding the squared Euclidean distance between gradients of convex functions by the distance between their potential functions, as in the proof of Theorem 2.33 in Attouch and Wets (1986). It provides the rate of convergence of the Brenier map for the bandwidth used to estimate the potential function and is therefore by definition suboptimal compared to the rate obtainable when $h_n$ may be changed. However, this result is applicable in all three smoothness regimes by relying on the bandwidth used for the potential function, and it even holds for functions $\varphi$ that are not twice continuously differentiable but only possess a Lipschitz-continuous first derivative.
PROPOSITION 1 (Rate of convergence of $D\varphi$ for the bandwidth of the potential). Under Assumptions 1 and 2 and using the optimal bandwidth $h_n$ for obtaining the rate of convergence of the potential function $\varphi$, it holds that for where $\varphi_0$ is the minimizer of (2), $\hat\varphi_n$ is the minimizer of (3), and the normalization (7) is upheld.
Note that this rate is rather slow and far from optimal; in particular, it is the square root of the rate for the potential, and hence much slower than the minimax rate obtained in Hütter and Rigollet (2019). This suboptimality should not be surprising: as mentioned, the rate is based on the bandwidth that is optimal for the potential function and is not optimized for estimating the Brenier map directly. The novelty of Proposition 1 is that it provides a direct connection between the Brenier map and its potential function for a fixed bandwidth.

Practical Considerations
The rate obtained holds for the optimization problem (3). The first Hadamard derivative in Lemma 1 suggests the following simple functional gradient descent approach, which was proposed in Chartrand et al. (2009):
$$\varphi_{k+1} = \varphi_k - \alpha_n M'(\varphi_k), \qquad k = 1,2,\ldots,$$
where $k$ denotes the iteration, $\alpha_n$ is a step size, and $M'$ is the Hadamard derivative derived in Lemma 1. In practice, this approach can be implemented on a grid (Gunsilius and Schennach, 2019): place a grid on the data, estimate both densities via kernel density estimators on the grid, and then compute the optimal transport map by performing the gradient descent step at each grid point.
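A minimal one-dimensional sketch of this grid approach (all names and the choice of densities are our own illustrative assumptions; the update discretizes the gradient density $f_X(x) - f_Y(D\varphi(x))\,D^2\varphi(x)$ from Lemma 1, with derivatives taken by finite differences and no convexity penalty):

```python
import numpy as np

def grad_descent_potential(f_X, f_Y, x, steps=5000, alpha=2e-3):
    """Functional gradient descent for the semi-dual problem on a 1-d grid.

    f_X, f_Y : callables evaluating the source/target densities
    x        : 1-d grid covering the (compact) support
    The update phi <- phi - alpha * [f_X(x) - f_Y(phi')(x) * phi''(x)]
    discretizes the Hadamard derivative of the semi-dual objective.
    """
    phi = 0.5 * x**2          # start from the potential of the identity map
    for _ in range(steps):
        dphi = np.gradient(phi, x)      # estimated transport map D phi
        d2phi = np.gradient(dphi, x)    # its derivative (convexity: >= 0)
        grad = f_X(x) - f_Y(dphi) * d2phi
        phi -= alpha * grad
    return phi

def gauss(m):
    # Standard-variance Gaussian density centered at m.
    return lambda t: np.exp(-0.5 * (t - m)**2) / np.sqrt(2.0 * np.pi)

# Illustration: transport N(0,1) onto N(1,1); the true map is x -> x + 1.
x = np.linspace(-4.0, 4.0, 161)
phi_hat = grad_descent_potential(gauss(0.0), gauss(1.0), x)
T_hat = np.gradient(phi_hat, x)  # estimated Brenier map
```

The step size must be small relative to the squared grid spacing for the explicit update to remain stable, since the linearized dynamics are diffusion-like; convergence is slow in the tails where the target density is nearly zero.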
This approach clearly suffers from a computational curse of dimensionality, not least because it relies on kernel density estimation: the number of grid points grows exponentially with the dimension, so the approach is only applicable in lower dimensions. Moreover, in all of our analysis, we do not account for the computational approximation error induced by the specific optimization procedure, i.e., the gradient descent algorithm; we focus exclusively on the statistical rate of convergence. Let us also address the implementation of the normalization (7). In a practical grid approach, it is not clear a priori how to impose this normalization, as it requires knowledge of the potential function, which is the very object we want to estimate. In contrast, in sieve estimation procedures like Lee (2018), the normalization can be imposed straightforwardly by restricting attention to sets of potentials that are normalized in this way.
Lastly, note that all results presented in this article also hold for the standard empirical measure, without smoothing; this manifests itself in the fact that we do not require a lower bound on the bandwidth in Assumption 2. The simple model for this is the analog of (3) with the empirical measures in place of their smoothed counterparts. In this case, one needs to make the smoothness assumptions on $\hat\varphi_n$ directly, as they do not follow from the kernel density estimates in finite samples. In practice, this optimization is still hard to solve, even though it does not rely on kernel density estimation. This is not just a problem in our setting, but for the computational theory of optimal transport more generally (see, for instance, the discussions in other implementations such as Benamou et al. (2014) and Seguy et al. (2017)). One way to circumvent this curse is to use the Sinkhorn divergence introduced in Cuturi (2013), which penalizes the optimal transport problem with an entropy term and can be solved efficiently via Sinkhorn iterations (Sinkhorn, 1967). In this case, however, the optimizer does not coincide with the optimal transport map.
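For discrete measures, the entropy-penalized problem of Cuturi (2013) reduces to alternating diagonal scalings of a Gibbs kernel; a minimal sketch (variable names are ours):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=2000):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    a, b : marginal weight vectors (each summing to one)
    C    : cost matrix with C[i, j] = c(x_i, y_j)
    eps  : entropy regularization strength
    Returns the regularized transport plan, which does NOT coincide
    with the unregularized optimal transport plan for eps > 0.
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # scale columns to match the second marginal
        u = a / (K @ v)           # scale rows to match the first marginal
    return u[:, None] * K * v[None, :]

# Toy example: quadratic cost between two uniform discrete measures on the line.
x = np.linspace(-1.0, 1.0, 50)
y = np.linspace(0.0, 2.0, 50)
a = np.full(50, 1.0 / 50)
b = np.full(50, 1.0 / 50)
C = (x[:, None] - y[None, :])**2
plan = sinkhorn(a, b, C)
```

As the regularization `eps` decreases, the plan concentrates near the support of the unregularized optimal plan, at the cost of slower convergence and potential numerical underflow in the Gibbs kernel.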
Computing the optimal transport map efficiently in practice in high dimensions is therefore still an open problem.

CONCLUSION
We have derived a convergence rate for the simple plug-in estimator of the semi-dual Monge-Kantorovich problem for the potential function of the Brenier map. It coincides with the minimax rate of convergence of kernel density estimators over Hölder classes if the respective probability densities are smooth enough. The idea is to use the semi-dual problem of the Monge-Kantorovich problem and to analyze its second variation for a Taylor-expansion argument. Without a normalization on the potential, this rate is achieved for the standard deviation and requires that at least one of the measures satisfy the Poincaré inequality. Under the normalization that the first moment of the potential is zero, the convergence takes place in the $L^2$ norm and no longer requires the measure to satisfy the Poincaré inequality. The obtained rate of convergence is suboptimal compared to the minimax rate recently derived via a wavelet estimator in Hütter and Rigollet (2019). This suboptimality follows from the fact that we rely on classical asymptotic results for empirical processes as derived in Giné and Nickl (2008), which require the strict bound $h_n^4 n \to 0$ on the bandwidth. The main mathematical result of this article shows, however, that the second variation of the semi-dual Monge-Kantorovich problem takes the form of a classical Dirichlet energy functional weighted by the density of the source measure. This problem has the same regularity properties in multiple dimensions as the infinitesimal generators of ergodic diffusions, which has been shown to exceed the regularity of classical smoothed empirical processes by the seminal result of Dalalyan and Reiß (2007, Prop. 1); see the analysis in Rohde and Strauch (2010).
This additional regularity also exactly coincides with the additional regularity found in Hütter and Rigollet (2019) and implies that an application of these results in place of the classical results for smoothed empirical processes can lead to fewer restrictions on the admissible bandwidth which would imply the minimax rate of convergence found in Hütter and Rigollet (2019).
The results derived in this article can potentially be used to derive the asymptotic distribution of the potential function of the Brenier map by making use of the second variation of the semi-dual Monge-Kantorovich problem derived in Lemma 1.

A. THE MONGE-KANTOROVICH PROBLEM AND THE BRENIER MAP
This section is designed to give the reader a very brief overview of the Monge-Kantorovich problem and the Brenier map. For more information, we refer to the introductory text Villani (2003).
The goal of the theory of optimal transport is to analyze maps $T$ which transport one probability measure $P_X$ onto another probability measure $P_Y$ in a "cost-efficient way." The setup for this is the Monge-Kantorovich problem. To be precise, the Monge and Kantorovich problems are actually two different problems, the latter being the convex relaxation of the former. Monge's problem asks for an optimal transport map between two (probability) measures $P_X$ and $P_Y$ defined on their supports $\mathcal{X}$ and $\mathcal{Y}$, where optimality is measured with respect to some cost function $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. This problem can be stated as
$$\inf_{T:\,T_{\#}P_X = P_Y} \int_{\mathcal{X}} c\big(x, T(x)\big)\,dP_X(x). \tag{A.1}$$
In words, the Monge problem asks for an explicit measurable map $T:\mathcal{X}\to\mathcal{Y}$ which takes any Borel set $E_x\subset\mathcal{X}$ and maps it to some Borel set $TE_x\subset\mathcal{Y}$ "of the same size." The formal property is that $T$ preserves measure in the sense that $P_X(T^{-1}E_y) = P_Y(E_y)$, i.e., it needs to "match" every Borel set $E_y$ of size $P_Y(E_y)$ to a corresponding Borel set of the same size under $P_X$. The Kantorovich problem between (probability) measures $P_X$ and $P_Y$ under some cost function $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is the convex relaxation of the Monge problem: it only asks for an optimal transport plan, in the sense that the transport does not have to be accomplished through a function as in the Monge problem, but may be concentrated on the support of a joint probability distribution $\pi$ which has $P_X$ and $P_Y$ as its marginals,
$$\inf_{\pi\in\Pi(P_X,P_Y)} \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\pi(x,y), \tag{A.2}$$
where $y\in\mathcal{Y}$, $x\in\mathcal{X}$, and $\Pi(P_X,P_Y)$ is the set of all probability measures on $\mathcal{X}\times\mathcal{Y}$ such that the marginal distributions of any $\pi\in\Pi(P_X,P_Y)$ are precisely $P_X$ and $P_Y$. For many cost functions $c$, the solutions to the Monge and Kantorovich problems actually coincide under the assumption that $P_X$ is absolutely continuous with respect to Lebesgue measure, so that it is legitimate to speak of the Monge-Kantorovich problem in these cases. Moreover, this solution is unique for many important cost functions. All of these statements can be found in Chapters 1 and 2 of Villani (2003).
Different cost functions $c$ lead to different transport maps or transport plans, many of which occur naturally in economics and econometrics. The Brenier map results from solving the Monge-Kantorovich problem under the standard squared Euclidean distance as a cost function, i.e., $c(x,y) := \|x-y\|^2$. In fact, an equivalent formulation of the Monge-Kantorovich problem under quadratic cost, in the sense that the two problems generate the same optimizers, is
$$\sup_{\pi\in\Pi(P_X,P_Y)} \int \langle x,y\rangle\,d\pi(x,y), \tag{A.3}$$
whose dual problem we consider in this article. The importance lies in the fact that the dual of the Monge-Kantorovich problem under the quadratic cost does not admit a solution of the form $(\varphi,\varphi^*)$, whereas the dual of (A.3) does. Brenier (1991) first proved that if $P_X$ and $P_Y$ possess finite second-order moments and if $P_X$ is absolutely continuous with respect to Lebesgue measure, then the Monge and Kantorovich problems coincide, and the unique solution takes the form of the gradient of a convex function, i.e., $T_0(x) = D\varphi(x)$, for some convex $\varphi$. Moreover, one can choose $\varphi$ as the optimal Kantorovich potential, which is the solution of (2).
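One of the few settings in which the Brenier map has a closed form is between Gaussian measures; the following sketch verifies its measure-preservation property numerically (the Gaussian closed form is classical and not from this article; function names are ours):

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_brenier_matrix(S_x, S_y):
    """Linear part A of the Brenier map T(x) = m_y + A (x - m_x)
    between N(m_x, S_x) and N(m_y, S_y).

    A is symmetric positive definite, so T is the gradient of a
    convex quadratic potential, as Brenier's theorem requires.
    """
    R = sqrtm_psd(S_x)
    R_inv = np.linalg.inv(R)
    return R_inv @ sqrtm_psd(R @ S_y @ R) @ R_inv

S_x = np.array([[1.0, 0.3], [0.3, 0.5]])
S_y = np.array([[2.0, -0.4], [-0.4, 1.0]])
A = gaussian_brenier_matrix(S_x, S_y)
# Pushforward check: the image of N(0, S_x) under x -> A x has covariance
# A S_x A^T, which should equal S_y.
```

Because $A$ is symmetric with positive eigenvalues, $T(x) = Ax$ is indeed the gradient of the convex function $\tfrac12 x^\top A x$.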

B.1. Proof of Lemma 1.
Proof. The results on convexity and Lipschitz continuity follow straightforwardly from the properties of the Legendre–Fenchel transform; see Theorem 3.1 in Chartrand et al. (2009). So let us turn to deriving the variations and showing Hadamard differentiability in all directions v ∈ C(X).
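Throughout the proof, we use the Legendre–Fenchel transform in its standard form; the following display is included only as a reminder of the notation:

```latex
\varphi^*(y) \;:=\; \sup_{x\in\mathcal{X}}\bigl[\langle x,y\rangle-\varphi(x)\bigr],
\qquad
\varphi^{**}(x) \;:=\; \sup_{y\in\mathcal{Y}}\bigl[\langle x,y\rangle-\varphi^*(y)\bigr].
% \varphi^* is convex as a supremum of affine functions of y, and is
% Lipschitz on compacta with constant bounded by \sup_{x\in\mathcal{X}}|x|;
% \varphi^{**} is the closed convex envelope of \varphi.
```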
We do this by a simple argument using subgradients which has already been derived in Gangbo (1994) and used in Chartrand et al. (2009) and Villani (2003, p. 74). 12 The idea is to use the limit definition of the directional derivative, which is
$$
\delta M_\varphi(v) = \lim_{t\downarrow 0}\frac{M(\varphi+tv)-M(\varphi)}{t}.
$$
Then, since ϕ* is convex and hence differentiable almost everywhere, we can fix a y_0 ∈ Y such that Dϕ*(y_0) = x_0, for some x_0. Note on the other hand that ϕ + tv need not be convex; we therefore fix some x_t ∈ ∂(ϕ + tv)*(y_0), where ∂(ϕ + tv)* denotes the subdifferential of (ϕ + tv)*. This x_t is finite since X is compact. 11 I thank a referee for pointing this out. 12 A simple differentiability argument only calculates the variation in strictly convex directions v, which is not enough for our purposes. I thank a referee for pointing this out.
Then, we can write
$$
(\varphi+tv)^*(y_0)-\varphi^*(y_0)
= \bigl[\langle x_t,y_0\rangle-\varphi(x_t)-tv(x_t)\bigr]
- \bigl[\langle x_0,y_0\rangle-\varphi(x_0)\bigr].
$$
Now, the reasoning is exactly analogous to the reasoning in both Chartrand et al. (2009) and Villani (2003, p. 74). In fact, replacing x_t by x_0 in the first bracket on the right-hand side of the last equation gives a smaller quantity, whereas replacing x_0 by x_t in the second bracket gives a larger quantity, so that we can bound
$$
-v(x_0)\;\le\;\frac{(\varphi+tv)^*(y_0)-\varphi^*(y_0)}{t}\;\le\;-v(x_t).
$$
The fact that v is uniformly continuous on X implies that the sequence tv converges uniformly to 0. Therefore, any convergent subsequence of x_t converges to a maximizer of ⟨x, y_0⟩ − ϕ(x); since x_0 is the unique maximizer (the gradient Dϕ*(y_0) exists), it follows that x_t → x_0 and hence v(x_t) → v(x_0), so that the difference quotient of the conjugates converges to −v(Dϕ*(y_0)) for almost every y_0 ∈ Y. Since v is bounded and continuous, we can pass the limit into the integral by the Dominated Convergence Theorem, so that
$$
\delta M_\varphi(v) = \int_{\mathcal{X}} v(x)\,dP_X(x) - \int_{\mathcal{Y}} v\bigl(D\varphi^*(y)\bigr)\,dP_Y(y),
$$
which proves the form of the first variation in all directions v ∈ C(X) at a convex ϕ. Finally, note that Hadamard differentiability of M at ϕ is equivalent to the existence of a measure μ ∈ C*(X) such that δM_ϕ(v) = ∫_X v(x)μ(dx), since M is Lipschitz continuous and hence automatically Hadamard differentiable if the directional derivative is linear (also see Chartrand et al. (2009)). We can derive this form of the first variation by the change of variables y = Dϕ**(x), in which case we get
$$
\delta M_\varphi(v) = \int_{\mathcal{X}} v(x)\,\bigl[f_X(x)-f_Y\bigl(D\varphi^{**}(x)\bigr)\det\bigl(D^2\varphi^{**}(x)\bigr)\bigr]\,dx,
$$
where the equality follows from the fact that x = Dϕ*(Dϕ**(x)), so that we can define μ(dx) := [f_X(x) − f_Y(Dϕ**(x)) det(D²ϕ**(x))]dx, which establishes the Hadamard derivative.
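As a quick sanity check (an illustration added here, not part of the original argument), consider the one-dimensional case where P_X = P_Y is the uniform distribution on [0,1]:

```latex
% With P_X = P_Y = U[0,1], the optimal potential is \varphi_0(x) = x^2/2,
% so that D\varphi_0 = \mathrm{id}, \varphi_0^*(y) = y^2/2, D\varphi_0^* = \mathrm{id}.
% The first-variation formula then gives, for any v \in C([0,1]),
\delta M_{\varphi_0}(v)
  = \int_0^1 v(x)\,dx - \int_0^1 v\bigl(D\varphi_0^*(y)\bigr)\,dy
  = \int_0^1 v(x)\,dx - \int_0^1 v(y)\,dy = 0,
% consistent with \varphi_0 minimizing M: the first variation vanishes
% in every direction at the optimum.
```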
Let us now turn to the second variation and compute it in all possible directions v, u ∈ C¹(X). In general, we can find the second variation by taking the variation of the first variation. We denote the first variation of the Legendre–Fenchel transform in direction v at the point y ∈ Y by
$$
\delta\varphi^*_v(y) := \lim_{t\downarrow 0}\frac{(\varphi+tv)^*(y)-\varphi^*(y)}{t} = -v\bigl(D\varphi^*(y)\bigr).
$$
Therefore, we can calculate the second variation as
$$
\delta^2 M_\varphi(u,v) = \lim_{s\downarrow 0}\frac{\delta M_{\varphi+su}(v)-\delta M_\varphi(v)}{s},
$$
where s ↓ 0 means s → 0 and s > 0.
To do so, we need to consider the map
$$
f_s(y) := \frac{\varphi^*(y)-(\varphi+su)^*(y)}{s},
$$
together with a selection x_{s,u} ∈ ∂(ϕ + su)*(y_0). Now, let us work from the inside out. Let us first show that
$$
\lim_{s\downarrow 0}\partial_C f_s(y_0) = \partial_C\bigl(u\circ D\varphi^*\bigr)(y_0). \qquad (B.2)
$$
Note that since u need not be convex, we can only work with the subgradient ∂(ϕ + su)*, even though u ∈ C¹(X), since the gradient D(ϕ + su)(x) is not invertible in general. In this respect, it is important to note, however, that (ϕ + su)* is subdifferentiable, for all s, since su is differentiable everywhere and ϕ, being strictly convex, is differentiable almost everywhere and subdifferentiable everywhere. Therefore, the inverse of the subgradient, which coincides with the subgradient of the convex conjugate, is always nonempty. Now, to show (B.2), note that
$$
\partial_C f_s(y) = \frac{1}{s}\,\partial_C\bigl(\varphi^*-(\varphi+su)^*\bigr)(y) \subset \frac{1}{s}\bigl[\partial\varphi^*(y)-\partial(\varphi+su)^*(y)\bigr], \qquad (B.3)
$$
where ∂_C f(x) is the generalized gradient for locally Lipschitz continuous functions in the sense of Clarke (Clarke, 1975), which is the convex hull of the set of limits of the form lim_{i→∞} Df(x + h_i), where h_i → 0 as i → ∞. Note that we can use this simple form of the definition of the generalized gradient since the functions are all locally Lipschitz continuous, because (ϕ + su)* and ϕ* are convex and hence locally Lipschitz continuous. In particular, the functions f_s possess a gradient almost everywhere, i.e., Df_s(y) is defined for almost every y ∈ Y and every s (Clarke, 1975; Hiriart-Urruty, 1985). The first equality in (B.3) follows by the positive homogeneity of the subdifferential and the fact that s > 0. The second relation, which is in terms of sets, follows from the property of the Clarke subdifferential of difference-convex functions which states that
$$
\partial_C(f_1-f_2)(x) \subset \partial f_1(x) - \partial f_2(x)
$$
for convex functions f_1, f_2 (Bačák and Borwein, 2011; Hiriart-Urruty, 1985). In order to show (B.2), we therefore need to show that we can interchange the limit and the Clarke subdifferential in (B.3). Recall from our proof of the first variation that
$$
\lim_{s\downarrow 0} f_s(y) = u\bigl(D\varphi^*(y)\bigr) \quad\text{for almost every } y\in\mathcal{Y}.
$$
Note that the sequence {f_s}_{s↓0} is an equi-Lipschitzian sequence on the compact Y.
This follows from the fact that f_s is locally Lipschitz for all s, so that f_s is Lipschitz continuous on all of the compact Y for every s; furthermore, f_s converges to u ∘ Dϕ*, which is a composition of two Lipschitz continuous functions, so that there must exist a Lipschitz constant which holds uniformly for all f_s. In particular, the functions f_s converge uniformly to u ∘ Dϕ* on Y, since the pointwise convergence at almost every y ∈ Y upgrades to uniform convergence by the equi-Lipschitz continuity and the compactness of Y.
The generalized directional derivative in the sense of Clarke (Clarke, 1975) is
$$
f^\circ(y;h) := \limsup_{y'\to y,\; t\downarrow 0}\frac{f(y'+th)-f(y')}{t}
$$
for some direction h in the same space as Dϕ*(y_0). In particular, note that this generalized directional derivative coincides with the standard directional derivative at almost every y_0 ∈ Y, since u ∈ C¹(X) and Dϕ*(y) is differentiable at almost every y ∈ Y by a result from Aleksandrov (1939). Furthermore, by Proposition 1.11 in Clarke (1975), the generalized directional derivative defined on Y \ N for some set N ⊂ Y of measure zero coincides with the generalized directional derivative on all of Y, so that we can assume that y_0 and y′ are points where Dϕ* exists and is a single element. Now, given any m > u°(Dϕ*(y_0); Dϕ*(y′)), there exist a λ > 0 and a small enough h such that Dϕ*(y_0) + h + λDϕ*(y′) ∈ Y by convexity of Y and
$$
\frac{u\bigl(D\varphi^*(y_0)+h+\lambda D\varphi^*(y')\bigr)-u\bigl(D\varphi^*(y_0)+h\bigr)}{\lambda} < m.
$$
By the uniform convergence of f_s to u ∘ Dϕ*, which is a result of the Arzelà–Ascoli theorem and the compact support, the analogous difference quotients hold for f_s, for (y_s, y′_s) → (y_0, y′) in Y and sufficiently small s. By the definition of f°_s(y_s; y′_s), it therefore holds that there exists a small enough s such that f°_s(y_s; y′_s) < m.
Taking the limit supremum of this expression implies that
$$
\limsup_{s\downarrow 0} f^\circ_s(y_s;y'_s) \;\le\; u^\circ\bigl(D\varphi^*(y_0);D\varphi^*(y')\bigr).
$$
We may, by putting y′_s ≡ y′ for all s, conclude from this (Rockafellar, 1997, p. 234) that
$$
\limsup_{s\downarrow 0} f^\circ_s(y_s;y') \;\le\; u^\circ\bigl(D\varphi^*(y_0);D\varphi^*(y')\bigr),
$$
which follows from the definition of u°. Note that the generalized directional derivative f°_s(y_s; y′_s) is the support function of the generalized gradient ∂_C f_s(y_s) in the sense that
$$
f^\circ_s(y_s;y'_s) = \max\bigl\{\langle\xi,y'_s\rangle : \xi\in\partial_C f_s(y_s)\bigr\}, \qquad (B.4)
$$
see Clarke (1975, Prop. 1.4); analogously for u°(Dϕ*(y_0); Dϕ*(y′)) and ∂_C u(Dϕ*(y_0)). Therefore, and by the uniform convergence of f_s to u ∘ Dϕ*, it follows directly from (B.4) that, for every ε > 0, there exists a small enough s such that f°_s(y_0; y′) ≤ u°(Dϕ*(y_0); y′) + ε|y′|, by positive homogeneity of f°_s and u° and since Y is compact. But since the generalized directional derivative is the support function of the generalized gradient, it follows that
$$
\partial_C f_s(y_0) \;\subset\; \partial_C\bigl(u\circ D\varphi^*\bigr)(y_0) + \varepsilon B,
$$
where B denotes the unit ball of the same dimension as Y, see Rockafellar (1997, Thm. 24.5); this holds for every direction y′_s → y′ ∈ Y by letting ε → 0. 13 But this implies that we can interchange the limit and the generalized gradient in (B.3) to obtain (B.2). Furthermore, since u is differentiable at x_0 = Dϕ*(y_0), the generalized gradient becomes a gradient by Proposition 1.13 in Clarke (1975). We can therefore write
$$
\lim_{s\downarrow 0}\partial_C f_s(y_0) = D\bigl(u\circ D\varphi^*\bigr)(y_0) = D^2\varphi^*(y_0)\,Du\bigl(D\varphi^*(y_0)\bigr),
$$
and replacing x_0 + k = ∂(ϕ + su)*(y) and k by the above gives
$$
\lim_{s\downarrow 0}\frac{\partial(\varphi+su)^*(y)-\partial\varphi^*(y)}{s} = -D^2\varphi^*(y)\,Du\bigl(D\varphi^*(y)\bigr),
$$
since the inner product is continuous. In other words,
$$
\lim_{s\downarrow 0}\frac{v\bigl(\partial(\varphi+su)^*(y)\bigr)-v\bigl(D\varphi^*(y)\bigr)}{s} = -\Bigl\langle Dv\bigl(D\varphi^*(y)\bigr),\, D^2\varphi^*(y)\,Du\bigl(D\varphi^*(y)\bigr)\Bigr\rangle,
$$
which is what we wanted to show. Now, since v ∈ C¹(X) and Dv is hence uniformly bounded on X, we can apply the Dominated Convergence Theorem to conclude that
$$
\delta^2 M_\varphi(u,v) = \int_{\mathcal{Y}}\Bigl\langle Dv\bigl(D\varphi^*(y)\bigr),\, D^2\varphi^*(y)\,Du\bigl(D\varphi^*(y)\bigr)\Bigr\rangle\,dP_Y(y).
$$
13 This part of the argument is analogous to the argument in the proof of Theorem 24.5 in Rockafellar (1997), only for generalized gradients and not the convex case.
Applying the change of variables y = Dϕ**(x), so that Dϕ*(y) = x and D²ϕ*(y) = [D²ϕ**(x)]^{-1}, yields
$$
\delta^2 M_\varphi(u,v) = \int_{\mathcal{X}}\Bigl\langle Dv(x),\,[D^2\varphi^{**}(x)]^{-1}Du(x)\Bigr\rangle\, f_Y\bigl(D\varphi^{**}(x)\bigr)\det\bigl(D^2\varphi^{**}(x)\bigr)\,dx.
$$
This shows the formula for the second variation in directions u, v ∈ C¹(X).
Let us now turn to the statements about the Fréchet derivative and the continuity of the second variation in ϕ on some neighborhood N_{ϕ_0}. For this, we need to assume that ϕ, v, u ∈ C²(X). We can deal with both the first and the second variation simultaneously. Now, in order for the Fréchet derivative to exist, δM_ϕ(v) must exist, for all ϕ ∈ C²(K), in a neighborhood around the minimal ϕ_0 ∈ C²(K), for any compact K ⊂ X•, and must be continuous in ϕ (Zeidler, 1985, p. 192). The same holds for the continuity of the second variation. Note that neither the first nor the second variation is continuous in ϕ everywhere on C²(X) in general, since the double convex conjugate ϕ** of some function ϕ ∈ C²(X) lies in C^{1,1}(X), but does not possess higher regularity properties in general (Griewank and Rabier, 1990; Kirchheim and Kristensen, 2001). In particular, we need to exploit the (strict) convexity of ϕ_0.
The key to proving continuity of δM_{ϕ_0}(v) and δ²M_{ϕ_0}(u,v) in a neighborhood around ϕ_0 is therefore to work with the Hessian or Monge–Ampère measure (Trudinger and Wang, 2008, Sect. 2.2). In fact, recall that for any convex function f differentiable at a point x, its gradient Df coincides with the Normal map at this point. Therefore, we can, for general and not necessarily differentiable convex functions f, define a measure via the Normal mapping by
$$
\mu_f(S) := \mathrm{Leb}\bigl(\partial f(S)\bigr) \quad\text{for Borel sets } S,
$$
where Leb(S) denotes the Lebesgue measure of the Borel set S. It is a standard result that the Monge–Ampère measure is an actual measure (Trudinger and Wang, 2008, Sect. 2.2). If the function f ∈ C²(X), then
$$
\mu_f(S) = \int_S \det\bigl(D^2 f(x)\bigr)\,dx.
$$
Applying this to our setting, we set f ≡ ϕ**. Now, fix some compact K ⊂ X• and pick a sequence {ϕ_n}_{n∈N} ⊂ C²(X) which converges to ϕ_0 in the topology induced by the Hölder norm ∥·∥_{C²(K)}. From this, it follows directly that ϕ**_n ∈ C^{1,1}(K), for every n (Kirchheim and Kristensen, 2001). Now, we want to prove that this implies that ϕ**_n → ϕ_0 uniformly on K. For this, notice that since the ϕ**_n are convex functions, they converge uniformly on K to some convex function f on K if they converge pointwise (Rockafellar, 1997, Thm. 10.8). We therefore need to show that this f must coincide with ϕ_0. For this, note that since ∥ϕ_n − ϕ_0∥_{C²(K)} → 0, it holds that, for every point x ∈ K, there exists a large enough n_x such that the functions ϕ_n for n ≥ n_x satisfy
$$
\varphi_n(x') > \varphi_n(x) + \bigl\langle D\varphi_n(x),\, x'-x\bigr\rangle \quad\text{for all } x'\ne x,
$$
by the strict convexity of ϕ_0. Now, by the definition of the closed convex envelope of a function, i.e., the double convex conjugate,
$$
\varphi_n^{**} = \sup\{g \le \varphi_n : g \text{ convex}\},
$$
which directly implies that ϕ**_n(x) must coincide with ϕ_n(x) at those points x where ϕ_n is strictly convex. As n → ∞, ϕ**_n must therefore converge pointwise, and by convexity also uniformly, to the strictly convex ϕ_0. Now, note that ϕ**_n → ϕ_0 uniformly on K ⊂ X• implies that μ_{ϕ**_n} → μ_{ϕ_0} weakly (Trudinger and Wang, 2008, Lem. 2.2).
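A quick one-dimensional illustration of the Normal-mapping definition (added for intuition; not part of the original proof):

```latex
% For f(x) = |x| on \mathbb{R}, the subdifferential is
%   \partial f(x) = \{-1\} for x<0, \quad [-1,1] at x=0, \quad \{1\} for x>0,
% so the Monge--Ampere measure of the singleton \{0\} is
\mu_f(\{0\}) = \mathrm{Leb}\bigl(\partial f(\{0\})\bigr) = \mathrm{Leb}\bigl([-1,1]\bigr) = 2,
% i.e., \mu_f = 2\,\delta_0: the kink concentrates all of the "curvature"
% of f at the origin, which no pointwise Hessian can capture.
```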
Therefore, since v, u ∈ C²(X) and f_Y and f_X are at least in C²(Y) and C²(X) by Assumption 1, it follows by the definition of weak convergence that both
$$
\delta M_{\varphi_n}(v)\to\delta M_{\varphi_0}(v) \quad\text{and}\quad \delta^2 M_{\varphi_n}(u,v)\to\delta^2 M_{\varphi_0}(u,v),
$$
which implies that the second variation is continuous in ϕ in a neighborhood around ϕ_0, so that the Hadamard derivative M′(ϕ_0) coincides with the Fréchet derivative (Zeidler, 1985, p. 192). In particular, setting v ≡ u, it follows from this that, for every ε > 0, there exists some η(ε) > 0 such that ∥ϕ − ϕ_0∥_{C²(K)} < η(ε) implies
$$
\bigl|\delta^2 M_\varphi(v,v)-\delta^2 M_{\varphi_0}(v,v)\bigr| \le \varepsilon\,\|v\|^2_{\infty,K},
$$
where ∥·∥_{∞,K} denotes the supremum norm on K; the intermediate steps of this bound use Hölder's inequality together with the fact that sup_x f²(x) = (sup_x f(x))² for a nonnegative function f, as well as the definition of the Hölder norm and the sequence definition of the continuity of the second variation in ϕ we have shown above. Finally, we can show that the second variation is a bounded bilinear functional. Since X is compact and P_X a probability measure, it follows that v ∈ L²(P_X) by Hölder's inequality, as v ∈ C²(X) and hence ∥v∥_∞ < +∞. Linearity of the second variation in both arguments follows from the fact that the derivative D is a linear operator and that the inner product is linear in both arguments. To see continuity of the second variation, note that we can bound it in terms of ∥Dv∥_{L²(f_Y(Dϕ_0(x)) det(D²ϕ_0(x))dx)}. Now, since the measure f_Y(Dϕ_0(x)) det(D²ϕ_0(x))dx is a probability measure, it holds that ∥Dv∥_{L²(f_Y(Dϕ_0(x)) det(D²ϕ_0(x))dx)} ≤ ∥Dv∥_∞. Moreover, note that D : C²(K) → C¹(K) is a bounded linear operator by the definition of the space C²(X), so that ∥Dv∥_∞ ≤ c∥v∥_{C²(K)}, for all v ∈ C²(K) and some 0 < c < +∞. This implies that ∥Dv∥_{L²(f_Y(Dϕ_0(x)) det(D²ϕ_0(x))dx)} ≤ c∥v∥_{C²(K)}, which shows that δ²M_{ϕ_0}(v,v) is a continuous quadratic functional on C²(X•). The result for the boundedness of the bilinear functional δ²M_{ϕ_0}(v,u) follows from an analogous argument, only using the operator norm ∥δ²M_{ϕ_0}∥_B = inf{c > 0 : |δ²M_{ϕ_0}(u,v)| ≤ c∥u∥_{C²(K)}∥v∥_{C²(K)}} for this bilinear functional.

B.2. Proof of Theorem 1.
We split the proof into several lemmas to make it easier to follow. The structure of the proof is as follows. Lemma 1 is the core of the proof, as it provides the regularity properties of the Monge–Kantorovich problem. Lemma 2 below uses Caffarelli's regularity theory to connect the smoothness of the potential function ϕ to the smoothness of the respective density functions. Lemma 3 below provides a connection between the convergence properties of the density functions and the potential function of the Brenier map, proving convergence in Hölder norm of ϕ̂_n to ϕ_0. Lemma 4 below lets us bound the L² norm of the difference of the convex conjugates of the potential function of the Brenier map and another strictly convex function by the difference of these functions themselves. We need this to bound the L² bracketing numbers of ϕ* − ϕ*_1 by the L² bracketing numbers of ϕ_0 − ϕ_1. Finally, the main proof of the theorem consists of checking the stochastic equicontinuity condition and the local identification condition (that the optimum is well-separated) required in Theorem 3.2.5 of van der Vaart and Wellner (2013). The local identification condition follows from the Poincaré inequality and the properties of the Monge–Kantorovich problem proved in Lemma 1. The equicontinuity follows from the lemmas and the same argument as in the proof of Theorem 3.2.5 in van der Vaart and Wellner (2013).
Let us also introduce some notation. The symbol ≲ means "less than or equal to a constant multiple of." We mostly work in subspaces of L^p(P) or C, which are AL- or AM-spaces, respectively, with the standard pointwise partial orders. N_[](ε, F, ∥·∥) is the minimal number of brackets of size ε > 0, measured in the norm ∥·∥, which cover F. The letters c, c′ denote general constants, which can represent different values in different inequalities; ⌊a⌋ denotes the largest integer smaller than or equal to a, and ⌈a⌉ denotes the smallest integer larger than or equal to a.
As discussed in the main text, the optimal solution (ϕ_0, ψ_0) to the semi-dual problem can be shown to always consist of convex conjugates, i.e., the optimal solution can be written as (ϕ_0, ϕ*_0); see the first step in the proof of Proposition 3.1 in Brenier (1991). We can therefore write the objective function as
$$
M(\varphi) = \int_{\mathcal{X}}\varphi(x)\,dP_X(x) + \int_{\mathcal{Y}}\varphi^*(y)\,dP_Y(y).
$$
One key step in the proof is to connect the regularity of the densities f_X and f_Y to the regularity of the potential function ϕ. We do this by using the interior Schauder estimates proved in the seminal article Caffarelli (1990), which we state in the following.
LEMMA 2 (Regularity of ϕ). Under Assumption 1, the potential function of the Brenier map, ϕ, between P_X and P_Y lies in C^{s+2,α}_{loc}(X•). Moreover, ϕ*, as the potential of the Brenier map between P_Y and P_X, lies in C^{s+2}_{loc}(Y•).
set Z := {z : ϕ_0(z) < ϕ_0(x_0) + l(z)} is nonempty and compactly contained in B_{r/2}(x_0). Therefore, by the uniform convergence of ϕ̂_n to ϕ_0 (which follows from the normalization ϕ̂_n(x_0) = ϕ_0(x_0) and the fact that Dϕ̂_n converges to Dϕ_0 locally uniformly, in conjunction with the convexity of ϕ_0 and ϕ̂_n) and the fact that the gradient Dϕ_0 is normal to l on ∂Z, it holds that Z_n := {z : ϕ̂_n(z) < ϕ̂_n(x_0) + l(z)} are nonempty convex sets which converge in the Hausdorff distance to Z. Moreover, the maps w_n := ϕ̂_n − l solve a Dirichlet problem of Monge–Ampère type on Z_n in the Aleksandrov sense, with boundary data fixed on ∂Z_n; analogously, the map w := ϕ_0 − l solves the corresponding Dirichlet problem on Z. Importantly, since l is linear, the Monge–Ampère measures 14 of ϕ_0 and w coincide. We can therefore subtract a version of the first equation, scaled by some fixed ε ∈ (0,1), from the second equation and define v_n := w − (1 − ε)w_n. This is not a convex function in general. We therefore work with the convex envelope of w − (1 − ε)w_n, which we denote by conv(w − (1 − ε)w_n). 15 Since w_n converges to the strictly convex w locally uniformly, by the fact that Dϕ̂_n converges to Dϕ_0 locally uniformly, it holds that conv(w − (1 − ε)w_n) converges uniformly to εw inside a compact K ⊂ X•. Furthermore, for large enough n, the convex envelope conv(w − (1 − ε)w_n) almost coincides with εw, since the latter is strictly convex. Therefore, the Monge–Ampère measure of v_n = w − (1 − ε)w_n is uniformly elliptic away from the boundary of its support, and v_n ∈ C^{s,α}_{loc}(X) by an application of Caffarelli's regularity theory using Lemma 2. We can hence apply standard Schauder estimates for uniformly elliptic PDEs.
In particular, Theorem 6.2 in Gilbarg and Trudinger (1998) provides an interior Schauder estimate for v_n, bounding its Hölder norm on compacta by a constant multiple, for some constant 0 < c < +∞, of two terms. The first term on the right-hand side converges to zero on X• if we let ε → 0, by what we have shown above. The second term also converges to zero if we let ε → 0, by the fact that Dϕ̂_n converges uniformly to Dϕ_0. Since x_0 was arbitrary, this shows that
$$
\|\hat\varphi_n-\varphi_0\|_{C^{2,\alpha}(K)}\to 0 \quad\text{for every compact } K\subset\mathcal{X}^\bullet.
$$
In addition, when deriving the degree of continuity of the objective function, we frequently need to relate the distance ∥ψ* − ϕ*∥_{L^p(P_Y)} to ∥ψ − ϕ∥_{L^p(P_X)}, where ψ is some strictly convex function and ϕ is the potential function of the Brenier map transporting P_X onto P_Y. For this, we need the following lemma, which lets us bound differences of convex conjugates in the L^p norm, 1 ≤ p ≤ +∞, by the differences of the functions themselves.
14 For a definition of Monge-Ampère measures, see the proof of Lemma 1. 15 We defined the convex envelope in the proof of Lemma 1 as the double convex conjugate of a function.
where P_X ∘ ((Dψ)^{-1} ∘ Dϕ)^{-1} is the pushforward measure of P_X via (Dψ)^{-1} ∘ Dϕ. But note that we can then bound the density of this pushforward measure by a constant C, which follows from Corollary 2 in Stern (2010); that corollary also provides the expression for C in terms of the Jacobian. The fact that a change of variables is a lattice isometry between L¹ spaces (Aliprantis and Border, 2006, Cor. 13.47) implies that, for p = 1, we have C = 1. Putting everything together, we have
$$
\|\psi^*-\varphi^*\|^p_{L^p(P_Y)} \le C\,\|\psi-\varphi\|^p_{L^p(P_X)},
$$
and taking the pth root on both sides gives the claim.
We are now ready to prove Theorem 1.
Proof of Theorem 1. In order to obtain the rate of convergence of the estimator, we want to apply Theorem 3.2.5 in van der Vaart and Wellner (2013). For this, we have to show that the minimum of M(ϕ) is well-separated and obtain the degree of smoothness of the objective function M(·). In the following, we denote the argument that minimizes M(ϕ) by ϕ_0. We split the proof into three parts: in the first part, we show that the minimum of M is well-separated. In the second part, we derive the degree of continuity of M. In the third part, we put everything together and derive the rate of convergence. We divide the second part, deriving the degree of continuity of M, into two further steps.
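For orientation, the rate theorem we apply can be paraphrased as follows (a sketch of Theorem 3.2.5 in van der Vaart and Wellner (2013), stated for minimization; the precise conditions are in the original, and the exponent β below plays the role of their α):

```latex
% If, for every \varphi in a neighborhood of \varphi_0,
%   M(\varphi) - M(\varphi_0) \gtrsim d^2(\varphi,\varphi_0)       (well-separation)
% and the centered empirical objective M_n satisfies
%   E^*\!\sup_{d(\varphi,\varphi_0)<\delta}
%     \sqrt{n}\,\bigl|(M_n - M)(\varphi) - (M_n - M)(\varphi_0)\bigr|
%     \;\lesssim\; \phi_n(\delta)                                  (modulus of continuity)
% for functions \phi_n with \delta \mapsto \phi_n(\delta)/\delta^{\beta}
% decreasing for some \beta < 2, then any sequence r_n with
r_n^2\,\phi_n\!\left(\tfrac{1}{r_n}\right) \;\le\; \sqrt{n} \quad\text{for every } n
% yields the rate d(\hat\varphi_n,\varphi_0) = O_{P^*}(r_n^{-1}),
% provided \hat\varphi_n nearly minimizes M_n and is consistent.
```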
First part: The minimum is well-separated. A sufficient condition for the minimum of the Kantorovich problem to be well-separated is that the second variation is linear in both arguments and continuous in v, as well as strongly positive in the sense that there exists a constant c > 0 such that
$$
\delta^2 M_{\varphi_0}(v,v) \ge c\,G(v)^2,
$$
for some nonnegative function G which satisfies G(0) = 0; see Theorem 3.2.5 in van der Vaart and Wellner (2013). Recall from Lemma 1 that the second variation δ²M_{ϕ_0}(v,v) in directions v ∈ C²(X) takes the form
$$
\delta^2 M_{\varphi_0}(v,v) = \int_{\mathcal{X}}\Bigl\langle Dv(x),\,[D^2\varphi_0(x)]^{-1}Dv(x)\Bigr\rangle\, f_Y\bigl(D\varphi_0(x)\bigr)\det\bigl(D^2\varphi_0(x)\bigr)\,dx.
$$
Now, note that strong positivity of the second variation of M(·) holds by Assumption 3 if we define G(v) := ∥Dv∥_{L²(P_X)}, because then the inequality just reads
$$
\delta^2 M_{\varphi_0}(v,v) = \int_{\mathcal{X}}\Bigl\langle Dv(x),\,[D^2\varphi_0(x)]^{-1}Dv(x)\Bigr\rangle\,dP_X(x) \ge c\,\|Dv\|^2_{L^2(P_X)},
$$
as ϕ_0 satisfies the Monge–Ampère equation f_X(x) = f_Y(Dϕ_0(x)) det(D²ϕ_0(x)) by Lemma 1. The optimum is hence well-separated. We can therefore turn to the longer part of the proof, which deals with deriving the correct modulus of continuity of the objective function M(ϕ). Second part: Finding the correct degree of continuity of M(ϕ). In this part, we need to find the proper degree of continuity of the empirical process of the objective function, i.e., we need to find the correct function φ_n such that
$$
E^*\sup_{\|\varphi-\varphi_0\|<\delta}\sqrt{n}\,\bigl|(M_n-M)(\varphi)-(M_n-M)(\varphi_0)\bigr| \;\lesssim\; \phi_n(\delta),
$$
where M_n denotes the objective function evaluated at the smoothed empirical measures, i.e., the smoothed empirical process, and E* is the outer expectation with respect to X and Y. From now on, we restrict the setting to compact subsets of X• and Y•. To obtain the correct degree of smoothness φ_n, we use the multivariate generalization of Theorem 6 in Giné and Nickl (2008), which provides the optimal rate of convergence in terms of bandwidths for kernel density estimators. In order to do this, we need to show that ϕ and its convex conjugate ϕ* lie in Donsker classes and that the kernel density estimators converge in the correct Hölder spaces. We start with the convergence of the kernel density estimators.
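For reference, the Poincaré inequality that drives the identification step takes the standard form (the display is added here as a reminder; C_P denotes the Poincaré constant of P_X):

```latex
\int_{\mathcal{X}}\Bigl(v(x)-\int_{\mathcal{X}} v\,dP_X\Bigr)^{2} dP_X(x)
\;\le\; C_P \int_{\mathcal{X}} \bigl|Dv(x)\bigr|^{2}\,dP_X(x),
% for all v \in C^1(\mathcal{X}). Combined with the strong positivity of
% the second variation, it turns control of \|Dv\|_{L^2(P_X)} into control
% of the centered potential itself in L^2(P_X).
```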
It is hence natural to consider the classes of functions which form a δ-covering of F and F*. The corresponding envelope functions are square integrable. Note that since the variance is bounded above by the L² norm, the brackets measured in the variance are smaller than the brackets measured in the L² norm, so that a δ-covering of F in the L² norm is a δ-covering of F in variance.
Step 2: Here, we put everything together, showing that φ_n(δ) = δ^{1−(d−4)/(2s+2α)} based on the previous step. To do so, we first decompose the empirical process into the term corresponding to P_X and the term corresponding to P_Y and consider each term separately. In fact, we focus on the first term, as the second term is completely analogous. We use a similar idea to Radulović and Wegkamp (2000) in the following. For this, we split the first term into a stochastic term and a bias term. The second term, the bias, goes to zero under Assumptions 1 and 2, which follows from a generalization of the proof of Theorem 6 in Giné and Nickl (2008) to the multivariate case, whose assumptions are implied by ours; the proof for this is the same as the proof of the univariate case in their paper up to some minor notational changes and is therefore omitted. We can therefore focus on the first term and establish its degree of continuity. To do so, note that we can rewrite the integral of ϕ − ϕ_0 against the smoothed empirical measure as the integral of the smoothed functions against the empirical measure itself, which follows from the definition of the convolution of measures (Folland, 2013, p. 270) and the fact that the empirical measure P^n_X is an unbiased estimator of P_X. Here, we have defined the smoothed functions
$$
\bar\varphi(x) := \int K_{h_n}(x-x')\,\varphi(x')\,dx' = (K_{h_n}*\varphi)(x),
$$
and analogously ϕ̄_0 := K_{h_n} * ϕ_0. We therefore define the classes of functions F_n consisting of the smoothed elements ϕ̄, which change with n; based on these, we consider the classes F^δ_n. By Proposition 8.10 in Folland (2013) and the fact that K is integrable and ϕ, ϕ_0 ∈ C^{s+2,α}, it follows that ϕ̄, ϕ̄_0 ∈ C^{s+2,α}. Now we want to relate the bracketing entropies log N_[](ε_X, F^δ_n, ∥·∥_{L²(P_X)}) to the bracketing entropy log N_[](ε_X, F^δ, ∥·∥_{L^∞(P_X)}). This works because we assume the kernel K to be nonnegative on X. In fact, a bracket [ϕ_L; ϕ_U] stays a bracket (of possibly different size) [ϕ̄_L; ϕ̄_U] if we replace ϕ_L by ϕ̄_L and ϕ_U by ϕ̄_U. Indeed, by assumption, it holds that ϕ_L(x) ≤ ϕ(x) ≤ ϕ_U(x), for all x ∈ X• and every ϕ ∈ [ϕ_L; ϕ_U], which is the standard partial order on C(X•). Now, for any ϕ ∈ [ϕ_L; ϕ_U], we have
$$
\bar\varphi(x)-\bar\varphi_L(x) = \int K_{h_n}(x-x')\,\bigl[\varphi(x')-\varphi_L(x')\bigr]\,dx' \ge 0,
$$
since the kernel is nonnegative on X•; the analogous reasoning holds for the upper bound, which shows that [ϕ̄_L; ϕ̄_U] is a bracket when [ϕ_L; ϕ_U] is.
This is the reasoning used in Radulović and Wegkamp (2000) for instance.
We also need to find an appropriate bound for the brackets. To do so, we first calculate the L² distance between the smoothed functions, 16 using Minkowski's inequality for integrals (Folland, 2013, Thm. 6.19). Therefore, the L²(P_X) distance between ϕ̄ and ϕ̄_0 is bounded by a constant multiple of the corresponding distance between ϕ and ϕ_0, by Jensen's inequality and the fact that the kernel integrates to one. Now, by a change of variables and a second-order Taylor expansion of f_X, the remainder involves the second-order term Σ_{|k|=2} D^k f_X((1 − t)x + tx′) integrated against the kernel K_{h_n}(x′). 16 We drop the region of integration, X•, in this derivation in order to save on notation.
The computation continues as
$$
\dots = \int\bigl(\varphi(x)-\varphi_0(x)\bigr)^2\,dP_X(x) + o(1),
$$
where one uses the fact that P_X(dx) = f_X(x)dx together with a change of variables, and, in the last step, the facts that ∫K(x′)dx′ = 1 and ∫x′K(x′)dx′ = 0. Now, since f_X ∈ C^{s+1}_{loc} and is bounded below on its support by γ, the same bound holds with Lebesgue measure in place of P_X, where one uses a change of variables z_i = h^{-1}x_i, i = 1,...,d, and the fact that ∫(g(x))²dx ≤ γ^{-1}∫(g(x))²f(x)dx for any square integrable g. The exact same reasoning as above works to show that
$$
\int\bigl(\bar\varphi^*(y)-\bar\varphi^*_0(y)\bigr)^2\,dP_Y(y) \;\le\; C\,(1+o(1))\int\bigl(\varphi(x)-\varphi_0(x)\bigr)^2\,dP_X(x), \qquad (B.11)
$$
for some constant C < +∞, where the second inequality follows from Lemma 4. Since P_X is a probability measure, it follows from Hölder's inequality that ∥ϕ − ϕ_0∥_2 ≤ ∥ϕ − ϕ_0∥_∞. This, in combination with (B.10) and (B.11), implies that
$$
N_{[]}\bigl(\delta,F_n,\|\cdot\|_{L^2(P_X)}\bigr) \lesssim N_{[]}\bigl(\delta,F,\|\cdot\|_{L^2(P_X)}\bigr)
\quad\text{and}\quad
N_{[]}\bigl(\delta,F^*_n,\|\cdot\|_{L^2(P_Y)}\bigr) \lesssim N_{[]}\bigl(\delta,F,\|\cdot\|_{L^2(P_X)}\bigr) \qquad (B.12)
$$
by the definition of bracketing numbers. We can therefore apply Lemma 3.4.2 in van der Vaart and Wellner (2013), which uses bracketing entropy for the respective empirical processes. The bracketing integrals we consider are
$$
\tilde J_{[]}\bigl(\delta,F,L^2(P_X)\bigr) := \int_0^\delta\sqrt{1+\log N_{[]}\bigl(\varepsilon_X,F,\|\cdot\|_{L^2(P_X)}\bigr)}\,d\varepsilon_X.
$$
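The bracketing integral above is controlled by a classical entropy bound for smoothness classes, recalled here for reference (van der Vaart and Wellner, 2013, Thm. 2.7.1; the display is a reminder, not part of the original proof): for a bounded, convex X ⊂ R^d with nonempty interior and the unit ball C^β_1(X) of the Hölder space of smoothness β,

```latex
\log N\bigl(\varepsilon,\; C_1^{\beta}(\mathcal{X}),\; \|\cdot\|_\infty\bigr)
\;\le\; K\,\varepsilon^{-d/\beta},
% for every \varepsilon > 0, with a constant K depending only on \beta, d,
% and \mathcal{X}. With \beta = s + 2 + \alpha (the smoothness of the
% potentials in the classes F and F^*), the bracketing integral
% \tilde J_{[]}(\delta, F, L^2(P_X)) is finite whenever d/\beta < 2.
```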
With this lemma, the proof of the proposition is now straightforward.
We can bound the right-hand side using the fact that P_X is a probability measure together with Hölder's inequality, so that
$$
\|D\hat\varphi_n-D\varphi_0\|^2_{L^2(P_X)} \le c\,(1+\max\{L_1,L_2\})^2\,O_{P^*}(r_n^{-1}).
$$
Taking square roots, this implies that ∥Dϕ̂_n − Dϕ_0∥_{L²(P_X)} = O_{P*}(r_n^{-1/2}), which concludes the proof.