Gaussian random polytopes have received a lot of attention, especially in the case where the dimension is fixed and the number of points goes to infinity. Our focus is on the less-studied case where the dimension goes to infinity and the number of points is proportional to the dimension d. We study several natural quantities associated with Gaussian random polytopes in this setting. First, we show that the expected number of facets is equal to $C(\alpha)^{d+o(d)}$, where $C(\alpha)$ is a constant depending on the constant of proportionality $\alpha$. We also extend this result to the expected number of k-facets. We then consider the more difficult problem of the asymptotics of the expected number of pairs of estranged facets of a Gaussian random polytope. When the number of points is 2d, we determine the constant C such that the expected number of pairs of estranged facets is equal to $C^{d+o(d)}$.
Our primary result concerns the positivity of specific kernels constructed using the q-ultraspherical polynomials. In other words, it concerns a two-parameter family of bivariate, compactly supported distributions. Moreover, this family has the property that all its conditional moments are polynomials in the conditioning random variable. The result is of immediate interest mainly to researchers working on distribution theory, orthogonal polynomials, q-series theory, and the so-called quantum polynomials, and may therefore reach a limited audience. That is why we place our results in a broader context: we recall the theory of Hilbert–Schmidt operators and the idea of Lancaster expansions (LEs) of bivariate distributions that are absolutely continuous with respect to the product of their marginal distributions. Applications of LEs can be found in mathematical statistics and in the construction of Markov processes with polynomial conditional moments (the best known of these processes is the Wiener process).
In network science, community detection is a significant and challenging problem. Modularity [1] is a measure of community structure that compares connectivity in the network with the expected connectivity in a graph sampled from a random null model. Its optimisation is a common approach to the community detection problem. We present a new method for modularity maximisation, based on the observation that modularity can be expressed in terms of total variation on the graph and signless total variation on the null model. The resulting algorithm is of Merriman–Bence–Osher (MBO) type. Unlike earlier methods of this type, the new method can easily accommodate different choices of the null model. Besides theoretical investigations of the method, we include in this paper numerical comparisons with other community detection methods, including the MBO-type methods of Hu et al. [2] and Boyd et al. [3] and the Leiden algorithm [4].
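To make the quantity being maximised concrete, here is a minimal sketch of the standard Newman–Girvan modularity with the configuration null model; the toy graph and partition are illustrative and not taken from the paper.

```python
import numpy as np

def modularity(A, labels):
    """Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * [c_i == c_j]."""
    k = A.sum(axis=1)                 # node degrees
    two_m = k.sum()                   # 2m: total number of edge endpoints
    delta = labels[:, None] == labels[None, :]   # same-community indicator
    B = A - np.outer(k, k) / two_m    # modularity matrix (observed - expected)
    return (B * delta).sum() / two_m

# Two triangles joined by a single edge: a clear two-community structure.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

print(round(modularity(A, np.array([0, 0, 0, 1, 1, 1])), 4))  # → 0.3571
```

The MBO-type method of the paper optimises this same objective via a different (total-variation-based) reformulation; the direct formula above is only the definition being optimised.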
Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the maximum a posteriori (MAP) estimator for clustering a degree-corrected stochastic block model when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP estimator, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even when the labeled data are very noisy.
Novel prediction methods should always be compared to a baseline to determine their performance. Without this frame of reference, the performance score of a model is basically meaningless. What does it mean when a model achieves an $F_1$ of 0.8 on a test set? A proper baseline is, therefore, required to evaluate the ‘goodness’ of a performance score. Comparing results with the latest state-of-the-art model is usually insightful. However, the state of the art is a moving target, as newer models are continuously developed. Instead of an advanced model, it is also possible to use a simple dummy classifier; however, such a model could be beaten too easily, making the comparison less valuable. Furthermore, most existing baselines are stochastic and need to be computed repeatedly to obtain a reliable expected performance, which can be computationally expensive. We present a universal baseline method for all binary classification models, named the Dutch Draw (DD). This approach evaluates a family of simple classifiers and selects the best one to use as a baseline. Theoretically, we derive the DD baseline for many commonly used evaluation measures and show that in most situations it reduces to (almost) always predicting either zero or one. In summary, the DD baseline is general, as it is applicable to any binary classification problem; simple, as it can be determined quickly without training or parameter tuning; and informative, as insightful conclusions can be drawn from the results. The DD baseline serves two purposes. First, it is a robust and universal baseline that enables comparisons across research papers. Second, it provides a sanity check during the prediction model’s development process. When a model does not outperform the DD baseline, it is a major warning sign.
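Since the abstract states that the DD baseline often reduces to (almost) always predicting zero or one, the reduced baseline for the $F_1$ score can be sketched as the better of those two constant classifiers; the function names and data below are illustrative, not the paper's implementation.

```python
def f1(y_true, y_pred):
    """Plain F1 score; returns 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def dd_baseline_f1(y_true):
    """Reduced Dutch-Draw-style baseline for F1: best constant prediction."""
    n = len(y_true)
    all_ones = f1(y_true, [1] * n)    # = 2*pos / (n + pos)
    all_zeros = f1(y_true, [0] * n)   # = 0 whenever a positive exists
    return max(all_ones, all_zeros)

y = [1] * 30 + [0] * 70               # 30% positives
print(round(dd_baseline_f1(y), 4))    # → 0.4615, i.e. 2*30/(100+30)
```

Any model whose $F_1$ on this data does not clearly exceed 0.4615 has not demonstrated real predictive skill, which is exactly the sanity check the baseline is meant to provide.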
We study the community detection problem on a Gaussian mixture model, in which vertices are divided into $k\geq 2$ distinct communities. The major difference in our model is that the intensities of the Gaussian perturbations differ across entries of the observation matrix, and we do not assume that every community has the same number of vertices. We explicitly find the necessary and sufficient conditions for the exact recovery of the maximum likelihood estimation, which can give a sharp phase transition for the exact recovery even though the Gaussian perturbations are not identically distributed; see Section 7. Applications include community detection on hypergraphs.
We explore the limiting spectral distribution of large-dimensional random permutation matrices, assuming the underlying population distribution possesses a general dependence structure. Let $\textbf{X} = (\textbf{x}_1,\ldots,\textbf{x}_n) \in \mathbb{C}^{m \times n}$ be an $m \times n$ data matrix after self-normalization ($n$ samples and $m$ features), where $\textbf{x}_j = (x_{1j}^{*},\ldots, x_{mj}^{*})^{*}$. Specifically, we generate a permutation matrix $\textbf{X}_\pi$ by permuting the entries of $\textbf{x}_j$ $(j=1,\ldots,n)$ and demonstrate that the empirical spectral distribution of $\textbf{B}_n = ({m}/{n})\textbf{U}_{n} \textbf{X}_\pi \textbf{X}_\pi^{*} \textbf{U}_{n}^{*}$ converges weakly to the generalized Marčenko–Pastur distribution with probability 1, where $\textbf{U}_n$ is a sequence of $p \times m$ non-random complex matrices. The conditions we require are $p/n \to c > 0$ and $m/n \to \gamma > 0$.
Measures of uncertainty are a topic of considerable and growing interest. Recently, the introduction of extropy as a measure of uncertainty, dual to Shannon entropy, has opened up interest in new aspects of the subject. Since there are many versions of entropy, a unified formulation has been introduced to work with all of them in an easy way. Here we consider the possibility of defining a unified formulation for extropy by introducing a measure depending on two parameters. For particular choices of parameters, this measure provides the well-known formulations of extropy. Moreover, the unified formulation of extropy is also analyzed in the context of the Dempster–Shafer theory of evidence, and an application to classification problems is given.
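For readers unfamiliar with extropy, a minimal sketch of the discrete case may help: for a probability vector $(p_1,\ldots,p_n)$, Shannon entropy is $H(p) = -\sum_i p_i \ln p_i$ and its dual, extropy (Lad, Sanfilippo and Agrò, 2015), is $J(p) = -\sum_i (1-p_i)\ln(1-p_i)$; by definition their sum equals the sum of the binary entropies of the individual probabilities. This is only the classical formulation, not the two-parameter unified measure introduced in the paper.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p_i * ln(p_i) (0*ln(0) taken as 0)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def extropy(p):
    """Extropy J(p) = -sum (1 - p_i) * ln(1 - p_i), the dual of entropy."""
    return -sum((1 - pi) * math.log(1 - pi) for pi in p if pi < 1)

p = [0.5, 0.25, 0.25]
h, j = entropy(p), extropy(p)
# Duality check: H(p) + J(p) = sum of binary entropies of each p_i.
binary = sum(entropy([pi, 1 - pi]) for pi in p)
print(abs(h + j - binary) < 1e-12)   # → True
```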
We derive an explicit formula for the N-point correlation $F_N(s)$ of the van der Corput sequence in base $2$ for all $N \in \mathbb {N}$ and $s \geq 0$. The formula can be evaluated without explicit knowledge about the elements of the van der Corput sequence. This constitutes the first example of an exact closed-form expression of $F_N(s)$ for all $N \in \mathbb {N}$ and all $s \geq 0$ which does not require explicit knowledge about the involved sequence. Moreover, it can be immediately read off that $\lim _{N \to \infty } F_N(s)$ exists only for $0 \leq s \leq 1/2$.
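For context, the van der Corput sequence in base 2 is obtained by reversing the binary digits of $n$ about the radix point. The short generator below shows the standard construction; it is not the paper's correlation formula, whose point is precisely that no sequence values are needed.

```python
def van_der_corput(n):
    """n-th term of the base-2 van der Corput sequence (binary digit reversal)."""
    x, denom = 0.0, 1.0
    while n:
        denom *= 2
        x += (n & 1) / denom   # least significant bit goes to the largest place
        n >>= 1
    return x

print([van_der_corput(n) for n in range(1, 8)])
# → [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```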
Copulas provide a powerful and flexible tool for modeling the dependence structure of random vectors, and they have many applications in finance, insurance, engineering, hydrology, and other fields. One well-known class of copulas in two dimensions is the Farlie–Gumbel–Morgenstern (FGM) copula, since its simple analytic shape enables closed-form solutions to many problems in applied probability. However, the classical definition of the high-dimensional FGM copula does not enable a straightforward understanding of the effect of the copula parameters on the dependence, nor a geometric understanding of their admissible range. We circumvent this issue by analyzing the FGM copula from a probabilistic approach based on multivariate Bernoulli distributions. This paper examines high-dimensional exchangeable FGM copulas, a subclass of FGM copulas. We show that the set of dependence parameters of exchangeable FGM copulas is the convex hull of a finite number of extreme points. We also leverage the probabilistic interpretation to develop efficient sampling and estimating procedures and provide a simulation study. Throughout, we discover geometric interpretations of the copula parameters that assist one in decoding the dependence of high-dimensional exchangeable FGM copulas.
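To fix ideas in the bivariate case, the FGM copula is $C(u,v) = uv\,(1 + \theta(1-u)(1-v))$ with $|\theta| \le 1$, and exact sampling follows from inverting the conditional distribution $C_{2|1}(v\mid u) = v\,(1 + \theta(1-2u)(1-v))$. The sketch below uses this classical 2-d conditional-inversion construction, not the paper's high-dimensional Bernoulli-based sampler.

```python
import math
import random

def sample_fgm(theta, rng=random):
    """Draw (U, V) from the bivariate FGM copula with parameter |theta| <= 1."""
    u = rng.random()
    w = rng.random()              # uniform target for the conditional CDF
    a = theta * (1 - 2 * u)
    if abs(a) < 1e-12:            # conditional law is uniform in this case
        return u, w
    # Invert C_{2|1}(v|u) = w, i.e. solve a*v^2 - (1+a)*v + w = 0
    # for the root lying in [0, 1] (written in a numerically stable form).
    v = 2 * w / (1 + a + math.sqrt((1 + a) ** 2 - 4 * a * w))
    return u, v

random.seed(0)
pairs = [sample_fgm(0.8) for _ in range(5)]
print(all(0.0 <= u <= 1.0 and 0.0 <= v <= 1.0 for u, v in pairs))
```

Positive $\theta$ induces mild positive dependence (Spearman's rho is $\theta/3$), which is the bounded-dependence feature that makes the high-dimensional exchangeable extension and its parameter polytope interesting.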
The Minkowski functionals, including the Euler characteristic statistics, are standard tools for morphological analysis in cosmology. Motivated by cosmic research, we examine the Minkowski functional of the excursion set for an isotropic central limit random field, whose k-point correlation functions (kth-order cumulants) have the same structure as that assumed in cosmic research. Using 3- and 4-point correlation functions, we derive the asymptotic expansions of the Euler characteristic density, which is the building block of the Minkowski functional. The resulting formula reveals the types of non-Gaussianity that cannot be captured by the Minkowski functionals. As an example, we consider an isotropic chi-squared random field and confirm that the asymptotic expansion accurately approximates the true Euler characteristic density.
In this paper, we consider a financial or insurance system with a finite number of individual risks described by real-valued random variables. We focus on two kinds of risk measures, referred to as the tail moment (TM) and the tail central moment (TCM), which are defined as the conditional moment and conditional central moment of some individual risk in the event of system crisis. The first-order TM and the second-order TCM coincide with the popular risk measures called the marginal expected shortfall and the tail variance, respectively. We derive asymptotic expressions for the TM and TCM with any positive integer orders, when the individual risks are pairwise asymptotically independent and have distributions from certain classes that contain both light-tailed and heavy-tailed distributions. The formulas obtained possess concise forms unrelated to dependence structures, and hence enable us to estimate the TM and TCM efficiently. To demonstrate the wide application of our results, we revisit some issues related to premium principles and optimal capital allocation from the asymptotic point of view. We also give a numerical study on the relative errors of the asymptotic results obtained, under some specific scenarios when there are two individual risks in the system. The corresponding asymptotic properties of the degenerate univariate versions of the TM and TCM are discussed separately in an appendix at the end of the paper.
Risk measurements are clearly central to risk management, in particular for banks, (re)insurance companies, and investment funds. The question of the appropriateness of risk measures for evaluating the risk of financial institutions has been heavily debated, especially after the financial crisis of 2008/2009. Another concern for financial institutions is the pro-cyclicality of risk measurements. In this paper, we extend existing work on the pro-cyclicality of the Value-at-Risk to its main competitors, Expected Shortfall and Expectile: we compare the pro-cyclicality of historical quantile-based risk estimation, taking into account the market state. To characterise the latter, we propose various estimators of the realised volatility. Considering the family of augmented GARCH(p, q) processes (containing well-known GARCH models and iid models as special cases), we prove that the strength of pro-cyclicality depends on three factors: the choice of risk measure and its estimator, the realised volatility estimator, and the model considered; but, no matter the choices, pro-cyclicality is always present. We complement this theoretical analysis by performing simulation studies in the iid case and developing a case study on real data.
Measuring and quantifying dependencies between random variables (RVs) can give critical insights into a dataset. Typical questions are: ‘Do underlying relationships exist?’, ‘Are some variables redundant?’, and ‘Is some target variable Y highly or weakly dependent on variable X?’ Interestingly, despite the evident need for a general-purpose measure of dependency between RVs, in practice most data analysts use the Pearson correlation coefficient to quantify dependence between RVs, even though it is recognized that the correlation coefficient is essentially a measure of linear dependency only. Although many attempts have been made to define more generic dependency measures, there is no consensus yet on a standard, general-purpose dependency function. In fact, several ideal properties of a dependency function have been proposed, but without much argumentation. Motivated by this, we discuss and revise the list of desired properties and propose a new dependency function that meets all these requirements. This general-purpose dependency function provides data analysts with a powerful means to quantify the level of dependence between variables. To this end, we also provide Python code to determine the dependency function for use in practice.
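The limitation of Pearson correlation mentioned above is easy to demonstrate: take $Y = X^2$ with $X$ symmetric about zero, so $Y$ is fully determined by $X$ yet $\mathrm{Cov}(X, X^2) = \mathbb{E}[X^3] = 0$. This small check is illustrative only and uses nothing from the paper's proposed dependency function.

```python
import numpy as np

# Y is a deterministic (nonlinear) function of X, yet the Pearson
# correlation is zero because the dependence has no linear component.
x = np.linspace(-1, 1, 2001)   # symmetric grid around 0
y = x ** 2                     # perfect dependence: Y = X^2
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-10)          # → True: Pearson reports "no dependence"
```

A general-purpose dependency function should instead return a value near its maximum here, since $Y$ is completely predictable from $X$.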
Negative dependence of sequences of random variables is often an interesting characteristic of their distribution, as well as a useful tool for studying various asymptotic results, including central limit theorems, Poisson approximations, the rate of increase of the maximum, and more. In the study of probability models of tournaments, negative dependence of participants’ outcomes arises naturally, with application to various asymptotic results. In particular, the property of negative orthant dependence was proved in several articles for different tournament models, with a special proof for each model. In this note we unify these results by proving a stronger property, negative association, a generalization leading to a very simple proof. We also present a natural example of a knockout tournament where the scores are negatively orthant dependent but not negatively associated. The proof requires a new result on a preservation property of negative orthant dependence that is of independent interest.
Gromov–Wasserstein distances were proposed a few years ago to compare distributions which do not lie in the same space. In particular, they offer an interesting alternative to the Wasserstein distances for comparing probability measures living on Euclidean spaces of different dimensions. We focus on the Gromov–Wasserstein distance with a ground cost defined as the squared Euclidean distance, and we study the form of the optimal plan between Gaussian distributions. We show that when the optimal plan is restricted to Gaussian distributions, the problem has a very simple linear solution, which is also a solution of the linear Gromov–Monge problem. We also study the problem without restriction on the optimal plan, and provide lower and upper bounds for the value of the Gromov–Wasserstein distance between Gaussian distributions.