Search results for Statistics and Probability

Contents
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp vii-x
- Chapter

6 - Discrete and Continuous Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 202-240
- Chapter
Summary

This chapter discusses how to build probabilistic models that include both discrete and continuous variables. Mathematically, this is achieved by defining them as random variables within the same probability space. In practice, the variables are manipulated using their marginal and conditional distributions. We define the conditional pmf of a discrete random variable given a continuous variable, and the conditional probability density of a continuous random variable given a discrete variable. We use these objects to build mixture models and apply them to model height in a population. Next, we describe Gaussian discriminant analysis, a classification method based on mixture models with Gaussian conditional distributions, and apply it to diagnose Alzheimer's disease. Then, we explain how to perform clustering using Gaussian mixture models and leverage the approach to cluster NBA players. Finally, we introduce the framework of Bayesian statistics, which enables us to explicitly encode our uncertainty about model parameters, and use it to analyze poll data from the 2020 United States presidential election.

Index
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 604-608
- Chapter

2 - Discrete Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 37-65
- Chapter
Summary

This chapter introduces random variables and explains how to use them to model uncertain numerical quantities that are discrete. We first provide a mathematical definition of random variables, building upon the framework of probability spaces. Then, we explain how to manipulate discrete random variables in practice, using their probability mass function (pmf), and describe the main properties of the pmf. Motivated by an example where we analyze Kevin Durant's free-throw shooting, we define the empirical pmf, a nonparametric estimator of the pmf that does not make strong assumptions about the data. Next, we define several popular discrete parametric distributions (Bernoulli, binomial, geometric, and Poisson), which yield parametric estimators of the pmf, and explain how to fit them to data via maximum-likelihood estimation. We conclude the chapter by comparing the advantages and disadvantages of nonparametric and parametric models, illustrated by a real-data example, where we model the number of calls arriving at a call center.
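As a rough illustration of the two kinds of estimator named in this summary, the sketch below computes an empirical pmf and a Bernoulli maximum-likelihood estimate. The data are made up for the example (they are not the free-throw or call-center data from the book), and the function names `empirical_pmf` and `bernoulli_mle` are hypothetical.

```python
from collections import Counter


def empirical_pmf(samples):
    """Nonparametric estimate: fraction of samples equal to each observed value."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in counts.items()}


def bernoulli_mle(samples):
    """Maximum-likelihood estimate of the Bernoulli parameter (the sample mean)."""
    return sum(samples) / len(samples)


# Hypothetical data: 1 = made shot, 0 = miss (invented for this example).
shots = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
pmf = empirical_pmf(shots)
theta_hat = bernoulli_mle(shots)
print(pmf, theta_hat)
```

For binary data the two estimates coincide: the empirical pmf at 1 equals the fitted Bernoulli parameter. They differ for richer distributions such as the Poisson, where the parametric model has a single parameter.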

7 - Averaging
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 241-283
- Chapter
Summary

This chapter begins by defining an averaging procedure for random variables, known as the mean. We show that the mean is linear, and also that the mean of the product of independent variables equals the product of their means. Then, we derive the mean of popular parametric distributions. Next, we caution that the mean can be severely distorted by extreme values, as illustrated by an analysis of NBA salaries. In addition, we define the mean square, which is the average squared value of a random variable, and the variance, which is the mean square deviation from the mean. We explain how to estimate the variance from data and use it to describe temperature variability at different geographic locations. Then, we define the conditional mean, a quantity that represents the average of a variable when other variables are fixed. We prove that the conditional mean is an optimal solution to the problem of regression, where the goal is to estimate a quantity of interest as a function of other variables. We end the chapter by studying how to estimate average causal effects.

Book Website
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp xiii-xiv
- Chapter

Preface
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp xi-xii
- Chapter

11 - Principal Component Analysis and Low-Rank Models
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 433-494
- Chapter
Summary

This chapter covers principal component analysis and low-rank models, which are popular techniques to process high-dimensional datasets with many features. We begin by defining the mean of random vectors and random matrices. Then, we introduce the covariance matrix, which encodes the variance of any linear combination of the entries in a random vector, and explain how to estimate it from data. We model the geographic location of Canadian cities as a running example. Next, we present principal component analysis (PCA), a method to extract the directions of maximum variance in a dataset. We explain how to use PCA to find optimal low-dimensional representations of high-dimensional data and apply it to a dataset of human faces. Then, we introduce low-rank models for matrix-valued data and describe how to fit them using the singular-value decomposition. We show that this approach is able to automatically identify meaningful patterns in real-world weather data. Finally, we explain how to estimate missing entries in a matrix under a low-rank assumption and apply this methodology to predict movie ratings via collaborative filtering.

3 - Continuous Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 66-108
- Chapter
Summary

This chapter introduces continuous random variables, which enable us to model uncertain continuous quantities. We again begin with a formal definition, but quickly move on to describe how to manipulate continuous random variables in practice. We define the cumulative distribution function and quantiles (including the median) and explain how to estimate them from data. We then introduce the concept of probability density and describe its main properties. We present two approaches to obtain nonparametric models of probability densities from data: the histogram and kernel density estimation. Next, we define two celebrated continuous parametric distributions – the exponential and the Gaussian – and show how to fit them to data using maximum-likelihood estimation. We use these distributions to model the interarrival time of calls at a call center, and height in a population, respectively. Finally, we discuss how to simulate continuous random variables via inverse transform sampling.
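To make the last technique concrete, here is a minimal sketch of inverse transform sampling for the exponential distribution, one of the two parametric families mentioned in the summary. It assumes the standard exponential cdf F(x) = 1 - exp(-rate * x), whose inverse is -ln(1 - u) / rate. The function name `sample_exponential` and the chosen rate are illustrative, not taken from the book.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one exponential sample by applying the inverse cdf to a uniform draw."""
    u = rng.random()  # uniform on [0, 1), so 1 - u lies in (0, 1]
    return -math.log(1.0 - u) / rate


rng = random.Random(0)  # fixed seed so the run is reproducible
rate = 2.0              # illustrative rate; the true mean is 1 / rate = 0.5
samples = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

With this many samples, the printed sample mean should land close to 1 / rate, which is a quick sanity check that the inverse cdf was applied correctly.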

Introduction and Overview
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 1-5
- Chapter
Summary

This chapter provides an overview of the book, describing the contents of each chapter.

5 - Spatial Partitioning
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp 136-162
- Chapter
Summary

This chapter examines the related objectives of defining spatial clusters and delineating spatial boundaries in discontinuous data. The former often proceeds by grouping together adjacent locations when they have the most similar characteristics; the latter proceeds by estimating boundaries between locations that are most different. For this, there are several methods available that suggest ‘boundary elements’ as possible components of a final division or complete boundary, depending on the kind of data (e.g. binary versus qualitative versus continuous quantitative) and the arrangement of the measured locations (e.g. regular lattice versus irregular spatial network). Once boundaries have been established, statistics are available to evaluate them, including boundary overlap measures. Clusters and boundaries represent two aspects of the same phenomenon, with the same challenge of formalizing similarity and difference in continuous spatial data.

Contents
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp v-x
- Chapter

6 - Spatial Autocorrelation and Inferential Tests
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp 163-188
- Chapter
Summary

The presence of autocorrelation in data violates the usual assumption of independence required for evaluating inferential statistics. We describe several models of autocorrelation in spatial data (both positive and negative). Given two serial variables, x and y, autocorrelation observed in y can be due to inherent autoregression in the variable itself, to autoregression induced by its dependence on x, which has its own autocorrelation, or to double autoregression, with autocorrelation in both variables. This effect can be addressed by estimating the effective sample size (the number of independent observations equivalent in information content to the n autocorrelated observations). We present the calculation of the effective sample size for many inferential statistics, including correlation, partial correlation, t-tests and ANOVA. The use of restricted randomization is explained as a method for testing when other approaches are not available. We also provide recommendations for sampling and experimental design in the presence of spatial autocorrelation.

References
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp 361-393
- Chapter

References
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 602-603
- Chapter

7 - Spatial Regression and Multiscale Analysis
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp 189-226
- Chapter
Summary

Quantifying the relationships between variables is affected by the spatial structure in which they occur and the scales of the processes that affect them. First, this chapter covers the topics of spatial regression, spatial causal inference and the Mantel and partial Mantel statistics. These are all methods designed to assess the relationships between variables of interest within a spatial structure. Then, multiscale analysis is presented because it is key to understanding how ecological processes and patterns change with the scale of observation. Indeed, multiscale analysis has become increasingly important as ecologists address studies at larger and larger scales with increasing probability of significant spatial heterogeneity. We describe several approaches, including multiscale ordination (MSO), Moran’s eigenvector maps (MEMs) and wavelet decomposition.

1 - Probability
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 6-36
- Chapter
Summary

This chapter introduces probability. We begin with an informal definition, which enables us to build intuition about the properties of probability. Then, we present a more rigorous definition, based on the mathematical framework of probability spaces. Next, we describe conditional probability, a concept that makes it possible to update probabilities when additional information is revealed. In our first encounter with statistics, we explain how to estimate probabilities and conditional probabilities from data, as illustrated by an analysis of votes in the United States Congress. Building upon the concept of conditional probability, we define independence and conditional independence, which are critical concepts in probabilistic modeling. The chapter ends with a surprising twist: In practice, probabilities are often impossible to compute analytically! Fortunately, the Monte Carlo method provides a pragmatic solution to this challenge, allowing us to approximate probabilities very accurately using computer simulations. We apply it to a 3 × 3 basketball tournament from the 2020 Tokyo Olympics.
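The Monte Carlo idea can be sketched on a toy problem that is unrelated to the book's basketball example: estimating the probability that two fair dice sum to at least 10. The exact answer is 6/36, which makes it easy to check the simulation. The function name `estimate_probability` and the dice event are assumptions chosen for illustration.

```python
import random


def estimate_probability(event, n_trials, rng):
    """Monte Carlo estimate: fraction of simulated trials in which the event occurs."""
    hits = sum(1 for _ in range(n_trials) if event(rng))
    return hits / n_trials


def dice_sum_at_least_10(rng):
    """Simulate two fair dice and report whether their sum is 10 or more."""
    return rng.randint(1, 6) + rng.randint(1, 6) >= 10


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = estimate_probability(dice_sum_at_least_10, 200_000, rng)
exact = 6 / 36  # outcomes (4,6), (5,5), (6,4), (5,6), (6,5), (6,6)
print(estimate, exact)
```

The same pattern applies when no closed form is available: write a function that simulates one outcome, then average the indicator of the event over many independent runs.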

1 - Ecological Processes
Mark R. T. Dale, University of Northern British Columbia, Marie-Josée Fortin, University of Toronto
Book:

Spatial Analysis

Published online:

19 June 2025

Print publication:

03 July 2025, pp 1-19
- Chapter
Summary

This first chapter sets the context for the topics covered throughout the book by introducing the relationship between ecological processes and spatial structure, and by clarifying terminology related to both. These processes and spatial analysis methods are classified by several criteria, including static versus dynamic data and one versus several species. The concept of scale is applied to spatial, temporal and organizational contexts. The chapter provides a discussion regarding the background and motivation for spatial analysis in ecological research.

Collective Defined Contribution Design Principles
Journal:

British Actuarial Journal / Volume 30 / 2025

Published online by Cambridge University Press:

01 July 2025, e23
- Article
- Open access

Transmission pathways and risk factors for sporadic salmonellosis and campylobacteriosis: a source attribution meta-analysis of European case-control studies
Lapo Mughini-Gras, Lena Wijnen, Sara M. Pires, Elisa Benincà, Charlotte Onstwedder, Tine Hald, Eelco Franz, Axel Bonacic Marinovic
Journal:

Epidemiology & Infection / Volume 153 / 2025

Published online by Cambridge University Press:

01 July 2025, e77
- Article
- Open access
Case-control studies can provide attribution estimates of the likely sources of zoonotic pathogens. We applied a meta-analytical model within a Bayesian estimation framework to pool population attributable fractions (PAFs) from European case-control studies of sporadic campylobacteriosis and salmonellosis. The input data were obtained from two existing systematic reviews, supplemented with additional literature searches, covering the period 2000–2021. In total, 12 studies on Campylobacter providing data for 180 PAFs referring to 5983 cases and 13213 controls, and five studies on Salmonella providing data for 75 PAFs referring to 2908 cases and 5913 controls, were included. All these studies were conducted in Western or Northern European countries. Both pathogens were estimated as being predominantly linked to food- and waterborne transmission, which explained nearly half of the cases, with Campylobacter being mainly attributable to poultry (meat), and Salmonella to poultry (eggs and meat) and pig (meat), as specific foodborne exposures. When also considering contact with animals, around 60% of cases could be explained by the larger group of zoonotic transmission pathways. While environmental transmission was also sizeable (around 10%), about a quarter of cases could be explained by factors such as travel, underlying diseases/medicine use, person-to-person transmission and occupational exposure.
