Search results for Statistical theory and methods

2 - Discrete Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 37-65
- Chapter
Summary

This chapter introduces random variables and explains how to use them to model uncertain numerical quantities that are discrete. We first provide a mathematical definition of random variables, building upon the framework of probability spaces. Then, we explain how to manipulate discrete random variables in practice, using their probability mass function (pmf), and describe the main properties of the pmf. Motivated by an example where we analyze Kevin Durant's free-throw shooting, we define the empirical pmf, a nonparametric estimator of the pmf that does not make strong assumptions about the data. Next, we define several popular discrete parametric distributions (Bernoulli, binomial, geometric, and Poisson), which yield parametric estimators of the pmf, and explain how to fit them to data via maximum-likelihood estimation. We conclude the chapter by comparing the advantages and disadvantages of nonparametric and parametric models, illustrated by a real-data example, where we model the number of calls arriving at a call center.
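As a minimal sketch of the two estimators named in this summary, the snippet below computes an empirical pmf and a Poisson maximum-likelihood fit. The call counts are made-up illustrative values, not data from the book, and the function names are hypothetical.

```python
from collections import Counter
import math

def empirical_pmf(samples):
    """Nonparametric estimate: fraction of samples equal to each observed value."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in sorted(counts.items())}

def poisson_mle(samples):
    """The maximum-likelihood estimate of the Poisson rate is the sample mean."""
    return sum(samples) / len(samples)

def poisson_pmf(k, rate):
    """Poisson pmf evaluated at a nonnegative integer k."""
    return math.exp(-rate) * rate**k / math.factorial(k)

# Hypothetical counts of calls per time interval (illustrative only).
calls = [2, 0, 3, 1, 2, 4, 1, 2, 0, 3]
pmf_hat = empirical_pmf(calls)   # nonparametric estimate
rate_hat = poisson_mle(calls)    # parametric estimate (one parameter)

for k in pmf_hat:
    print(k, pmf_hat[k], round(poisson_pmf(k, rate_hat), 3))
```

The empirical pmf assigns zero probability to any value not seen in the data, whereas the fitted Poisson pmf is positive for every nonnegative integer; this is one concrete form of the nonparametric versus parametric trade-off the summary mentions.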

7 - Averaging
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 241-283
- Chapter
Summary

This chapter begins by defining an averaging procedure for random variables, known as the mean. We show that the mean is linear, and also that the mean of the product of independent variables equals the product of their means. Then, we derive the mean of popular parametric distributions. Next, we caution that the mean can be severely distorted by extreme values, as illustrated by an analysis of NBA salaries. In addition, we define the mean square, which is the average squared value of a random variable, and the variance, which is the mean square deviation from the mean. We explain how to estimate the variance from data and use it to describe temperature variability at different geographic locations. Then, we define the conditional mean, a quantity that represents the average of a variable when other variables are fixed. We prove that the conditional mean is an optimal solution to the problem of regression, where the goal is to estimate a quantity of interest as a function of other variables. We end the chapter by studying how to estimate average causal effects.
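The sensitivity of the mean to extreme values can be illustrated with a small sketch using Python's standard `statistics` module. The numbers are made up for illustration and are not the NBA salary data analyzed in the chapter.

```python
import statistics

# Hypothetical salaries in millions (illustrative only): one extreme value.
salaries = [1, 2, 2, 3, 50]

mean_salary = statistics.mean(salaries)          # pulled upward by the outlier
median_salary = statistics.median(salaries)      # unaffected by the outlier's size
sample_variance = statistics.variance(salaries)  # sample variance (divides by n - 1)

print(mean_salary, median_salary, sample_variance)
```

Here a single large value moves the mean far from the typical entry, while the median stays at a typical value.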

Book Website
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp xiii-xiv
- Chapter

11 - Principal Component Analysis and Low-Rank Models
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 433-494
- Chapter
Summary

This chapter covers principal component analysis and low-rank models, which are popular techniques to process high-dimensional datasets with many features. We begin by defining the mean of random vectors and random matrices. Then, we introduce the covariance matrix, which encodes the variance of any linear combination of the entries in a random vector, and explain how to estimate it from data. We model the geographic location of Canadian cities as a running example. Next, we present principal component analysis (PCA), a method to extract the directions of maximum variance in a dataset. We explain how to use PCA to find optimal low-dimensional representations of high-dimensional data and apply it to a dataset of human faces. Then, we introduce low-rank models for matrix-valued data and describe how to fit them using the singular-value decomposition. We show that this approach is able to automatically identify meaningful patterns in real-world weather data. Finally, we explain how to estimate missing entries in a matrix under a low-rank assumption and apply this methodology to predict movie ratings via collaborative filtering.

3 - Continuous Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 66-108
- Chapter
Summary

This chapter introduces continuous random variables, which enable us to model uncertain continuous quantities. We again begin with a formal definition, but quickly move on to describe how to manipulate continuous random variables in practice. We define the cumulative distribution function and quantiles (including the median) and explain how to estimate them from data. We then introduce the concept of probability density and describe its main properties. We present two approaches to obtain nonparametric models of probability densities from data: the histogram and kernel density estimation. Next, we define two celebrated continuous parametric distributions – the exponential and the Gaussian – and show how to fit them to data using maximum-likelihood estimation. We use these distributions to model the interarrival time of calls at a call center, and height in a population, respectively. Finally, we discuss how to simulate continuous random variables via inverse transform sampling.
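Inverse transform sampling can be sketched for the exponential distribution, whose cdf F(x) = 1 - exp(-rate * x) has a closed-form inverse. The rate, seed, and sample size below are arbitrary choices for illustration, and the function name is hypothetical.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one exponential sample by applying the inverse cdf to a uniform sample."""
    u = rng.random()                   # uniform on [0, 1)
    return -math.log(1.0 - u) / rate   # F^{-1}(u); 1 - u lies in (0, 1], so the log is defined

rng = random.Random(0)                 # fixed seed for reproducibility
rate = 2.0                             # arbitrary illustrative rate
samples = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(sample_mean)                     # should be close to 1 / rate
```

The same recipe applies to any distribution whose cdf can be inverted, analytically or numerically.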

Introduction and Overview
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 1-5
- Chapter
Summary

This chapter provides an overview of the book, describing the contents of each chapter.

References
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 602-603
- Chapter

1 - Probability
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 6-36
- Chapter
Summary

This chapter introduces probability. We begin with an informal definition, which enables us to build intuition about the properties of probability. Then, we present a more rigorous definition, based on the mathematical framework of probability spaces. Next, we describe conditional probability, a concept that makes it possible to update probabilities when additional information is revealed. In our first encounter with statistics, we explain how to estimate probabilities and conditional probabilities from data, as illustrated by an analysis of votes in the United States Congress. Building upon the concept of conditional probability, we define independence and conditional independence, which are critical concepts in probabilistic modeling. The chapter ends with a surprising twist: In practice, probabilities are often impossible to compute analytically! Fortunately, the Monte Carlo method provides a pragmatic solution to this challenge, allowing us to approximate probabilities very accurately using computer simulations. We apply it to a 3 × 3 basketball tournament from the 2020 Tokyo Olympics.
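The Monte Carlo idea can be sketched on a toy problem where the exact answer is known: the probability that two fair dice sum to at least 10. This is an illustrative stand-in, not the tournament simulation from the chapter, and all names are hypothetical.

```python
import random

def monte_carlo_probability(event, simulate, trials, rng):
    """Approximate P(event) by the fraction of simulated outcomes where it occurs."""
    hits = sum(1 for _ in range(trials) if event(simulate(rng)))
    return hits / trials

def roll_two_dice(rng):
    """Simulate one outcome: a pair of independent fair six-sided dice."""
    return rng.randint(1, 6), rng.randint(1, 6)

rng = random.Random(0)   # fixed seed for reproducibility
estimate = monte_carlo_probability(
    event=lambda roll: sum(roll) >= 10,
    simulate=roll_two_dice,
    trials=200_000,
    rng=rng,
)
exact = 6 / 36           # six of the 36 equally likely outcomes sum to 10 or more

print(estimate, exact)
```

In problems where no closed-form answer exists, only the simulation step changes; the estimate is still the fraction of simulated outcomes in which the event occurs.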

Probability and Statistics for Data Science

Carlos Fernandez-Granda
Published online:

19 June 2025

Print publication:

03 July 2025
- Book
This self-contained guide introduces two pillars of data science, probability theory and statistics, side by side, in order to illuminate the connections between statistical techniques and the probabilistic concepts they are based on. The topics covered in the book include random variables, nonparametric and parametric models, correlation, estimation of population parameters, hypothesis testing, principal component analysis, and both linear and nonlinear methods for regression and classification. Examples throughout the book draw from real-world datasets to demonstrate concepts in practice and confront readers with fundamental challenges in data science, such as overfitting, the curse of dimensionality, and causal inference. Code in Python reproducing these examples is available on the book's website, along with videos, slides, and solutions to exercises. This accessible book is ideal for undergraduate and graduate students, data science practitioners, and others interested in the theoretical concepts underlying data science methods.

7 - Spatial Dependent Sequences
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 102-117
- Chapter
Summary

In this chapter we present two spatial dependent models: one based on defining a latent variable for each area, and the other by defining one latent variable for each pair of latent areas. We call the latter the latent edges model. We compare both models with a real data set. Extensions to spatio-temporal constructions are also considered.

2 - Conjugate Models
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 23-32
- Chapter
Summary

In this chapter we define what a conjugate family is in a Bayesian analysis context and develop detailed examples of some cases; in particular, we review the beta and binomial case, the Pareto and inverse Pareto case, the gamma and gamma case and the gamma and Poisson case. We conclude by providing a list of conjugate models.
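The beta and binomial case can be sketched as follows, under the standard result that a Beta(a, b) prior combined with a binomial likelihood yields a Beta(a + successes, b + failures) posterior. The prior parameters and data are arbitrary illustrative values, and the function names are hypothetical.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters after observing binomial data under a Beta(a, b) prior."""
    failures = trials - successes
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Uniform prior Beta(1, 1); hypothetical data: 7 successes in 10 trials.
post_a, post_b = beta_binomial_update(1.0, 1.0, successes=7, trials=10)
posterior_mean = beta_mean(post_a, post_b)

print(post_a, post_b, posterior_mean)
```

Because the posterior stays in the beta family, the update reduces to adding counts to the prior parameters, which is what makes the pair conjugate.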

Preface
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp ix-xii
- Chapter

Acknowledgments
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp xiii-xiv
- Chapter

Dedication
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp v-vi
- Chapter

Index
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 135-136
- Chapter

6 - Temporal Dependent Sequences
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 84-101
- Chapter
Summary

In this chapter we show how to define temporal dependent sequences using a moving average type of construction. We compare the performance of this construction with a Markov-process type. We finally extend the models to include seasonal and periodic dependencies.

5 - General Dependent Sequences
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 70-83
- Chapter
Summary

In this chapter we start with some attempts to construct dependent sequences with order larger than one and present a general result to achieve an invariant distribution via a three-level hierarchical model. We finally present some results involving exponential families.

Frontmatter
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp i-iv
- Chapter

4 - Markov Sequences
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 51-69
- Chapter
Summary

In this chapter we describe a general procedure to construct Markov sequences with invariant distributions. The procedure can be used with conjugate and non-conjugate models and with parametric and nonparametric distributions. We derive several examples in detail and finish with some applications in survival analysis.

Appendix - Data Sets
Luis E. Nieto-Barajas, Instituto Tecnológico Autónomo de México (ITAM)
Book:

Dependence Models via Hierarchical Structures

Published online:

20 March 2025

Print publication:

27 March 2025, pp 125-131
- Chapter

2351 results in Statistical theory and methods
