Search results for Statistical theory and methods

Appendix C - Overview of Machine Learning Techniques
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp 454-455
- Chapter
Summary

This appendix provides a concise introduction to key machine learning techniques employed throughout the book. It focuses on two main areas: unsupervised learning and Bayesian classification. The appendix begins with an exploration of K-means clustering, a fundamental unsupervised learning algorithm, demonstrating its application to network community detection. It then discusses methods for evaluating unsupervised learning techniques, including confusion matrices and the adjusted Rand index. The silhouette score is introduced as a metric for assessing clustering quality across different numbers of clusters. The appendix concludes with an explanation of the Bayes plugin classifier, a simple yet effective tool for network classification tasks.
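The adjusted Rand index mentioned in this summary can be sketched in a few lines. The snippet below is an illustrative standard-library implementation, not code from the book; the function name `adjusted_rand_index` and the six-node label vectors are hypothetical. It shows that the index ignores how clusters are named and drops below 1 when a node is misassigned.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Contingency counts: items falling in each (true, predicted) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs grouped together under each view.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical community labels for six nodes (made up for illustration).
truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]    # same partition, different cluster names
one_mistake = [0, 0, 1, 1, 1, 1]  # one node assigned to the wrong community

ari_relabeled = adjusted_rand_index(truth, relabeled)
ari_mistake = adjusted_rand_index(truth, one_mistake)
```

Because cluster names from an unsupervised method are arbitrary, `ari_relabeled` indicates perfect agreement while `ari_mistake` is strictly smaller.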

Part III - Applications
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp 257-258
- Chapter

Contents
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp v-viii
- Chapter

Terminology
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp xvii-xviii
- Chapter

3 - Characterizing and Preparing Network Data
from Part II - Representations
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp 43-97
- Chapter
Summary

This chapter establishes the foundation for network machine learning. We begin with network fundamentals: adjacency matrices, edge directionality, node loops, and edge weights. We then explore node-specific properties such as degree and path length, followed by network-wide metrics including density, clustering coefficients, and average path lengths. The chapter progresses to advanced matrix representations, notably degree matrices and various Laplacian forms, which are crucial for spectral analysis methods. We examine subnetworks and connected components, tools for focusing on relevant network structures. The latter half of the chapter delves into preprocessing techniques. We cover node pruning methods to manage outliers and low-degree nodes. Edge regularization techniques, including thresholding and sparsification, address issues in weighted and dense networks. Finally, we explore edge-weight rescaling methods such as z-score standardization and ranking-based approaches. Throughout, we emphasize practical applications, illustrating concepts with examples and code snippets. These preprocessing steps are vital for addressing noise, sparsity, and computational challenges in network data. By mastering these concepts and techniques, readers will be well-equipped to prepare network data for sophisticated machine learning tasks, setting the stage for the advanced methods presented in subsequent chapters.
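As a rough illustration of the quantities named in this summary, the sketch below computes node degrees, network density, the degree matrix, and the unnormalized Laplacian L = D - A. The four-node adjacency matrix is hypothetical, and the network is assumed to be undirected, unweighted, and loopless; the code is not taken from the book.

```python
# Hypothetical undirected, unweighted, loopless network on four nodes.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
n = len(A)

# Node degree: number of edges incident to each node (row sums of A).
degrees = [sum(row) for row in A]

# Density: observed edges divided by the n * (n - 1) / 2 possible edges.
num_edges = sum(degrees) // 2  # each undirected edge is counted twice
density = num_edges / (n * (n - 1) / 2)

# Degree matrix D (diagonal) and the unnormalized Laplacian L = D - A.
D = [[degrees[i] if i == j else 0 for j in range(n)] for i in range(n)]
L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
```

Every row of `L` sums to zero, a quick sanity check for the unnormalized Laplacian.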

9 - Deep Learning Methods
from Part III - Applications
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp 377-410
- Chapter
Summary

This chapter explores deep learning methods for network analysis, focusing on graph neural networks (GNNs) and diffusion-based approaches. We introduce GNNs through a drug discovery case study, demonstrating how molecular structures can be analyzed as networks. The chapter covers GNN architecture, training processes, and their ability to learn complex network representations without explicit feature engineering. We then examine diffusion-based methods, which use random walks to develop network embeddings. These techniques are compared and contrasted with earlier spectral approaches, highlighting their capacity to capture nonlinear relationships and local network structures. Practical implementations using frameworks such as PyTorch Geometric illustrate the application of these methods to large-scale network datasets, showcasing their power in addressing complex network problems across various domains.

8 - Applications for Multiple Networks
from Part III - Applications
Eric W. Bridgeford, The Johns Hopkins University, Alexander R. Loftus, The Johns Hopkins University, Joshua T. Vogelstein, The Johns Hopkins University
Book:

Hands-On Network Machine Learning with Python

Published online:

23 September 2025

Print publication:

18 September 2025, pp 357-376
- Chapter
Summary

This chapter explores advanced applications of network machine learning for multiple networks. We introduce anomaly detection in time series of networks, identifying significant structural changes over time. The chapter then focuses on signal subnetwork estimation for network classification tasks. We present both incoherent and coherent approaches, with incoherent methods identifying edges that best differentiate between network classes, and coherent methods leveraging additional network structure to improve classification accuracy. Practical applications, such as classifying brain networks, are emphasized throughout. These techniques apply to collections of networks, providing a toolkit for analyzing and classifying complex, multinetwork datasets. By integrating previous concepts with new methodologies, we offer a framework for extracting insights and making predictions from diverse network structures with associated attributes.

Preface
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp xi-xii
- Chapter

5 - Multiple Continuous Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 161-201
- Chapter
Summary

In this chapter, we describe how to jointly model continuous quantities by representing them as multiple continuous random variables within the same probability space. We define the joint cumulative distribution function and the joint probability density function and explain how to estimate the latter from data using a multivariate generalization of kernel density estimation. Next, we introduce marginal and conditional distributions of continuous variables and also discuss independence and conditional independence. Throughout, we model real-world temperature data as a running example. Then, we explain how to jointly simulate multiple random variables, in order to correctly account for the dependence between them. Finally, we define Gaussian random vectors, which are the most popular multidimensional parametric model for continuous data, and apply them to model anthropometric data.

8 - Correlation
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 284-324
- Chapter
Summary

This chapter focuses on correlation, a key metric in data science that quantifies to what extent two quantities are linearly related. We begin by defining correlation between normalized and centered random variables. Then, we generalize the definition to all random variables and introduce the concept of covariance, which measures the average joint variation of two random variables. Next, we explain how to estimate correlation from data and analyze the correlation between the height of NBA players and different basketball stats. In addition, we study the connection between correlation and simple linear regression. We then discuss the differences between uncorrelatedness and independence. In order to gain better intuition about the properties of correlation, we provide a geometric interpretation of correlation, where the covariance is an inner product between random variables. Finally, we show that correlation does not imply causation, as illustrated by the spurious correlation between temperature and unemployment in Spain.
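A minimal sketch of the sample correlation coefficient, written as the covariance divided by the product of the standard deviations. The function name `correlation` and both small datasets are made up for illustration and do not come from the book. The second dataset shows a pair of quantities that are uncorrelated yet clearly dependent.

```python
from statistics import fmean, pstdev


def correlation(x, y):
    """Sample correlation: covariance of x and y over the product of their
    (population-style) standard deviations."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x, mean_y = fmean(x), fmean(y)
    # Covariance: average product of the centered values.
    covariance = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return covariance / (pstdev(x) * pstdev(y))


# Hypothetical data: y is an exact linear function of x.
x_linear = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]
r_linear = correlation(x_linear, y_linear)

# Hypothetical data: y is fully determined by x (y = x**2), yet the
# linear association is zero because x is symmetric around zero.
x_sym = [-2, -1, 0, 1, 2]
y_quadratic = [v ** 2 for v in x_sym]
r_quadratic = correlation(x_sym, y_quadratic)
```

Here `r_linear` is (up to rounding) one and `r_quadratic` is zero, which is why a zero correlation cannot be read as independence.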

Dedication
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp v-vi
- Chapter

10 - Hypothesis Testing
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 390-432
- Chapter
Summary

This chapter presents hypothesis testing, which is used to evaluate whether the available data provide sufficient evidence to support a certain hypothesis. The main idea is to play devil's advocate and assume a null hypothesis, which contradicts our hypothesis of interest. We explain how to use parametric modeling to implement this idea, and define the p-value. We prove that thresholding the p-value controls the probability of false positives. In addition, we define the power of a test, which quantifies the test's ability to identify positive findings. Next, we show how to perform hypothesis testing without a parametric model, focusing on the permutation test. Then, we discuss multiple testing, a setting where many tests are performed simultaneously. Finally, we provide three reasons why hypothesis testing should not be used as the only stamp of approval for scientific discoveries. First, hypothesis testing does not necessarily identify causal effects; it is complementary to causal inference. Second, small p-values do not imply practical significance. Third, relying on p-values to validate findings produces a strong incentive to cherry-pick results.
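The permutation test mentioned in this summary can be sketched as follows. This is one common Monte Carlo variant (a two-sided test on the difference of group means, with an add-one correction), not necessarily the formulation used in the book; the function name `permutation_p_value` and the measurements are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, num_permutations=5_000, seed=0):
    """Monte Carlo permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(num_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffling the pooled data simulates the null distribution.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction: the observed labeling counts as one permutation,
    # which keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


# Hypothetical measurements for two groups (made up for illustration).
treated = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
control = [4.2, 4.4, 4.0, 4.8, 4.5, 4.1]

p_different = permutation_p_value(treated, control)
p_same = permutation_p_value(treated, treated)
```

The two hypothetical groups do not overlap, so `p_different` is small, whereas comparing a group with itself gives a p-value of one because every shuffled difference is at least as large as the observed difference of zero.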

Frontmatter
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp i-iv
- Chapter

9 - Estimation of Population Parameters
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 325-389
- Chapter
Summary

This chapter explains how to estimate population parameters from data. We introduce random sampling, an approach that yields accurate estimates from limited data. We then define the bias and the standard error, which quantify the average error of an estimator and how much it varies, respectively. In addition, we derive deviation bounds and use them to prove the law of large numbers, which states that averaging many independent samples from a distribution yields an accurate estimate of its mean. An important consequence is that random sampling provides a precise estimate of means and proportions. However, we caution that this is not necessarily the case if the data contain extreme values. Next, we discuss the central limit theorem (CLT), according to which averages of independent quantities tend to be Gaussian. We again provide a cautionary tale, warning that this does not hold in the absence of independence. Then, we explain how to use the CLT to build confidence intervals, which quantify the uncertainty of estimates obtained from finite data. Finally, we introduce the bootstrap, a popular computational technique to estimate standard errors and build confidence intervals.
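To make the bootstrap concrete, here is a small sketch that resamples a dataset with replacement to estimate the standard error of the sample mean and a percentile confidence interval. The percentile interval is only one of several bootstrap constructions and is chosen here as an assumption; the function name `bootstrap_means` and the data are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def bootstrap_means(sample, num_resamples=2_000, seed=0):
    """Means of resamples drawn with replacement from the sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    return [fmean(rng.choices(sample, k=n)) for _ in range(num_resamples)]


# Hypothetical measurements (made up for illustration).
sample = [4.8, 5.1, 5.5, 4.9, 5.3, 5.0, 5.7, 4.6, 5.2, 5.4]

means = bootstrap_means(sample)

# Bootstrap standard error: spread of the resampled means.
standard_error = stdev(means)

# Approximate 95% percentile interval from the sorted resampled means.
sorted_means = sorted(means)
lower = sorted_means[int(0.025 * len(sorted_means))]
upper = sorted_means[int(0.975 * len(sorted_means)) - 1]

# Textbook standard error of the mean, for comparison.
analytic_standard_error = stdev(sample) / sqrt(len(sample))
```

For the sample mean, `standard_error` should land close to `analytic_standard_error`; the bootstrap becomes more useful for estimators that lack such a closed-form expression.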

12 - Regression and Classification
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 495-598
- Chapter
Summary

This chapter covers regression and classification, where the goal is to estimate a quantity of interest (the response) from observed features. In regression, the response is a numerical variable. In classification, it belongs to a finite set of predetermined classes. We begin with a comprehensive description of linear regression and discuss how to leverage it to perform causal inference. Then, we explain under what conditions linear models tend to overfit or to generalize robustly to held-out data. Motivated by the threat of overfitting, we introduce regularization and ridge regression, and discuss sparse regression, where the goal is to fit a linear model that only depends on a small subset of the available features. Then, we introduce two popular linear models for binary and multiclass classification: logistic and softmax regression. At this point, we turn our attention to nonlinear models. First, we present regression and classification trees and explain how to combine them via bagging, random forests, and boosting. Second, we explain how to train neural networks to perform regression and classification. Finally, we discuss how to evaluate classification models.

4 - Multiple Discrete Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 109-160
- Chapter
Summary

This chapter describes how to model multiple discrete quantities as discrete random variables within the same probability space and manipulate them using their joint pmf. We explain how to estimate the joint pmf from data, and use it to model precipitation in Oregon. Then, we introduce marginal distributions, which describe the individual behavior of each variable in a model, and conditional distributions, which describe the behavior of a variable when other variables are fixed. Next, we generalize the concepts of independence and conditional independence to random variables. In addition, we discuss the problem of causal inference, which seeks to identify causal relationships between variables. We then turn our attention to a fundamental challenge: It is impossible to completely characterize the dependence between all variables in a model, unless they are very few. This phenomenon, known as the curse of dimensionality, is the reason why independence assumptions are needed to make probabilistic models tractable. We conclude the chapter by describing two popular models based on such assumptions: Naive Bayes and Markov chains.

Appendix - Datasets
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 599-601
- Chapter

Contents
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp vii-x
- Chapter

6 - Discrete and Continuous Variables
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 202-240
- Chapter
Summary

This chapter discusses how to build probabilistic models that include both discrete and continuous variables. Mathematically, this is achieved by defining them as random variables within the same probability space. In practice, the variables are manipulated using their marginal and conditional distributions. We define the conditional pmf of a discrete random variable given a continuous variable, and the conditional probability density of a continuous random variable given a discrete variable. We use these objects to build mixture models and apply them to model height in a population. Next, we describe Gaussian discriminant analysis, a classification method based on mixture models with Gaussian conditional distributions, and apply it to diagnose Alzheimer's disease. Then, we explain how to perform clustering using Gaussian mixture models and leverage the approach to cluster NBA players. Finally, we introduce the framework of Bayesian statistics, which enables us to explicitly encode our uncertainty about model parameters, and use it to analyze poll data from the 2020 United States presidential election.

Index
Carlos Fernandez-Granda, New York University
Book:

Probability and Statistics for Data Science

Published online:

19 June 2025

Print publication:

03 July 2025, pp 604-608
- Chapter

2351 results in Statistical theory and methods