Search results for Statistical theory and methods

Large Sample Covariance Matrices and High-Dimensional Data Analysis

Jianfeng Yao, Shurong Zheng, Zhidong Bai
Published online:

05 April 2015

Print publication:

26 March 2015
- Book
High-dimensional data appear in many fields, and their analysis has become increasingly important in modern statistics. However, it has long been observed that several well-known methods in multivariate analysis become inefficient, or even misleading, when the data dimension p is larger than, say, several tens. A seminal example is the well-known inefficiency of Hotelling's T²-test in such cases. This example shows that classical large sample limits may no longer hold for high-dimensional data; statisticians must seek new limiting theorems in these instances. Thus, the theory of random matrices (RMT) serves as a much-needed and welcome alternative framework. Based on the authors' own research, this book provides a firsthand introduction to new high-dimensional statistical methods derived from RMT. The book begins with a detailed introduction to useful tools from RMT, and then presents a series of high-dimensional problems with solutions provided by RMT methods.

Appendix A - Some distributions
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 213-218
- Chapter

Frontmatter
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp i-iv
- Chapter

Appendix B - Matrix computation
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 219-232
- Chapter

3 - R
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 49-78
- Chapter
Summary

Statistical analysis of interesting datasets is conducted using computers. Various specialised computer programmes are available to facilitate statistical work. For using general statistical theory directly with custom-built models, R is probably the most usefully flexible of such programmes.
R (R Core Team, 2012) is a programming language and environment designed for statistical analysis. It is free (see http://cran.r-project.org to obtain a copy) and is written and maintained by a community of statisticians. A major design feature is extendibility. R makes it very straightforward to code up statistical methods in a way that is easy to distribute and for others to use. The first place to look for information on getting started with R is http://cran.r-project.org/manuals.html. I will assume that you have installed R, can start it to obtain a command console, and have at least discovered the function q() for quitting R.
The following web resources provide excellent guides to the R language at different levels.
• http://cran.r-project.org/doc/contrib/Short-refcard.pdf is a four-page summary of key functions and functionality.
• http://cran.r-project.org/doc/contrib/R_language.pdf is a very concise introduction to and reference for the structure of the language.
• http://cran.r-project.org/doc/manuals/R-lang.html is the main reference manual for the language.
A huge amount of statistical functionality is built into R and its extension packages, but the aim of this chapter is simply to give a brief overview of R as a statistical programming language.
Basic structure of R
When you start R (interactively) two important things are created: a command prompt at which to type commands telling R what to do, and an environment, known interchangeably as the ‘global environment’ or ‘user workspace’ to hold the objects created by your commands. Unlike the command prompt, you do not see the global environment directly, but it is there as an extendible chunk of computer memory for holding your data, commands and other objects.
Generically in R an ‘environment’ consists of two things. The first, known in R jargon as a frame, is a set of symbols used to refer to objects, along with the data defining those objects.

Preface
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp viii-viii
- Chapter
Summary

This book is aimed at the numerate reader who has probably taken an introductory statistics and probability course at some stage and would like a brief introduction to the core methods of statistics and how they are applied, not necessarily in the context of standard models. The first chapter is a brief review of some basic probability theory needed for what follows. Chapter 2 discusses statistical models and the questions addressed by statistical inference and introduces the maximum likelihood and Bayesian approaches to answering them. Chapter 3 is a short overview of the R programming language. Chapter 4 provides a concise coverage of the large sample theory of maximum likelihood estimation, and Chapter 5 discusses the numerical methods required to use this theory. Chapter 6 covers the numerical methods useful for Bayesian computation, in particular Markov chain Monte Carlo. Chapter 7 provides a brief tour of the theory and practice of linear modelling. Appendices then cover some useful information on common distributions, matrix computation and random number generation. The book is neither an encyclopedia nor a cookbook, and the bibliography aims to provide a compact list of the most useful sources for further reading, rather than being extensive. The aim is to offer a concise coverage of the core knowledge needed to understand and use parametric statistical methods and to build new methods for analysing data. Modern statistics exists at the interface between computation and theory, and this book reflects that fact. I am grateful to Nicole Augustin, Finn Lindgren, the editors at Cambridge University Press and the students on the Bath course ‘Applied Statistical Inference’ and the Academy for PhD Training in Statistics course ‘Statistical Computing’ for many useful comments.

7 - Linear models
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 184-212
- Chapter

4 - Theory of maximum likelihood estimation
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 79-88
- Chapter

1 - Random variables
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 1-18
- Chapter
Summary

Random variables
Statistics is about extracting information from data that contain an inherently unpredictable component. Random variables are the mathematical construct used to build models of such variability. A random variable takes a different value, at random, each time it is observed. We cannot say, in advance, exactly what value will be taken, but we can make probability statements about the values likely to occur. That is, we can characterise the distribution of values taken by a random variable. This chapter briefly reviews the technical constructs used for working with random variables, as well as a number of generally useful related results. See De Groot and Schervish (2002) or Grimmett and Stirzaker (2001) for fuller introductions.
Cumulative distribution functions
The cumulative distribution function (c.d.f.) of a random variable (r.v.), X, is the function F(x) such that
F(x) = Pr(X ≤ x).
That is, F(x) gives the probability that the value of X will be less than or equal to x. Obviously, F(−∞) = 0, F(∞) = 1 and F(x) is monotonic. A useful consequence of this definition is that if F is continuous then F(X) has a uniform distribution on [0, 1]: it takes any value between 0 and 1 with equal probability. This is because
Pr(X ≤ x) = Pr{F(X) ≤ F(x)} = F(x) ⇒ Pr{F(X) ≤ u} = u
(if F is continuous), the latter being the c.d.f. of a uniform r.v. on [0, 1].
Define the inverse of the c.d.f. as F⁻(u) = min{x : F(x) ≥ u}, which is just the usual inverse function of F if F is continuous. F⁻ is often called the quantile function of X. If U has a uniform distribution on [0, 1], then F⁻(U) is distributed as X with c.d.f. F. Given some way of generating uniform random deviates, this provides a method for generating random variables from any distribution with a computable F⁻.
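As a rough illustration of the inversion method just described, the following Python sketch draws exponential deviates by applying a quantile function to uniform deviates. The exponential distribution, the rate of 2, the seed and the function names are all assumptions chosen for the example; none of them comes from the text.

```python
import math
import random


def exponential_quantile(u, rate=1.0):
    """Quantile function of an exponential distribution (illustrative choice).

    Solves F(x) = 1 - exp(-rate * x) = u for x.
    """
    return -math.log(1.0 - u) / rate


def sample_by_inversion(quantile, n, rng):
    """Draw n deviates by applying a quantile function to U(0, 1) deviates."""
    return [quantile(rng.random()) for _ in range(n)]


rng = random.Random(1)  # fixed seed (assumed) so the run is repeatable
draws = sample_by_inversion(
    lambda u: exponential_quantile(u, rate=2.0), 100_000, rng
)

# The mean of an exponential with rate 2 is 0.5, so the sample mean
# should land close to that value.
sample_mean = sum(draws) / len(draws)
print(round(sample_mean, 2))
```

The same `sample_by_inversion` helper works for any distribution whose quantile function can be computed, which is the point made in the paragraph above.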

Index
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 245-250
- Chapter

Appendix C - Random number generation
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 233-240
- Chapter
Summary

Chapter 6, in particular, took it for granted that we can produce random numbers from various distributions. Actually we can't. The best that can be done is to produce a completely deterministic sequence of numbers that appears indistinguishable from a random sequence with respect to any relevant statistical property that we choose to test. In other words, we may be able to produce a deterministic sequence of numbers that can be very well modelled as being a random sequence from some distribution. Such deterministic sequences are referred to as sequences of pseudorandom numbers, but the pseudo part usually gets dropped at some point.
The fundamental problem, for our purposes, is to generate a pseudorandom sequence that can be extremely well modelled as i.i.d. U(0, 1). Given such a sequence, it is fairly straightforward to generate deviates from other distributions, but the i.i.d. U(0, 1) generation is where the problems lie. Indeed if you read around this topic, most books will largely agree about how to turn uniform random deviates into deviates from a huge range of other distributions, but advice on how to obtain the uniform deviates in the first place is much less consistent.
Simple generators and what can go wrong
Since the 1950s there has been much work on linear congruential generators. The intuitive motivation is something like this. Suppose I take an integer, multiply it by some enormous factor, rewrite it in base ‘something huge’, and then throw away everything except for the digits after the decimal point. Pretty hard to predict the result, no? So, if I repeat the operation, feeding each step's output into the input for the next step, a more or less random sequence might result. Formally the pseudorandom sequence is defined by
X_{i+1} = (a X_i + b) mod M,
where b is 0 or 1, in practice. This is started with a seed X0.
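The recurrence can be sketched in a few lines of Python. The multiplier a = 16807 and modulus M = 2^31 − 1 used here are the classic Park-Miller "minimal standard" constants, picked only as a familiar illustration; the excerpt does not name any particular constants, and the function names are hypothetical.

```python
def lcg_sequence(seed, a, b, m, n):
    """Return the first n terms of X_{i+1} = (a * X_i + b) mod m, from X_0 = seed."""
    x = seed
    terms = []
    for _ in range(n):
        x = (a * x + b) % m
        terms.append(x)
    return terms


def lcg_uniform(seed, n, a=16807, b=0, m=2**31 - 1):
    """Scale the integer sequence by m to get approximate U(0, 1) deviates."""
    return [x / m for x in lcg_sequence(seed, a, b, m, n)]


print(lcg_sequence(1, 16807, 0, 2**31 - 1, 3))
# [16807, 282475249, 1622650073]
print(lcg_uniform(1, 3))
```

Because the sequence is fully determined by the seed, the same seed always reproduces the same deviates.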

Contents
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp v-vii
- Chapter

References
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 241-244
- Chapter

2 - Statistical models and inference
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 19-48
- Chapter
Summary

Statistics aims to extract information from data: specifically, information about the system that generated the data. There are two difficulties with this enterprise. First, it may not be easy to infer what we want to know from the data that can be obtained. Second, most data contain a component of random variability: if we were to replicate the data-gathering process several times we would obtain somewhat different data on each occasion. In the face of such variability, how do we ensure that the conclusions drawn from a single set of data are generally valid, and not a misleading reflection of the random peculiarities of that single set of data?
Statistics provides methods for overcoming these difficulties and making sound inferences from inherently random data. For the most part this involves the use of statistical models, which are like ‘mathematical cartoons’ describing how our data might have been generated, if the unknown features of the data-generating system were actually known. So if the unknowns were known, then a decent model could generate data that resembled the observed data, including reproducing its variability under replication. The purpose of statistical inference is then to use the statistical model to go in the reverse direction: to infer the values of the model unknowns that are consistent with observed data.
Mathematically, let y denote a random vector containing the observed data. Let θ denote a vector of parameters of unknown value. We assume that knowing the values of some of these parameters would answer the questions of interest about the system generating y. So a statistical model is a recipe by which y might have been generated, given appropriate values for θ. At a minimum the model specifies how data like y might be simulated, thereby implicitly defining the distribution of y and how it depends on θ. Often it will provide more, by explicitly defining the p.d.f. of y in terms of θ.
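To make the idea of a model as a "recipe" concrete, here is a minimal Python sketch under an assumed model that is not taken from the text: independent normal observations with θ = (mu, sigma). The function `simulate` shows how data like y might be generated for given θ, and `log_pdf` gives the explicit log density of y in terms of θ. Both names are hypothetical.

```python
import math
import random


def simulate(theta, n, rng):
    """Generate n observations from the assumed model y_i ~ N(mu, sigma^2)."""
    mu, sigma = theta
    return [rng.gauss(mu, sigma) for _ in range(n)]


def log_pdf(y, theta):
    """Log of the joint p.d.f. of y under the assumed normal model."""
    mu, sigma = theta
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma**2)
        - (yi - mu) ** 2 / (2.0 * sigma**2)
        for yi in y
    )


rng = random.Random(0)          # fixed seed (assumed) for repeatability
true_theta = (2.0, 1.0)         # illustrative parameter values
y = simulate(true_theta, 200, rng)

# Inference runs in the reverse direction: parameter values consistent
# with the data give a higher log density than values far from them.
print(log_pdf(y, (2.0, 1.0)) > log_pdf(y, (5.0, 1.0)))
```

Comparing the log density at different values of θ is the starting point for the maximum likelihood approach mentioned in the Preface.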

6 - Bayesian computation
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 133-183
- Chapter

5 - Numerical maximum likelihood estimation
Simon N. Wood, University of Bath
Book:

Core Statistics

Published online:

05 April 2015

Print publication:

02 April 2015, pp 89-132
- Chapter

6 - Classification of Data
Jianfeng Yao, The University of Hong Kong, Shurong Zheng, Zhidong Bai
Book:

Large Sample Covariance Matrices and High-Dimensional Data Analysis

Published online:

05 April 2015

Print publication:

26 March 2015, pp 75-104
- Chapter

Frontmatter
Jianfeng Yao, The University of Hong Kong, Shurong Zheng, Zhidong Bai
Book:

Large Sample Covariance Matrices and High-Dimensional Data Analysis

Published online:

05 April 2015

Print publication:

26 March 2015, pp i-iv
- Chapter

1 - Introduction
Jianfeng Yao, The University of Hong Kong, Shurong Zheng, Zhidong Bai
Book:

Large Sample Covariance Matrices and High-Dimensional Data Analysis

Published online:

05 April 2015

Print publication:

26 March 2015, pp 1-6
- Chapter
Summary

Large-Dimensional Data and New Asymptotic Statistics
In a multivariate analysis problem, we are given a sample x1, x2, …, xn of random observations of dimension p. Statistical methods, such as principal component analysis, have been developed since the beginning of the 20th century. When the observations are Gaussian, some nonasymptotic methods exist, such as Student's test, Fisher's test, or the analysis of variance. However, in most applications, observations are non-Gaussian, at least in part, so that nonasymptotic results become hard to obtain and statistical methods are built using limiting theorems on model statistics.
Most of these asymptotic results are derived under the assumption that the data dimension p is fixed while the sample size n tends to infinity (large sample theory). This theory had been adopted by most practitioners until very recently, when they were faced with a new challenge: the analysis of large-dimensional data.
Large-dimensional data appear in various fields for different reasons. In finance, as a consequence of the generalisation of the Internet and electronic commerce supported by the exponentially increasing power of computing, online data from markets around the world are accumulated on a gigabyte basis every day. In genetic experiments, such as micro-arrays, it becomes possible to record the expression of several thousand genes from a single tissue. Table 1.1 displays some typical data dimensions and sample sizes. We can see from this table that the data dimension p is far from the “usual” situations where p is commonly less than 10. We refer to this new type of data as large-dimensional data.
It has been observed for a long time that several well-known methods in multivariate analysis become inefficient or even misleading when the data dimension p is not as small as, say, several tens. A seminal example was provided by Dempster in 1958, when he established the inefficiency of Hotelling's T² in such cases and provided a remedy (named a non-exact test). However, at that time, no statistician was able to discover the fundamental reasons for such a breakdown in the well-established methods.

Appendix A - Curvilinear Integrals
Jianfeng Yao, The University of Hong Kong, Shurong Zheng, Zhidong Bai
Book:

Large Sample Covariance Matrices and High-Dimensional Data Analysis

Published online:

05 April 2015

Print publication:

26 March 2015, pp 291-298
- Chapter

2348 results in Statistical theory and methods