Statistical analysis of interesting datasets is conducted using computers, and various specialised computer programs are available to facilitate statistical work. For using general statistical theory directly with custom-built models, R is probably the most usefully flexible of these programs.
R (R Core Team, 2012) is a programming language and environment designed for statistical analysis. It is free (see http://cran.r-project.org to obtain a copy) and is written and maintained by a community of statisticians. A major design feature is extensibility: R makes it very straightforward to code up statistical methods in a way that is easy to distribute and for others to use. The first place to look for information on getting started with R is http://cran.r-project.org/manuals.html. I will assume that you have installed R, can start it to obtain a command console, and have at least discovered the function q() for quitting R.
The following web resources provide excellent guides to the R language at different levels.
• http://cran.r-project.org/doc/contrib/Short-refcard.pdf is a four-page summary of key functions and functionality.
• http://cran.r-project.org/doc/contrib/R_language.pdf is a very concise introduction to and reference for the structure of the language.
• http://cran.r-project.org/doc/manuals/R-lang.html is the main reference manual for the language.
A huge amount of statistical functionality is built into R and its extension packages, but the aim of this chapter is simply to give a brief overview of R as a statistical programming language.
Basic structure of R
When you start R (interactively) two important things are created: a command prompt at which to type commands telling R what to do, and an environment, known interchangeably as the ‘global environment’ or ‘user workspace’, to hold the objects created by your commands. Unlike the command prompt, you do not see the global environment directly, but it is there as an extendible chunk of computer memory for holding your data, commands and other objects.
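As a minimal illustration (the object names here are arbitrary), commands typed at the prompt create objects in the global environment, which can then be listed or removed:

```r
a <- 1:5      # create an integer vector, bound to the symbol 'a'
b <- mean(a)  # create a scalar holding the mean of a

ls()          # list the symbols in the workspace, including "a" and "b"
rm(b)         # remove the object b from the workspace
```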
Generically in R an ‘environment’ consists of two things. The first, known in R jargon as a frame, is a set of symbols used to refer to objects, along with the data defining those objects. The second is a pointer to an enclosing environment, in which R will look for any symbol not found in the frame itself.
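Environments other than the global one can be created and manipulated directly, which gives a concrete picture of the frame. A short sketch:

```r
e <- new.env()              # a new environment, with its own (empty) frame
assign("x", 42, envir = e)  # bind the symbol x to the value 42 in e's frame
ls(envir = e)               # "x" -- the symbols defined in e's frame
get("x", envir = e)         # 42 -- look the symbol x up in e
```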
Statistics aims to extract information from data: specifically, information about the system that generated the data. There are two difficulties with this enterprise. First, it may not be easy to infer what we want to know from the data that can be obtained. Second, most data contain a component of random variability: if we were to replicate the data-gathering process several times we would obtain somewhat different data on each occasion. In the face of such variability, how do we ensure that the conclusions drawn from a single set of data are generally valid, and not a misleading reflection of the random peculiarities of that single set of data?
Statistics provides methods for overcoming these difficulties and making sound inferences from inherently random data. For the most part this involves the use of statistical models: ‘mathematical cartoons’ describing how our data might have been generated, if the unknown features of the data-generating system were actually known. That is, if the unknowns were known, a decent model could generate data that resembled the observed data, including reproducing its variability under replication. The purpose of statistical inference is then to use the statistical model to go in the reverse direction: to infer the values of the model unknowns that are consistent with the observed data.
Mathematically, let y denote a random vector containing the observed data. Let θ denote a vector of parameters of unknown value. We assume that knowing the values of some of these parameters would answer the questions of interest about the system generating y. So a statistical model is a recipe by which y might have been generated, given appropriate values for θ. At a minimum the model specifies how data like y might be simulated, thereby implicitly defining the distribution of y and how it depends on θ. Often it will provide more, by explicitly defining the p.d.f. of y in terms of θ.
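As a sketch of this setup, consider the toy model in which the elements of y are independent N(μ, σ²) random variables, so that θ = (μ, σ); the parameter values below are made up purely for illustration:

```r
## Toy model: y[i] ~ N(mu, sigma^2) i.i.d., with theta = (mu, sigma).
theta <- c(mu = 2, sigma = 0.5)  # hypothetical 'known' parameter values
n <- 100                         # number of observations

## The model specifies how data like y might be simulated...
y <- rnorm(n, mean = theta["mu"], sd = theta["sigma"])

## ...and here it also defines the p.d.f. of y explicitly:
f.y <- dnorm(y, mean = theta["mu"], sd = theta["sigma"])
```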
This book is aimed at the numerate reader who has probably taken an introductory statistics and probability course at some stage and would like a brief introduction to the core methods of statistics and how they are applied, not necessarily in the context of standard models. The first chapter is a brief review of some basic probability theory needed for what follows. Chapter 2 discusses statistical models and the questions addressed by statistical inference and introduces the maximum likelihood and Bayesian approaches to answering them. Chapter 3 is a short overview of the R programming language. Chapter 4 provides a concise coverage of the large sample theory of maximum likelihood estimation, and Chapter 5 discusses the numerical methods required to use this theory. Chapter 6 covers the numerical methods useful for Bayesian computation, in particular Markov chain Monte Carlo. Chapter 7 provides a brief tour of the theory and practice of linear modelling. Appendices then cover some useful information on common distributions, matrix computation and random number generation.
The book is neither an encyclopedia nor a cookbook, and the bibliography aims to provide a compact list of the most useful sources for further reading, rather than being extensive. The aim is to offer a concise coverage of the core knowledge needed to understand and use parametric statistical methods and to build new methods for analysing data. Modern statistics exists at the interface between computation and theory, and this book reflects that fact.
I am grateful to Nicole Augustin, Finn Lindgren, the editors at Cambridge University Press and the students on the Bath course ‘Applied Statistical Inference’ and the Academy for PhD Training in Statistics course ‘Statistical Computing’ for many useful comments.
Random variables
Statistics is about extracting information from data that contain an inherently unpredictable component. Random variables are the mathematical construct used to build models of such variability. A random variable takes a different value, at random, each time it is observed. We cannot say, in advance, exactly what value will be taken, but we can make probability statements about the values likely to occur. That is, we can characterise the distribution of values taken by a random variable. This chapter briefly reviews the technical constructs used for working with random variables, as well as a number of generally useful related results. See De Groot and Schervish (2002) or Grimmett and Stirzaker (2001) for fuller introductions.
Cumulative distribution functions
The cumulative distribution function (c.d.f.) of a random variable (r.v.), X, is the function F(x) such that
F(x) = Pr(X ≤ x).
That is, F(x) gives the probability that the value of X will be less than or equal to x. Obviously, F(−∞) = 0, F(∞) = 1 and F(x) is monotonically non-decreasing. A useful consequence of this definition is that if F is continuous then F(X) has a uniform distribution on [0, 1]: it takes any value between 0 and 1 with equal probability. This is because
Pr(X ≤ x) = Pr{F(X) ≤ F(x)} = F(x) ⇒ Pr{F(X) ≤ u} = u
(if F is continuous), the latter being the c.d.f. of a uniform r.v. on [0, 1].
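This result is easy to check by simulation. Taking X to be standard normal, so that F is the built-in function pnorm:

```r
set.seed(1)        # make the simulation repeatable
x <- rnorm(10000)  # X ~ N(0, 1), which has continuous c.d.f. pnorm
u <- pnorm(x)      # F(X): should be (approximately) U(0, 1)
hist(u)            # histogram is roughly flat over [0, 1]
```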
Define the inverse of the c.d.f. as F⁻(u) = min{x : F(x) ≥ u}, which is just the usual inverse function of F if F is continuous. F⁻ is often called the quantile function of X. If U has a uniform distribution on [0, 1], then F⁻(U) is distributed as X with c.d.f. F. Given some way of generating uniform random deviates, this provides a method for generating random variables from any distribution with a computable F⁻.
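For example, the exponential distribution with rate 1 has F(x) = 1 − e⁻ˣ, so F⁻(u) = −log(1 − u), and exponential deviates can be generated by inversion; R's built-in quantile function qexp computes the same thing:

```r
set.seed(2)
u <- runif(1000)   # uniform deviates on [0, 1]
x <- -log(1 - u)   # F^-(U): exponential deviates by inversion
x2 <- qexp(u)      # the built-in quantile function gives the same deviates
```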
Chapter 6, in particular, took it for granted that we can produce random numbers from various distributions. Actually we can't. The best that can be done is to produce a completely deterministic sequence of numbers that appears indistinguishable from a random sequence with respect to any relevant statistical property that we choose to test. In other words, we may be able to produce a deterministic sequence of numbers that can be very well modelled as being a random sequence from some distribution. Such deterministic sequences are referred to as sequences of pseudorandom numbers, but the pseudo part usually gets dropped at some point.
The fundamental problem, for our purposes, is to generate a pseudorandom sequence that can be extremely well modelled as i.i.d. U(0, 1). Given such a sequence, it is fairly straightforward to generate deviates from other distributions, but the i.i.d. U(0, 1) generation is where the problems lie. Indeed if you read around this topic, most books will largely agree about how to turn uniform random deviates into deviates from a huge range of other distributions, but advice on how to obtain the uniform deviates in the first place is much less consistent.
Simple generators and what can go wrong
Since the 1950s there has been much work on linear congruential generators. The intuitive motivation is something like this. Suppose I take an integer, multiply it by some enormous factor, rewrite it in base ‘something huge’, and then throw away everything except for the digits after the decimal point. Pretty hard to predict the result, no? So, if I repeat the operation, feeding each step's output into the input for the next step, a more or less random sequence might result. Formally the pseudorandom sequence is defined by
X_{i+1} = (aX_i + b) mod M,
where a and M are carefully chosen integer constants, while b is 0 or 1 in practice. This is started with a seed X_0.
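For illustration, the following sketch implements such a generator in R, using the constants of the infamous RANDU generator (a = 65539, b = 0, M = 2³¹), whose serious deficiencies make it a classic example of what can go wrong:

```r
## Linear congruential generator X[i+1] = (a*X[i] + b) mod M, rescaled
## to [0, 1). RANDU's constants are the defaults here purely as a
## cautionary example -- do not use this generator in practice.
lcg <- function(n, seed = 1, a = 65539, b = 0, M = 2^31) {
  X <- numeric(n)
  X[1] <- (a * seed + b) %% M
  for (i in seq_len(n - 1)) X[i + 1] <- (a * X[i] + b) %% M
  X / M  # map integers in [0, M) to deviates in [0, 1)
}
u <- lcg(1000)  # looks plausibly uniform in one dimension...
```

All the arithmetic here stays below 2⁵³, so it is exact in R's double-precision numbers; the weaknesses of such generators show up not in one-dimensional histograms but in the structure of consecutive triples of deviates.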