Charles Stein shocked the statistical world in 1955 with his proof that maximum likelihood estimation methods for Gaussian models, in common use for more than a century, were inadmissible beyond simple one- or two-dimensional situations. These methods are still in use, for good reasons, but Stein-type estimators have pointed the way toward a radically different empirical Bayes approach to high-dimensional statistical inference. We will be using empirical Bayes ideas for estimation, testing, and prediction, beginning here with their path-breaking appearance in the James–Stein formulation.
Although the connection was not immediately recognized, Stein's work was half of an energetic post-war empirical Bayes initiative. The other half, explicitly named “empirical Bayes” by its principal developer Herbert Robbins, was less shocking but more general in scope, aiming to show how frequentists could achieve full Bayesian efficiency in large-scale parallel studies. Large-scale parallel studies were rare in the 1950s, however, and Robbins' theory did not have the applied impact of Stein's shrinkage estimators, which are useful in much smaller data sets.
All of this has changed in the 21st century. New scientific technologies, epitomized by the microarray, routinely produce studies of thousands of parallel cases — we will see several such studies in what follows — well suited for the Robbins point of view. That view predominates in the succeeding chapters, though Robbins' methodology is not explicitly invoked until the very last section of the book.
Microarray experiments, through a combination of insufficient data per gene and the difficulties of large-scale simultaneous inference, often yield disappointing results. In search of greater detection power, enrichment analysis considers the combined outcomes of biologically determined sets of genes, for example the set of all the genes in a predefined genetic pathway. If all 20 z-values in a hypothetical pathway were positive, we might assign significance to the pathway's effect, whether or not any of the individual z-values were deemed non-null. We will consider enrichment methods in this chapter, along with some of the underlying theory, which of course applies just as well to similar situations outside the microarray context.
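To see why consistent signs alone can be striking, the sketch below works out the chance that all 20 z-values in a pathway would be positive if each were independently positive or negative with probability 1/2. This is only a back-of-the-envelope illustration under a strong independence assumption, not the enrichment statistic developed later in the chapter.

```python
from scipy.stats import binom

# Hypothetical pathway of m = 20 genes, all observed with positive z-values.
m = 20
k_positive = 20

# Under a symmetric null, each z-value is positive with probability 1/2,
# independently across genes (a strong assumption; real z-values are correlated).
p_all_positive = binom.sf(k_positive - 1, m, 0.5)  # P(X >= 20) for X ~ Bin(20, 1/2)
print(f"P(all {m} z-values positive under the null) = {p_all_positive:.2e}")
# roughly 9.5e-07: striking for the pathway even if no single z-value is significant
```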
Our main example concerns the p53 data, partially illustrated in Figure 9.1; p53 is a transcription factor, that is, a gene that controls the activity of other genes. Mutations in p53 have been implicated in cancer development. A National Cancer Institute microarray study compared 33 mutated cell lines with 17 in which p53 was unmutated. There were N = 10,100 gene expressions measured for each cell line, yielding a 10,100 × 50 matrix X of expression measurements. Z-values based on two-sample t-tests were computed for each gene, as in (2.1)–(2.5), comparing mutated with unmutated cell lines. Figure 9.1 displays the 10,100 z-values.
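The sketch below shows the kind of per-gene calculation involved, run on simulated data standing in for the 10,100 × 50 expression matrix X. The 33/17 column split follows the description above; the conversion of each two-sample t-statistic to a z-value through the t and normal CDFs is one standard convention, offered in the spirit of (2.1)–(2.5) rather than as a transcription of them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, n1, n2 = 10_100, 33, 17          # genes; mutated and unmutated cell lines
X = rng.normal(size=(N, n1 + n2))   # simulated stand-in for the expression matrix

mut, unmut = X[:, :n1], X[:, n1:]

# Two-sample t-statistic for each gene (each row), equal-variance version.
t, _ = stats.ttest_ind(mut, unmut, axis=1)

# Convert each t-statistic to a z-value: push it through the t CDF with
# n1 + n2 - 2 degrees of freedom, then through the inverse normal CDF.
df = n1 + n2 - 2
z = stats.norm.ppf(stats.t.cdf(t, df))

print(z[:5])                        # a few of the N z-values
```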
At the risk of drastic oversimplification, the history of statistics as a recognized discipline can be divided into three eras:
• The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
• The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple — Is treatment A better than treatment B? — but the new methods were suited to the kinds of small data sets individual scientists might collect.
• The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind.
In classical significance testing, the null distribution plays the role of devil's advocate: a standard that the observed data must exceed in order to convince the scientific world that something interesting has occurred. We observe, say, z = 2, and note that in a hypothetical “long run” of observations from a N(0, 1) distribution less than 2.5% of the draws would exceed 2, thereby discrediting the uninteresting null distribution as an explanation.
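For concreteness, the tail probability quoted above is just the standard normal survival function evaluated at z = 2; the short check below computes it directly and involves nothing specific to the book's methods.

```python
from scipy.stats import norm

z = 2.0
p_one_sided = norm.sf(z)     # P(Z > 2) for Z ~ N(0, 1)
print(f"{p_one_sided:.4f}")  # about 0.0228, i.e. less than 2.5%
```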
Considerable effort has been expended trying to maintain the classical model in large-scale testing situations, as seen in Chapter 3, but there are important differences that affect the role of the null distribution when the number of cases N is large:
• With N = 10,000, for example, the statistician has his or her own “long run” in hand. This diminishes the importance of theoretical null calculations based on mathematical models. In particular, it may become clear that the classical null distribution appropriate for a single-test application is in fact wrong for the current situation (see the sketch after this list).
• Scientific applications of single-test theory most often suppose, or hope for, rejection of the null hypothesis, perhaps with power = 0.80. Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis, leaving only a small number of interesting prospects for more intensive investigation.
• Sharp null hypotheses, such as H0 : μ = 0 for z ∼ N(μ, 1), are less important in large-scale studies. […]
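The point of the first bullet can be made concrete with a small simulation: with thousands of z-values in hand, even a crude robust fit to the central bulk of the histogram can reveal that the theoretical N(0, 1) null is off. The median/MAD fit below is only an illustrative stand-in, not the empirical-null estimates developed in the book, and the simulated data (a slightly shifted, overdispersed null plus a few hundred interesting cases) is an assumption made purely for the example.

```python
import numpy as np

def crude_empirical_null(z):
    """Rough location/scale estimate for the central bulk of the z-values,
    standing in for the more careful empirical-null fits discussed in the book."""
    center = np.median(z)
    # 1.4826 * MAD approximates the standard deviation for Gaussian data
    scale = 1.4826 * np.median(np.abs(z - center))
    return center, scale

# Simulated example: 10,000 cases, most of them "null" but drawn from
# N(0.1, 1.2^2) rather than N(0, 1), plus 300 genuinely interesting cases.
rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(0.1, 1.2, 9_700), rng.normal(3.0, 1.0, 300)])

center, scale = crude_empirical_null(z)
print(f"theoretical null: N(0, 1);  crude empirical fit: N({center:.2f}, {scale:.2f}^2)")
```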