Applied statistics is an inherently conservative enterprise, and appropriately so since the scientific world depends heavily on the consistent evaluation of evidence. Conservative consistency is raised to its highest level in classical significance testing, where the control of Type I error is enforced with an almost religious intensity. A p-value of 0.06 rather than 0.04 has decided the fate of entire pharmaceutical companies. Fisher's scale of evidence, Table 3.1, particularly the α = 0.05 threshold, has been used in literally millions of serious scientific studies, and stakes a good claim to being the 20th century's most influential piece of applied mathematics.
All of this makes it more than a little surprising that a powerful rival to Type I error control has emerged in the large-scale testing literature. Since its debut in Benjamini and Hochberg's seminal 1995 paper, false discovery rate control has claimed an increasing portion of statistical research, both applied and theoretical, and seems to have achieved “accepted methodology” status in scientific subject-matter journals.
False discovery rate control moves us away from the significance-testing algorithms of Chapter 3, back toward the empirical Bayes context of Chapter 2. The language of classical testing is often used to describe FDR methods (perhaps in this way assisting their stealthy infiltration of multiple testing practice), but, as the discussion here is intended to show, both their rationale and results are quite different.
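For concreteness, here is a minimal sketch of the Benjamini–Hochberg step-up rule introduced in that 1995 paper; the function name and the level q = 0.1 are illustrative choices, not taken from the text.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """BH step-up rule: reject the hypotheses with the k smallest p-values,
    where k is the largest rank i (1-based) with p_(i) <= q * i / N.
    This controls the false discovery rate at level q under independence
    (and certain forms of positive dependence)."""
    pvals = np.asarray(pvals)
    N = len(pvals)
    order = np.argsort(pvals)                    # indices of the p-values, smallest first
    sorted_p = pvals[order]
    thresholds = q * np.arange(1, N + 1) / N
    passing = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(N, dtype=bool)
    if passing.size > 0:
        k = passing.max()                        # largest rank meeting its threshold
        reject[order[:k + 1]] = True             # reject everything up to that rank
    return reject
```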
Simultaneous hypothesis testing was a lively topic in the early 1960s, my graduate student years, and had been so since the end of World War II. Rupert Miller's book Simultaneous Statistical Inference appeared in 1966, providing a beautifully lucid summary of the contemporary methodology. A second edition in 1981 recorded only modest gains during the intervening years. This was a respite, not an end: a new burst of innovation in the late 1980s generated important techniques that we will be revisiting in this chapter.
Miller's book, which gives a balanced picture of the theory of that time, has three notable features:
1. It is overwhelmingly frequentist.
2. It is focused on control of α, the overall Type I error rate of a procedure.
3. It is aimed at multiple testing situations where the number of individual cases N is between 2 and, say, 10.
We have now entered a scientific age in which N = 10 000 is no cause for raised eyebrows. It is impressive (or worrisome) that the theory of the 1980s continues to play a central role in microarray-era statistical inference. Features 1 and 2 are still the norm in much of the multiple testing literature, despite the obsolescence of Feature 3. This chapter reviews part of that theory, particularly the ingenious algorithms that have been devised to control the overall Type I error rate (also known as FWER, the family-wise error rate).
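Two standard members of that family are easy to state; the sketch below of Bonferroni's single-step bound and Holm's step-down refinement is a generic illustration, not necessarily the specific procedures the chapter goes on to emphasize.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i whenever p_i <= alpha / N; FWER <= alpha under any dependence."""
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def holm(pvals, alpha=0.05):
    """Holm's step-down rule: compare the i-th smallest p-value (1-based i)
    to alpha / (N - i + 1) and stop at the first failure; also keeps
    FWER <= alpha while rejecting at least as much as Bonferroni."""
    pvals = np.asarray(pvals)
    N = len(pvals)
    reject = np.zeros(N, dtype=bool)
    for i, idx in enumerate(np.argsort(pvals)):  # i = 0, ..., N-1 over sorted p-values
        if pvals[idx] <= alpha / (N - i):
            reject[idx] = True
        else:
            break                                # no further rejections once one fails
    return reject
```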
Charles Stein shocked the statistical world in 1955 with his proof that maximum likelihood estimation methods for Gaussian models, in common use for more than a century, were inadmissible beyond simple one- or two-dimensional situations. These methods are still in use, for good reasons, but Stein-type estimators have pointed the way toward a radically different empirical Bayes approach to high-dimensional statistical inference. We will be using empirical Bayes ideas for estimation, testing, and prediction, beginning here with their path-breaking appearance in the James–Stein formulation.
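In its simplest form, for z ∼ N_N(μ, I) with N ≥ 3, the James–Stein rule shrinks the maximum likelihood estimate μ̂ = z toward the origin, μ̂_JS = (1 − (N − 2)/‖z‖²) z, and has smaller total squared-error risk than the MLE for every value of μ; the notation here is generic, not taken from the text's own development.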
Although the connection was not immediately recognized, Stein's work was half of an energetic post-war empirical Bayes initiative. The other half, explicitly named “empirical Bayes” by its principal developer Herbert Robbins, was less shocking but more general in scope, aiming to show how frequentists could achieve full Bayesian efficiency in large-scale parallel studies. Large-scale parallel studies were rare in the 1950s, however, and Robbins' theory did not have the applied impact of Stein's shrinkage estimators, which are useful in much smaller data sets.
All of this has changed in the 21st century. New scientific technologies, epitomized by the microarray, routinely produce studies of thousands of parallel cases — we will see several such studies in what follows — well-suited for the Robbins point of view. That view predominates in the succeeding chapters, though Robbins' methodology is not explicitly invoked until the very last section of the book.
Microarray experiments, through a combination of insufficient data per gene and the difficulties of large-scale simultaneous inference, often yield disappointing results. In search of greater detection power, enrichment analysis considers the combined outcomes of biologically determined sets of genes, for example the set of all the genes in a predefined genetic pathway. If all 20 z-values in a hypothetical pathway were positive, we might assign significance to the pathway's effect, whether or not any of the individual zi were deemed non-null. We will consider enrichment methods in this chapter, and some of the theory, which of course applies just as well to similar situations outside the microarray context.
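As a rough illustration of the idea (and not the particular enrichment statistics developed in the chapter), a gene set can be scored by the standardized mean of its z-values and calibrated against randomly drawn sets of the same size; the function names and the randomization null below are assumptions made for this sketch.

```python
import numpy as np

def set_score(z, gene_set):
    """Standardized mean of the z-values in a gene set: if membership were
    unrelated to effect size, the mean of m randomly chosen z-values would
    have standard deviation roughly sd(z) / sqrt(m)."""
    z = np.asarray(z)
    zs = z[np.asarray(gene_set)]
    return np.sqrt(len(zs)) * zs.mean() / z.std()

def enrichment_pvalue(z, gene_set, n_perm=2000, seed=0):
    """Randomization p-value: compare the observed (absolute) set score with
    the scores of randomly drawn gene sets of the same size."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    obs = abs(set_score(z, gene_set))
    m = len(gene_set)
    perm = np.array([abs(set_score(z, rng.choice(len(z), size=m, replace=False)))
                     for _ in range(n_perm)])
    return (1 + np.sum(perm >= obs)) / (n_perm + 1)
```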
Our main example concerns the p53 data, partially illustrated in Figure 9.1; p53 is a transcription factor, that is, a gene that controls the activity of other genes. Mutations in p53 have been implicated in cancer development. A National Cancer Institute microarray study compared 33 mutated cell lines with 17 in which p53 was unmutated. There were N = 10 100 gene expressions measured for each cell line, yielding a 10 100 × 50 matrix X of expression measurements. Z-values based on two-sample t-tests were computed for each gene, as in (2.1)–(2.5), comparing mutated with unmutated cell lines. Figure 9.1 displays the 10 100 zi values.
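A sketch of that computation, assuming the columns of X are arranged with the 33 mutated cell lines first (an assumption about the layout, not stated above) and using the transformation z_i = Φ⁻¹(F₄₈(t_i)) that the pattern of (2.1)–(2.5) suggests; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def p53_z_values(X, n_mut=33, n_unmut=17):
    """Per-gene two-sample t-statistics (mutated vs. unmutated columns of X),
    mapped to z-values by z_i = Phi^{-1}(F_df(t_i)), df = n_mut + n_unmut - 2 = 48,
    so that z_i ~ N(0, 1) under the null hypothesis of no expression difference."""
    t_stat, _ = stats.ttest_ind(X[:, :n_mut], X[:, n_mut:], axis=1)  # equal-variance t-test per row
    df = n_mut + n_unmut - 2
    return stats.norm.ppf(stats.t.cdf(t_stat, df))
```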
At the risk of drastic oversimplification, the history of statistics as a recognized discipline can be divided into three eras:
The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple — Is treatment A better than treatment B? — but the new methods were suited to the kinds of small data sets individual scientists might collect.
The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind.
In classical significance testing, the null distribution plays the role of devil's advocate: a standard that the observed data must exceed in order to convince the scientific world that something interesting has occurred. We observe, say, z = 2, and note that in a hypothetical “long run” of observations from a N(0, 1) distribution less than 2.5% of the draws would exceed 2, thereby discrediting the uninteresting null distribution as an explanation.
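The 2.5% figure is just the standard normal tail area beyond 2; a quick check in Python with scipy, purely for illustration:

```python
from scipy.stats import norm

print(norm.sf(2))   # about 0.0228, so fewer than 2.5% of N(0, 1) draws exceed 2
```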
Considerable effort has been expended trying to maintain the classical model in large-scale testing situations, as seen in Chapter 3, but there are important differences that affect the role of the null distribution when the number of cases N is large:
• With N = 10 000, for example, the statistician has his or her own “long run” in hand. This diminishes the importance of theoretical null calculations based on mathematical models. In particular, it may become clear that the classical null distribution appropriate for a single-test application is in fact wrong for the current situation.
• Scientific applications of single-test theory most often suppose, or hope for, rejection of the null hypothesis, perhaps with power = 0.80. Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis, leaving only a small number of interesting prospects for more intensive investigation.
• Sharp null hypotheses, such as H0 : μ = 0 for z ∼ N(μ, 1), are less important in large-scale studies. […]