Chapter 5 highlights the rich dependence structures that can be captured via Bayesian nonparametric models using multistage (nonparametric) hierarchies, illustrated graphically in Figure 5.2. The applications they present are impressive, and we can see real operational benefits provided by the careful specification of additional layers of dependence. In this companion chapter we examine in detail some of the related key issues, including computational challenges and the use of de Finetti's representation theorem.
Introduction
Hierarchical models have played a central role in Bayesian inference since the time of Good (1965), with Lindley and Smith (1972) marking a key milestone by providing the first comprehensive treatment of hierarchical priors for the parametric Bayes linear model. Bayesian hierarchies have proved so popular because they provide a natural framework for "borrowing of strength" (a term apparently due to Tukey), that is, for sharing partial information across components through the hierarchical structure. The Bayesian construction also provides a clear distinction from frequentist models in that the dependence structures need not be directly related to population random effects.
In this chapter we look a little more closely at a couple of key issues emerging from the work of Teh and Jordan. First, we step back and consider the fundamental role played by de Finetti's representation theorem, reiterating the operational focus of Bayesian modeling on finite-dimensional specification of joint distributions on observables.
Bayesian nonparametric inference is a relatively young area of research that has recently undergone strong development. Most of its success can be explained by the considerable degree of flexibility it ensures in statistical modeling, compared with parametric alternatives, and by the emergence of new and efficient simulation techniques that make nonparametric models amenable to concrete use in a number of applied statistical problems. This rapid growth is witnessed by several review articles and monographs providing interesting and accurate accounts of the state of the art in Bayesian nonparametrics. Among them we mention the discussion paper by Walker, Damien, Laud and Smith (1999), the book by Ghosh and Ramamoorthi (2003), the lecture notes by Regazzini (2001) and the review articles by Hjort (2003) and Müller and Quintana (2004). Here we wish to provide an update to all these excellent works. In particular, we focus on classes of nonparametric priors that go beyond the Dirichlet process.
Introduction
The Dirichlet process has been a cornerstone of Bayesian nonparametrics since the seminal paper by T. S. Ferguson appeared in the Annals of Statistics in 1973. Its success can be partly explained by its mathematical tractability, and it has grown tremendously with the development of Markov chain Monte Carlo (MCMC) techniques, whose implementation allows a full Bayesian analysis of complex statistical models based on the Dirichlet process prior.
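One standard way to make the Dirichlet process concrete is Sethuraman's stick-breaking construction. The following truncated sketch (the function name, truncation level, and base measure below are illustrative choices, not taken from the chapter) draws an approximate realization of DP(α, G0):

```python
import numpy as np

def dp_stick_breaking(alpha, base_sampler, n_atoms=1000, rng=None):
    """Truncated stick-breaking draw from a Dirichlet process DP(alpha, G0).

    Returns atom locations and weights; for a large truncation level
    n_atoms, the weights sum to approximately 1.
    """
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=n_atoms)           # stick proportions beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * remaining                          # w_k = beta_k * prod_{j<k} (1 - beta_j)
    atoms = base_sampler(n_atoms, rng)                   # i.i.d. draws from the base measure G0
    return atoms, weights

# Example: base measure G0 = N(0, 1), concentration alpha = 2.0
atoms, weights = dp_stick_breaking(2.0, lambda n, rng: rng.normal(0.0, 1.0, n), rng=0)
```

A draw from the DP is thus a discrete random probability measure placing weight `weights[k]` on `atoms[k]`, which is precisely what makes ties among samples from it occur with positive probability.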
It is now possible to demonstrate many successful applications of Bayesian nonparametric methods: in short, it works. It is clear, however, that nonparametric methods are more complicated to understand, use and derive conclusions from than their parametric counterparts. For this reason it is imperative to provide specific and comprehensive motivation for using nonparametric methods. This chapter aims to do this, and the discussion here is restricted to the case of independent and identically distributed (i.i.d.) observations. Although this type of observation is quite specific, the arguments and ideas laid out in this chapter can be extended to cover more complicated types of observation. The advantage of discussing i.i.d. observations is that the mathematics is simplified.
Introduction
Even though there is no physical connection between observations, there is a real and obvious reason for creating a dependence between them from a modeling perspective. The first observation, say X1, provides information about the unknown density f from which it came, which in turn provides information about the second observation X2, and so on. How a Bayesian learns is her choice but it is clear that with i.i.d. observations the order of learning should not matter and hence we enter the realms of exchangeable learning models. The mathematics is by now well known (de Finetti, 1937; Hewitt and Savage, 1955) and involves the construction of a prior distribution Π(d f) on a suitable space of density functions.
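In symbols, de Finetti's theorem states that the joint law of an exchangeable sequence is a mixture of i.i.d. laws; in density form, with the prior Π as above,

```latex
p(x_1, \dots, x_n) \;=\; \int \prod_{i=1}^{n} f(x_i) \, \Pi(df),
```

so that, conditionally on the random density f, the observations are i.i.d. from f, and the order in which they are learned from indeed does not matter.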
This introduction explains why you are right to be curious about Bayesian nonparametrics – why you may actually need it and how you can manage to understand it and use it. We also give an overview of the aims and contents of this book and how it came into existence, delve briefly into the history of the still relatively young field of Bayesian nonparametrics, and offer some concluding remarks about challenges and likely future developments in the area.
Bayesian nonparametrics
As modern statistics has developed in recent decades various dichotomies, where pairs of approaches are somehow contrasted, have become less sharp than they appeared to be in the past. That some border lines appear more blurred than a generation or two ago is also evident for the contrasting pairs “parametric versus nonparametric” and “frequentist versus Bayes.” It appears to follow that “Bayesian nonparametrics” cannot be a very well-defined body of methods.
What is it all about?
It is nevertheless an interesting exercise to delineate the regions of statistical methodology and practice implied by constructing a two-by-two table of sorts, via the two “factors” parametric–nonparametric and frequentist–Bayes; Bayesian nonparametrics would then be whatever is not found inside the other three categories.
(i) Frequentist parametrics encompasses the core of classical statistics, involving methods associated primarily with maximum likelihood, developed in the 1920s and onwards. Such methods relate to various optimum tests, with calculation of p-values, optimal estimators, confidence intervals, multiple comparisons, and so forth.
This chapter provides a brief review and motivation for the use of nonparametric Bayes methods in biostatistical applications. The nonparametric Bayes biostatistical literature is vast and growing, and it is not possible to present properly, or even mention, most of the approaches that have been proposed. Instead, the focus here is entirely on methods utilizing random probability measures, with the emphasis on a few approaches that seem particularly useful in addressing the considerable challenges faced in modern biostatistical research. In addition, the emphasis will be entirely on practical, applications-motivated considerations, with the goal of moving the reader towards implementing related approaches for their own data. Readers interested in the theoretical motivation, which is certainly a fascinating area in itself, are referred to the cited papers and to Chapters 1–4.
Introduction
Biomedical research has clearly evolved at a dramatic rate in the past decade, with improvements in technology leading to a fundamental shift in the way in which data are collected and analyzed. Before this paradigm shift, studies were most commonly designed to be simple and to focus on relationships among a few variables of primary interest. For example, in a clinical trial, patients may be randomized to receive either the drug or placebo, with the analysis focusing on a comparison of means between the two groups. However, with emerging biotechnology tools, scientists are increasingly interested in studying how patients vary in their response to drug therapies, and what factors predict this variability.
In this companion to Chapter 7 we discuss and extend some of the models and inference approaches introduced there. We elaborate on the discussion of random partition priors implied by the Dirichlet process, review some additional variations of dependent Dirichlet process models, and revisit in more detail the Pólya tree prior used briefly in Chapter 7. Finally, we consider variations of Dirichlet process models for data formats beyond continuous responses.
Introduction
In Chapter 7, Dunson introduced many interesting applications of nonparametric priors for inference in biomedical problems. The focus of the discussion was on Dirichlet process (DP) priors and variations. While the DP prior defines a probability model for a (discrete) random probability distribution G, the primary objective in many recent applications is not inference on G itself. Instead, many applications of the DP prior exploit the random partition induced by the Pólya urn scheme, that is, the configuration of ties among random draws from a discrete measure with a DP prior. When the emphasis is on inference for the clustering, it is helpful to recognize the DP as a special case of more general clustering models. In particular we will review product partition models (PPM) and species sampling models (SSM). We discuss these models in Section 8.2. A definition and discussion of the SSM as a random probability measure also appears in Section 3.3.4. Another useful characterization of the DP is as a special case of the Pólya tree (PT) prior.
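The random partition implied by the Pólya urn scheme can be simulated directly through its Chinese restaurant process representation. The following sketch (the function name and defaults are our own) generates the configuration of ties among n sequential draws:

```python
import numpy as np

def chinese_restaurant_process(n, alpha, rng=None):
    """Sample a random partition of {0, ..., n-1} from the CRP(alpha):
    the partition induced by ties among n draws from a discrete random
    measure with a DP(alpha, G0) prior, via the Polya urn scheme."""
    rng = np.random.default_rng(rng)
    tables = []  # tables[k] = number of customers at table (cluster) k
    labels = []  # labels[i] = cluster assignment of customer i
    for i in range(n):
        # customer i joins existing table k with probability tables[k] / (i + alpha),
        # and opens a new table with probability alpha / (i + alpha)
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)   # new cluster
        else:
            tables[k] += 1     # tie with an earlier draw
        labels.append(k)
    return labels, tables

labels, tables = chinese_restaurant_process(100, alpha=1.0, rng=0)
```

The "rich get richer" weighting of occupied tables is exactly what concentrates the partition on relatively few clusters, which is the feature the clustering applications discussed above exploit.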
Here we review the role of the Dirichlet process and related prior distributions in nonparametric Bayesian inference. We discuss the construction and various properties of the Dirichlet process. We then review the asymptotic properties of posterior distributions. Starting with the definition of posterior consistency and examples of inconsistency, we discuss general theorems which lead to consistency. We then describe the method of calculating posterior convergence rates and briefly outline how such rates can be computed in nonparametric examples. We also discuss the issue of posterior rate adaptation, Bayes factor consistency in model selection and Bernshteĭn–von Mises type theorems for nonparametric problems.
Introduction
Making inferences from observed data requires modeling the data-generating mechanism. Often, owing to a lack of clear knowledge about the data-generating mechanism, we can only make very general assumptions, leaving a large portion of the mechanism unspecified, in the sense that the distribution of the data is not specified by a finite number of parameters. Such nonparametric models guard against possible gross misspecification of the data-generating mechanism, and are quite popular, especially when adequate amounts of data can be collected. In such cases, the parameters can be best described by functions, or some infinite-dimensional objects, which assume the role of parameters. Examples of such infinite-dimensional parameters include the cumulative distribution function (c.d.f.), density function, nonparametric regression function, spectral density of a time series, unknown link function in a generalized linear model, transition density of a Markov chain and so on.
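As a point of reference for the asymptotic results reviewed in this chapter, posterior consistency at a true density f_0 is commonly formalized as follows (here d is a suitable metric on densities, for example the Hellinger distance; the notation is generic rather than chapter-specific):

```latex
\Pi\bigl( f : d(f, f_0) > \varepsilon \mid X_1, \dots, X_n \bigr) \;\longrightarrow\; 0
\quad \text{almost surely under } f_0, \text{ for every } \varepsilon > 0,
```

that is, the posterior asymptotically concentrates all its mass on arbitrarily small neighborhoods of the true data-generating density.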
Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. In this review we discuss the role of hierarchical modeling in Bayesian nonparametrics, focusing on models in which the infinite-dimensional parameters are treated hierarchically. For example, we consider a model in which the base measure for a Dirichlet process is itself treated as a draw from another Dirichlet process. This yields a natural recursion that we refer to as a hierarchical Dirichlet process. We also discuss hierarchies based on the Pitman–Yor process and on completely random processes. We demonstrate the value of these hierarchical constructions in a wide range of practical applications, in problems in computational biology, computer vision and natural language processing.
Introduction
Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. A common motif in hierarchical modeling is that of the conditionally independent hierarchy, in which a set of parameters are coupled by making their distributions depend on a shared underlying parameter. These distributions are often taken to be identical, based on an assertion of exchangeability and an appeal to de Finetti's theorem.
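A conditionally independent hierarchy of the kind just described can be sketched in a few lines. The Gaussian forms and all numerical values below are hypothetical illustrations, not taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# A minimal conditionally independent hierarchy: a shared underlying
# parameter mu couples the group-level parameters theta_j, which are
# i.i.d. given mu -- the identical conditional distributions reflect
# the exchangeability assertion licensed by de Finetti's theorem.
mu = rng.normal(0.0, 5.0)                   # shared underlying parameter
n_groups = 4
theta = rng.normal(mu, 1.0, size=n_groups)  # group parameters, i.i.d. given mu
data = [rng.normal(theta[j], 0.5, size=20)  # observations within each group
        for j in range(n_groups)]
```

The structural point is that the theta_j are exchangeable but not independent: integrating out the shared mu induces dependence among them, which is how the hierarchy "borrows strength" across groups.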
Anderson (1984, p. 2) discusses the inadequacy of the ordered choice model we have examined thus far:
We argue here that the class of regression models currently available for ordered categorical response variables is not wide enough to cover the range of problems that arise in practice. Factors affecting the kind of regression model required are (i) the type of ordered categorical variable, (ii) the observer error process and (iii) the “dimensionality” of the regression relationship. These factors relate to the processes giving rise to the observations and have been rather neglected in the literature.
Generalizations of the model, e.g., Williams (2006), have been predicated on Anderson's observations, as well as some observed features in data being analyzed and the underlying data-generating processes.
It is useful to distinguish between two directions of the contemporary development of the ordered choice model. Although it hints at some subtle aspects of the model (underlying data-generating process), Anderson's argument directs attention primarily to the functional form of the model and its inadequacy in certain situations. Beginning with Terza (1985), a number of authors have focused instead on the fact that the model does not account adequately for individual heterogeneity that is likely to be present in micro-level data. This chapter will consider the first of these. Heterogeneity is examined in Chapter 7.
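For reference, the baseline ordered choice model discussed above is conventionally written in latent-regression form (the notation here follows the common textbook convention, not necessarily that used in earlier chapters):

```latex
y_i^{*} = \mathbf{x}_i' \boldsymbol{\beta} + \varepsilon_i, \qquad
y_i = j \quad \text{if } \; \mu_{j-1} < y_i^{*} \le \mu_j, \quad j = 0, 1, \dots, J,
```

with threshold parameters \mu_{-1} = -\infty < \mu_0 < \cdots < \mu_{J-1} < \mu_J = +\infty, and \varepsilon_i standard normal (ordered probit) or standard logistic (ordered logit). Anderson's critique and the heterogeneity literature beginning with Terza (1985) each relax a different part of this specification.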