Nonparametric Bayesian methods have seen rapid and sustained growth over the past 25 years. We present a gentle introduction to the methods, motivating the methods through the twin perspectives of consistency and false consistency. We then step through the various constructions of the Dirichlet process, outline a number of the basic properties of this process and move on to the mixture of Dirichlet processes model, including a quick discussion of the computational methods used to fit the model. We touch on the main philosophies for nonparametric Bayesian data analysis and then reanalyze a famous data set. The reanalysis illustrates the concept of admissibility through a novel perturbation of the problem and data, showing the benefit of shrinkage estimation and the much greater benefit of nonparametric Bayesian modelling. We conclude with a too-brief survey of fancier nonparametric Bayesian methods.
There are many ways to use data, whether experimental or observational, to better understand the world and to make better decisions. The Bayesian approach distinguishes itself from other approaches with two distinct sources of sound foundational support. The first is the theory of subjective probability, developed from a set of axioms that describe rational behavior. Subjective probability provides an alternative to the relative frequency definition of probability. Under subjective probability, individuals are free to have their own assessments of probabilities. This theory leads inexorably to Bayesian methods (Savage, 1954). The second source of support is decision theory, which formalizes statistical inference as a decision problem. The combination of state-of-nature (parameter) and action (say, an estimate) yields a loss, and a good inference procedure (decision rule) leads to a small expected loss. Nearly all agree that inadmissible inference procedures are to be avoided. The complete class theorems show that the entire set of admissible inference procedures consists of Bayesian procedures and of procedures that are close to Bayesian in a technical sense (Berger, 1985). Procedures that are far from Bayesian can be useful, but they must be justified on special grounds: for example, our inability to discover a dominating procedure, our inability to implement the dominating procedure due to computational limitations, or the need to address robustness issues, perhaps due to the shortcomings of our formal mathematical model.
The twin perspectives of subjective probability and decision theory have convinced many that statistical inference should be driven by Bayesian methods. However, neither of these perspectives describes
Consistency is a fundamental principle of statistical inference. The simplest description of consistency is for a sequence of increasingly large random samples arising from a distribution. With such data, our inference ideally settles on the true distribution. This is captured through the usual definitions of consistency from classical statistics, where a consistent estimator of the parameter governing the distribution converges to the true parameter value in some sense.
The traditional definition of consistency has its parallel for subjective probabilists. Here, for consistency, the posterior distribution concentrates in each arbitrarily small neighborhood of the true parameter value. That is, for any given
Bayesian purists have argued that Bayesian methods are always consistent–that the methods naturally lead to consistent estimation for any parameter value in the support of the prior distribution. While this is true under very mild conditions and with an appropriate definition of support, it begs an important question. If
A sequence of coin flips illustrates this point in a simple, parametric setting.
where the Bernoulli trials are, conditional on
The posterior distributions corresponding to these prior distributions vary considerably. For data suggesting the coin is fair, all posteriors are consistent with the data. However, consider the behavior when the sample proportion,
Neither the development of subjective probability nor the decision theory-to-admissibility route to Bayes precludes the choice of a prior distribution with restricted support. A productive view is that the parameter
False consistency, focusing on the convergence of estimates or posterior distributions, provides a counterpoint to consistency. We say that an inference is falsely consistent if the inference is for something that has never been observed and the inference becomes degenerate.
A simple example illustrates the concept. Suppose that we observe a sequence of
so that
If one assumes that
Consider the conditional distribution of
Is this concentration of the posterior good or bad? For me, the answer depends upon the distribution from which the data arise. If the distribution is indeed normal, the limiting estimated distribution is correct; if not, the limiting estimated distribution will typically differ from the actual conditional distribution. The result is similar to the earlier Bernoulli example with the uniform prior distribution for
Returning to Cromwell’s dictum, it is certainly possible that the distribution of the
To produce the desired behavior, we need a prior distribution with large support. The support should allow for consistent estimation of the distribution of
Nonparametric Bayesian methods take their name from the definition of nonparametric as not describable by a finite number of parameters. The methods rely on models for which the effective number of parameters grows with the sample size. There are two main streams of Bayesian nonparametrics. One stream focuses on a function such as a regression function and replaces the traditional linear regression with a much more flexible regression. The prior distribution in this case is a probability model for “wiggly curves”. We do not pursue this stream further here. The other stream focuses on probability distributions. It replaces the traditional parametric family of distributions with a much larger family. Instead of placing a prior distribution on a parametric family, that is, instead of writing a probability model for the parameters that govern the distribution, we write a probability model for the distributions themselves. To do so takes some technical work, with the first fully satisfying papers written in the early 1970s, and with antecedents stretching back at least to the 1950s.
The early boom in Bayesian nonparametrics was touched off by Ferguson’s (1973) major paper which defines the Dirichlet process. He provides two basic properties that a prior distribution should have in order to qualify as both nonparametric and useful.
The prior distribution should have full support in some relevant space.
One should be able to perform the update from prior distribution to posterior distribution.
At the time of Ferguson’s work, models were simpler and computational resources were limited. Full support for a distribution function translated to full support among distributions on ℛ
The Dirichlet process plays a central role in Bayesian nonparametrics. In part, this is due to its being the first Bayesian nonparametric model to be developed; in part, this is due to its having several valuable representations, each of which lends itself to useful modelling. The remainder of this section describes constructions and basic properties of the Dirichlet process. For complete details, see the references.
The Dirichlet process is connected to the beta distribution, and through it to the multinomial distribution. The beta-binomial model has the form
where
The Dirichlet-multinomial model extends the beta-binomial model to more categories for the observable data.
where all
The categories of the multinomial distribution can be refined as needed, as shown in Figure 1 for data on an interval. For each partition, we write a Dirichlet-multinomial model.
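For a fixed partition, the conjugate update can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names, the three-category partition, the prior parameter vector, and the counts are all hypothetical choices made for the example.

```python
import numpy as np

def dirichlet_multinomial_update(alpha, counts):
    """Posterior Dirichlet parameter: prior parameter plus category counts."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return alpha + counts

def predictive_probs(alpha):
    """Probability that the next observation falls in each category."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

# Hypothetical prior parameters and observed counts for three categories.
alpha_prior = np.array([1.0, 2.0, 1.0])
counts = np.array([3, 0, 7])

alpha_post = dirichlet_multinomial_update(alpha_prior, counts)
post_pred = predictive_probs(alpha_post)
print(alpha_post, post_pred)
```

The same two functions apply unchanged to any finer partition, provided the prior parameter of a split category is divided among its pieces.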
There are two big questions about this refinement. The first is, if we pass to the limit, refining categories so that each pair of reals eventually falls into different categories, does a limiting “Dirichlet-multinomial” exist? The second is, if the limiting object exists, how do we work with it? Ferguson (1973) passed to the limit, answering both questions and constructing the Dirichlet process.
To keep the models consistent across refinements of the partition, we require the same probability statements about a fixed, coarser category given any partition. To ensure this self-consistency while staying within the Dirichlet-multinomial framework, when a category is split, its parameter is divided and allocated to the pieces of the split. For example, if category 1 is partitioned into two subcategories, say 11 and 12, we require
The refinement of partitions suggests the gamma process construction of the Dirichlet process. Traditional distribution theory provides a link between beta distributions and gamma distributions. If
The Dirichlet-multinomial refinement splits category parameters, while the refinement for the gammas splits the shape parameters in the same way. The gamma is an infinitely divisible distribution, and so the refinement is self-consistent through the limit.
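The self-consistency of the split can be checked by simulation. The sketch below draws Dirichlet vectors by normalizing independent gamma variates and compares a coarse two-category model with a refinement in which category 1 is split into 11 and 12. The parameter values, sample size, and function name are assumptions made for illustration only.

```python
import numpy as np

def dirichlet_from_gammas(alpha, size, rng):
    """Draw Dirichlet(alpha) vectors by normalizing independent Gamma(alpha_k, 1) draws."""
    alpha = np.asarray(alpha, dtype=float)
    g = rng.gamma(shape=alpha, scale=1.0, size=(size, alpha.size))
    return g / g.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical parameters: category 1 (parameter 2.0) is split into
# subcategories 11 and 12 with parameters 0.5 and 1.5, which sum to 2.0.
coarse_alpha = np.array([2.0, 3.0])
fine_alpha = np.array([0.5, 1.5, 3.0])

coarse = dirichlet_from_gammas(coarse_alpha, 100_000, rng)
fine = dirichlet_from_gammas(fine_alpha, 100_000, rng)

# Recombining 11 and 12 should reproduce the coarse distribution of category 1.
merged_first = fine[:, 0] + fine[:, 1]
print(coarse[:, 0].mean(), merged_first.mean())
print(coarse[:, 0].var(), merged_first.var())
```

Under these parameters the probability of category 1 follows a Beta(2, 3) distribution in both models, so the two printed means and the two printed variances should agree up to Monte Carlo error.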
Turning the problem around, we start with the limiting process and ask how we can match every Dirichlet-multinomial refinement. The limiting process is called a Dirichlet process. For a prior on distributions on the real line, we do so by replacing the parameter vector
The limiting model is written as
The parameter
The construction of the Dirichlet process tells us how to find the posterior distribution. For the beta-binomial and Dirichlet-multinomial problems, the posterior is in the same family with parameter vector updated with category counts. Equivalently, we consider the independent gamma distributions with shape parameters updated with the category counts. Passing to the limit, we update the gamma process with a point mass at each observed value. The corresponding gamma process has measure
The prior predictive distribution for the Model (
The predictive distribution for an observation from this model is the base cdf
The posterior predictive distribution of (
The conjugate pair of prior and likelihood leads to a posterior in the same family. The posterior predictive distribution follows the same form, with updated base measure,
Blackwell and MacQueen (1973) present a distinctly different view of the Dirichlet process. They focus on the successive predictive distributions for a sequence of observations drawn from
Blackwell and MacQueen view this as an urn scheme. The initial draw is taken from an urn containing a rainbow of colors. When the draw (color) is observed, it is returned to the urn along with an extra ball of the same color. The draws proceed sequentially, always according to the same rule. It is easy to see that these draws lead to the same joint distribution on the observable sequence of
where w.p. stands for “with probability”.
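The urn scheme translates directly into a simulation. In the sketch below, a fresh value is drawn from the base distribution with probability proportional to the mass parameter, and otherwise a previous draw is copied, chosen uniformly at random. The names `mass` and `base_draw`, the standard normal base distribution, and the sample size are assumptions of this example, not notation from the paper.

```python
import numpy as np

def polya_urn_sample(n, mass, base_draw, rng):
    """Simulate n draws from the Polya urn with the given mass parameter."""
    draws = []
    for i in range(n):
        # With i earlier draws, a new value appears w.p. mass / (mass + i);
        # the first draw (i = 0) always comes from the base distribution.
        if rng.random() < mass / (mass + i):
            draws.append(base_draw(rng))
        else:
            # Otherwise repeat one of the i earlier draws, chosen uniformly.
            draws.append(draws[rng.integers(i)])
    return np.array(draws)

rng = np.random.default_rng(1)
x = polya_urn_sample(200, mass=1.0, base_draw=lambda r: r.normal(), rng=rng)
n_distinct = len(np.unique(x))
print(n_distinct)
```

The repeated values in `x` are the ties that reveal the discreteness of the underlying random distribution: the number of distinct values grows far more slowly than the number of draws.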
The Polya urn scheme yields an easy proof of the discreteness of the Dirichlet process. From the above description, we see that the probability that
The discreteness of the Dirichlet process poses problems for its use as a model for observable data. The most commonly used models for discrete data, such as those in an exponential family, have the same support for all distributions in the family. This provides an element of robustness for likelihood-based analyses, as a single observation does not zero out any parameter values and so has a limited impact on inference. In contrast, two
To alleviate this problem of discreteness, the Dirichlet process is used as a latent stage in a hierarchical model. The distribution
The distribution
One specific MDP model is a conjugate location mixture of normal distributions, with
The MDP model is often embedded as part of a larger hierarchical model. A prior distribution may be placed on the parameters governing the base cdf, the mass parameter, or jointly on the two. Prior distributions are placed on the parameters in the smoothing kernel
Sethuraman (1994) provided an alternative construction of the Dirichlet process which follows from the Polya urn scheme and the internal consistency of a Bayesian model. Under his construction, the distribution
The
A stream of data arising from
This construction directly targets
The parameter of the Dirichlet process is variously given as the measure,
Bush
It should be noted that Bayes factors for the MDP model (and indeed for directly observed data) are problematic. Ferguson (1973) cautions the reader about the impact of discreteness on tests. Xu
Computational methods for the mixture of Dirichlet process model are well developed. Escobar’s (1988) landmark dissertation work (also Escobar, 1994) developed a Markov chain Monte Carlo (MCMC) strategy for fitting the models in a paper that predates Gelfand and Smith’s (1990) development of the Gibbs sampler for general Bayesian inference. The centerpiece of this computational strategy is to set up a Markov chain with the posterior distribution as its limiting distribution, and to then obtain a realization of the Markov chain.
The Gibbs sampler relies on a sequence of draws from conditional distributions. The sampler itself works on a marginalized version of the model, removing the infinite-dimensional
The conditional distribution for
The superscript – indicates that
This early Gibbs sampler suffers from poor mixing. Once placed, a value of
There have been many advances since these early samplers. Escobar and West (1995) show how to place a distribution on
The work on MCMC methods for the MDP model spilled over to impact MCMC methods for parametric Bayesian models. The early examples of marginalization and reparameterization gave rise to an understanding of the value of multiple reparameterizations of a model within an MCMC cycle. These techniques implicitly involve parameter expansion and marginalization to improve mixing. They provide a clear view of the benefits of non-identifiability in MCMC (MacEachern, 2007). These innovations all flow from the multiple representations of the Dirichlet process described in the preceding section.
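To make the marginalized Gibbs sampler concrete, the following sketch implements one version of it for a conjugate normal location mixture. It is an illustration under stated assumptions, not the algorithm as written in any of the cited papers: the kernel variance `sigma2`, the base-measure mean `mu0` and variance `tau2`, and the mass parameter `mass` are treated as known, and the data are simulated.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gibbs_sweep(theta, y, mass, mu0, tau2, sigma2, rng):
    """One pass of one-at-a-time updates of the latent locations theta."""
    n = len(y)
    post_var = 1.0 / (1.0 / tau2 + 1.0 / sigma2)
    for i in range(n):
        others = np.delete(theta, i)
        # Weight for copying each other location: likelihood of y[i] at that value.
        w_old = normal_pdf(y[i], others, sigma2)
        # Weight for a fresh location: mass times the marginal density of y[i].
        w_new = mass * normal_pdf(y[i], mu0, tau2 + sigma2)
        w = np.append(w_old, w_new)
        k = rng.choice(n, p=w / w.sum())
        if k == n - 1:
            # Fresh value from the single-observation conjugate posterior.
            post_mean = post_var * (mu0 / tau2 + y[i] / sigma2)
            theta[i] = rng.normal(post_mean, np.sqrt(post_var))
        else:
            theta[i] = others[k]
    return theta

rng = np.random.default_rng(7)
# Simulated data from two well-separated groups (hypothetical).
y = np.concatenate([rng.normal(-3.0, 1.0, 20), rng.normal(3.0, 1.0, 20)])
theta = y.copy()
for _ in range(50):
    theta = gibbs_sweep(theta, y, mass=1.0, mu0=0.0, tau2=25.0, sigma2=1.0, rng=rng)
n_distinct = len(np.unique(theta))
print(n_distinct)
```

Because each location is updated one at a time and a shared value can change only when its cluster empties, this basic version inherits the slow mixing of the early samplers; it is meant to show the structure of the conditional draws, not to serve as a practical sampler.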
Additional computational strategies have been explored. Sequential importance sampling (Liu, 1996; MacEachern
Current computational work is focused on the development of algorithms that scale up to much bigger data sets. Three themes are showing their value across a broad range of techniques. First, a willingness to accept analytic approximation as a necessary tradeoff for scale and speed. Second, a deliberate oversimplification of the model to admit conjugate forms which often serves the dual purpose of enabling fast computation and improving approximations. Third, a truncation of a complex model with estimates plugged in for some parameters to avoid the time needed for exploration of the full posterior distribution.
One strategy which has proven effective for scaling algorithms is to find an algorithm which is “exact” for a narrowly defined problem and to then apply a version of the algorithm in other cases. In the MDP setting, a redistribution of mass algorithm (MacEachern, 1988) and an associated direct sampling algorithm (Kim, 1999; Kuo and Smith, 1992) allow for exact evaluation of and simulation from the posterior distribution for survival data. Newton and Zhang (1999) proposed a variant on this method to quickly approximate the posterior in non-survival settings. Martin and Tokdar (2009) pursue this approach and establish asymptotic properties. In a similar vein, variational methods (Blei and Jordan, 2006) are exact in certain simple settings, but become approximations in more complex settings. They stand out for providing a quick, scalable approximation to the posterior. The challenges of variational methods include choice of an implementation and understanding the accuracy of the approximation. Data-splitting techniques seek to perform calculations with some level of parallelization, thereby allowing larger data sets. Jara’s DPPackage (Jara
There are two main perspectives for use of the MDP model. These perspectives mimic general developments in the Bayesian community since the 1970s. One follows the objective Bayesian philosophy, seeking to impose little structure on a problem and seeking to incorporate a minimum of subjective information into the analysis. Under this approach, the model is used for density estimation, and inferences are extracted from the density estimate. Lo (1984) espoused this philosophy, although its implementation in realistic problems was suspended until computational developments made the method practical. Escobar and West (1995) pursue this strategy, focusing on the one-sample problem in one or several dimensions.
Müller
Developments in density estimation include theoretical results as well as empirical implementation. Results current at the time are presented in Ghosh and Ramamoorthi (2003), while further developments include Ghosal and van der Vaart (2001) and Walker (2004). These authors lay the theoretical foundation for density estimation with the MDP model, providing a selection of results on consistency of the estimators and on rates of convergence.
Experience with the density estimators has produced improvements on the basic MDP model. Griffin (2010) modified the model by partitioning variance between the base measure and the smoothing kernel. Empirical evidence suggests that this partition improves density estimation. Bean
The second perspective on use of the MDP model follows the development of the hierarchical model. It is driven by first determining the structure of the model and then filling in appropriate distributions for the various parts of the model. The Dirichlet process or MDP model typically appears in a portion of the model where the analyst desires full support. The literature contains many examples of the successful use of this strategy.
The most evident uses of the Dirichlet process or MDP model are for portions of the model devoted to what, in a classical analysis, would be termed random effects. Bush and MacEachern (1996) argue that full support for random effects means full support on the distributions from which independent and identically distributed draws are made–that is, the model is used to produce effects which are exchangeable for all
For linear regression problems, Gelfand and Kottas (2002) provide median zero errors and describe a computational strategy for inference for essentially arbitrary functionals of the Dirichlet process. MacEachern and Guha (2011) explain how the posterior for regression coefficients can be more concentrated with an MDP model for the errors than with a normal model for the errors. Wang (2009) considers analogs of weighted least squares, distinguishing between models for a scale family, a convolution family, and families in between.
The decision-theoretic motivation for nonparametric Bayesian methods relies on the concept of admissibility. Admissibility is, in turn, closely linked to the development of empirical Bayesian methods (Efron and Morris, 1975) which in many cases are popularly equated with “shrinkage”. In this subsection, we revisit a famous data set, tinkering with the example to investigate the notions of admissibility and shrinkage and contrasting them with nonparametric Bayesian modelling.
Efron and Morris (1975) motivated empirical Bayesian methods with an example which is compelling to all who are familiar with baseball. They set up a prediction problem for players in the United States’ major leagues. The data from which the predictions are to be made consist of the results of the first 45 at bats for a set of players, described in terms of their batting average (here, we take batting average to be the proportion “hits divided by at bats”). The goal is to predict, based only on these data, each player’s batting average for the remainder of the season. The quality of a single forecast is judged by squared prediction error, and the quality of the collection of forecasts is judged by sum of squared prediction error.
Baseball fans know much about batting averages. At the time, the league-wide average was about 0.250, with the highest averages in the mid 0.300s. It would be very unlikely that any season-ending average would be as high as 0.400. Various players are good hitters (here, a high batting average) or are poor hitters, and the differences can be substantial. The data, consisting of all players with 45 at bats on a particular day, exclude pitchers who would have too few at bats. They do include one exceptionally good hitter, by design. Also, after a mere 45 at bats, there is substantial variation in the sample batting average. The observed averages of 0.156 to 0.400 show a clear excess of variation: all would agree that the large values should be pulled down, and the small values pushed up.
Formally, we begin with a too-simple model for the data, where
The original focus was on the comparison of estimation techniques such as maximum likelihood or method of moments to empirical Bayesian estimation in the James-Stein style, or shrinkage estimation. Defining the remainder of the season batting average for player
where the cross-product terms disappear due to the presumed independence of the observed data and the remainder of the season performance, conditional on the
Our comparison begins with maximum likelihood estimation. The playerwise maximum likelihood estimate is
This estimate results in SSPE_{2} = 0.0267, showing a strong advantage for the shrinkage estimator relative to maximum likelihood.
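The comparison between maximum likelihood and shrinkage can be mimicked on simulated data. The sketch below does not use the Efron and Morris data and does not reproduce the estimator used for the figures reported here: the simulated batting abilities, the number of later at bats, and the moment-based shrinkage weight are all assumptions chosen to show the mechanics of pulling each sample proportion toward the grand mean.

```python
import numpy as np

def shrinkage_estimates(hits, at_bats):
    """Pull each sample proportion toward the grand mean (moment-based weight)."""
    p_hat = hits / at_bats
    grand = p_hat.mean()
    # Observed spread = between-player variance + binomial sampling variance.
    sampling_var = grand * (1.0 - grand) / at_bats
    between_var = max(p_hat.var(ddof=1) - sampling_var, 0.0)
    weight = between_var / (between_var + sampling_var)
    return grand + weight * (p_hat - grand)

def sspe(estimates, targets):
    """Sum of squared prediction errors."""
    return float(np.sum((estimates - targets) ** 2))

rng = np.random.default_rng(2016)
# Hypothetical simulation settings, not the real data.
n_players, at_bats, later_at_bats = 18, 45, 400
true_p = rng.uniform(0.20, 0.32, size=n_players)
hits = rng.binomial(at_bats, true_p)
later_avg = rng.binomial(later_at_bats, true_p) / later_at_bats

mle = hits / at_bats
shrunk = shrinkage_estimates(hits, at_bats)
print("SSPE, maximum likelihood:", sspe(mle, later_avg))
print("SSPE, shrinkage:", sspe(shrunk, later_avg))
```

Every shrunken estimate lies between the player's sample proportion and the grand mean, which is the sense in which large values are pulled down and small values pushed up.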
Our third estimation strategy relies on a nonparametric Bayesian model, a straightforward mixture of Dirichlet processes in the style of Berry and Christensen (1979). The model is, with
The base measure for the Dirichlet process component of the model is Lebesgue measure on (0, 1).
We fit the model in (
The substantial improvements over maximum likelihood in this example come as no surprise to those who follow baseball, as it is clear that the maximum likelihood estimates need to be adjusted in some fashion. However, admissibility suggests much more–namely that the type of strategy we pursue should bring benefits for any parameter vector
The three approaches were fit with
Figure 3 summarizes the results of the study. The performance of maximum likelihood is unaffected by the reversals, and SSPE_{1} remains constant at 0.0857. As expected, the shrinkage estimator performs more poorly than before, with SSPE_{2} increasing as the number of reversals is increased. With 9 reversals, the player-wise estimates are shrunk to something near 0.500. The vertical lines in the figure extend from the 25
The second panel in Figure 3 quantifies the benefits of shrinkage estimation and nonparametric Bayesian techniques relative to maximum likelihood. The figure shows the percentage decrease in SSPE above the noise floor, relative to maximum likelihood. The decreases for shrinkage estimation with no reversals and for the nonparametric Bayes method, with or without reversals, are extreme, holding steady at about 80%. Decreases for shrinkage estimation with reversals range from the mid teens to slightly above 20%.
The success of shrinkage estimation lies in its dual as a Bayes estimator. Although the usual description is in terms of shrinkage, the estimates exactly match posterior means from a set of conjugate beta-binomial models, one for each player. For the real data, the beta prior distribution is set from the data, with a prior mean of
The nonparametric Bayesian approach focuses on modelling, and the prior distribution has full support for distributions on the interval. When there are no reversals, the data themselves drive the posterior distribution on the unknown
The Dirichlet process and its extension to the MDP model are widely used. There are many additional Bayesian nonparametric methods which serve as prior distributions for an unknown distribution. Nearly all of them are suited to use as a component in a hierarchical model. Some focus on low-dimensional distributions, some on discrete problems, some on particular applications. These methods are often motivated by a shortcoming of the MDP model and seek to repair the shortcoming. Most have parallel tracks of applied modelling, computational strategies, and theoretical results. This section provides capsule descriptions of a few of the more prominent of these methods.
The early variations on the Dirichlet process addressed difficulties in implementation. Directly observed data are easier to handle than indirectly observed data. Coupling this with the tailfree nature of the Dirichlet process (Doksum, 1974) suggested modifications appropriate for survival analysis. Dykstra and Laud (1981) considered monotonicity of a hazard rate, developing the extended gamma process which is computationally conjugate for survival data consisting of exact event times and right censored event times. Hjort (1990) created the beta process which has become the mainstay of Bayesian survival analysis. Walker
There are several ways to model continuous distributions. The Polya tree (Lavine, 1992; Mauldin
Several variants on the Dirichlet process as a countable mixture distribution have been developed. Ishwaran and James (2001) pursue Pitman-Yor processes, adding an extra parameter to the Dirichlet process. James
The early work on nonparametric Bayesian methods focused on a single distribution. From this base, a variety of authors began to build through the standard progression of models seen in undergraduate and graduate coursework. The one-sample problem (a single distribution) leads to the
It would be a major undertaking to describe the work on collections of distributions and how they relate to data. A brief overview follows. Müller
MacEachern (1999, 2000) developed dependent Dirichlet processes to address the regression problem in a nonparametric fashion. For a simple example, imagine the growth charts by which children’s health is tracked in many countries. For each given age, percentiles for height, weight, and other physical characteristics are established. These percentiles show clear evidence of non-normality that would be difficult to capture with a low-dimensional model. For this snapshot at any given time, a nonparametric model is appropriate. Considering growth, the distributions are continuous in time. The technology that is needed is a mechanism for placing a probability distribution on a collection of nonparametric distributions which evolve in a continuous fashion as a covariate changes, here time. Four main properties were proposed: (i) large (full) support at any fixed covariate value, (ii) feasible computation for updating prior to posterior, (iii) an understandable marginal distribution at each fixed covariate value, and (iv) continuity of the realized distributions.
The key to constructing a model with these properties is Sethuraman’s construction of the Dirichlet process. For a simple version of the dependent Dirichlet process, the single-p model, we replace the draw of a scalar or vector
This recipe for constructing a prior distribution for a collection of distributions has considerable flexibility. The main ingredients are a covariate space which doubles as an index set for stochastic processes, a stochastic process for the locations (the
The development of these processes expanded a line of work investigating the split between modelling and inference where modelling takes place in a high-dimensional space and inference in a much lower dimensional space (MacEachern, 2001; Walker and Gutiérrez-Pena, 1999). Recent developments stress the differences between inference in the traditional, low-dimensional parametric setting and in high-dimensional settings (Hahn and Carvalho, 2015; Lee and MacEachern, 2014).
Many variants on and special cases of the dependent Dirichlet process have been developed. MacEachern
Distinct collections of models have been developed for alternative data structures. To name but two, Griffiths and Ghahramani’s (2011) Indian buffet process opened new territory for latent feature models. Orbanz and Roy (2015) pursue the full support argument and create deep results and effective models for network data.
Only a few decades ago, Bayesian methods were regarded by many in the statistics community as a curiosity. They had strong theoretical support, but there was great difficulty in implementing the methods, especially beyond low-dimensional conjugate settings. With the development of the hierarchical model, modern computational strategies, and the widespread availability of complex data and fast computing, the methods have proven their value in studies throughout the academic and corporate worlds.
Similarly, many Bayesians in the recent past considered nonparametric Bayesian methods to be a curiosity with strong theoretical support but great difficulty in implementation. The developments that have propelled the success of Bayesian methods have also propelled the success of nonparametric Bayesian methods. Furthermore, in our current data-rich environment, the typical parametric model can be falsified, with a clear need to use a more complex model. Rather than expanding the model slowly by adding a parameter or two at a time, nonparametric Bayesian methods jump directly to an infinite-dimensional parameter space. To make this jump successfully requires some experience with model building and some knowledge of these models in particular. Once this experience is gained, the empirical evidence of success, across a range of problems and by many practitioners, is overwhelming. With a wealth of data and sophisticated models and computation in place, nonparametric Bayesian approaches are assured of an interesting and active future.
This review has touched on only a small portion of the work on nonparametric Bayesian methods. The papers I refer to are tilted toward the topics I have chosen for the review and toward references I am more familiar with. Additional excellent reviews and books include Müller
This work was supported in part by the United States National Science Foundation grant number DMS-1613110. The views expressed in this work are not necessarily those of the NSF. This paper closely follows a tutorial presented at the meeting of the Bayesian Statistics Section of the Korean Statistical Society on Jeju Island in 2016. The supplementary materials contain the slides from the tutorial.