Tree-based regression and classification ensembles form a standard part of the data-science toolkit. Many commonly used methods take an algorithmic view, proposing greedy methods for constructing decision trees; examples include the classification and regression trees algorithm, boosted decision trees, and random forests. Recent history has seen a surge of interest in Bayesian techniques for constructing decision tree ensembles, with these methods frequently outperforming their algorithmic counterparts. The goal of this article is to survey the landscape surrounding Bayesian decision tree methods, and to discuss recent modeling and computational developments. We provide connections between Bayesian tree-based methods and existing machine learning techniques, and outline several recent theoretical developments establishing frequentist consistency and rates of convergence for the posterior distribution. The methodology we present is applicable for a wide variety of statistical tasks including regression, classification, modeling of count data, and many others. We illustrate the methodology on both simulated and real datasets.
Tree-based regression and classification ensembles form a standard part of the data-science toolkit. Consider the generic problem of estimating the conditional distribution of a response Y given a vector of predictors X.
Decision trees have attracted the interest of practitioners due to their strong empirical performance, and many algorithmic methods for constructing decision trees exist (Breiman et al., 1984; Breiman, 2001; Friedman, 2001).
Bayesian approaches fundamentally differ from these algorithmic approaches in that they posit a full probability model for the data, combining a prior distribution over the unknown parameters with a likelihood for the observed data 𝒟 = {(X_i, Y_i) : 1 ≤ i ≤ n}.
Bayesian decision tree methods have experienced rapid development in recent years, and the purpose of this article is to offer a timely review. We first provide a basic review of the Bayesian classification and regression trees (BCART) and Bayesian additive regression trees (BART) methods, and provide an overview of the techniques used to fit them. We then make connections to other machine learning techniques, including random forests, boosting, and Gaussian process regression. Additionally, several recent works have provided theoretical justification for these methods by establishing optimal posterior convergence rates for variants of BCART (Rockova and van der Pas, 2017) and BART (Linero and Yang, 2017; Rockova and van der Pas, 2017); we review some of these developments.
Our notation follows Chipman et al. (2010).
The BCART framework was introduced by Chipman et al. (1998), who model the response by associating each observation with the leaf of a decision tree 𝒯 containing it, Y_i ∼ f(· ∣ μ_{ℓ(X_i)}), where ℓ(x) denotes the leaf to which x is assigned and μ_ℓ denotes the parameter attached to leaf ℓ of 𝒯.
The above approach extends naturally to other semiparametric models; for example, one obtains a natural extension to a generic exponential family by equipping each leaf with its own exponential-family parameter and using a conjugate prior.
We consider priors on (𝒯, ℳ) of the form π(𝒯, ℳ) = π(𝒯) ∏_{ℓ ∈ 𝒯} π(μ_ℓ). That is, the leaf parameters μ_ℓ are drawn independently from a common prior, conditional on the tree topology 𝒯.
We require a prior distribution for the tree structure. A common choice is to use a branching process prior, described initially by Chipman et al. (1998), under which a node at depth d is a branch (rather than a leaf) with probability γ(1 + d)^{−β}.
Once the tree topology is generated, we associate to each branch a decision rule of the form [x_j ≤ C_b], with the predictor j and the cutpoint C_b sampled from a prior distribution.
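To make the topology prior concrete, the following sketch samples a tree from a branching process in which a node at depth d splits with probability γ(1 + d)^{−β}; the helper name and the default values γ = 0.95, β = 2 are illustrative assumptions, not a specific implementation from the literature.

```python
import random

def sample_leaf_depths(gamma=0.95, beta=2.0, depth=0, rng=random):
    """Sample a tree topology from the branching-process prior: a node at
    depth d becomes a branch with probability gamma * (1 + d) ** (-beta).
    Returns the depths of the leaves, a minimal summary of the topology."""
    if rng.random() < gamma * (1 + depth) ** (-beta):
        # The node becomes a branch: recurse into its two children.
        left = sample_leaf_depths(gamma, beta, depth + 1, rng)
        right = sample_leaf_depths(gamma, beta, depth + 1, rng)
        return left + right
    return [depth]  # The node becomes a leaf.

random.seed(0)
leaves = sample_leaf_depths()
n_branches = len(leaves) - 1  # a binary tree with L leaves has L - 1 branches
```

Because the splitting probability decays polynomially in depth, the sampled trees are finite with probability one and tend to be shallow.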
There are several alternatives in the literature to the prior described above, differing in how the topology of 𝒯 is generated. Denison et al. (1998), for example, place a prior directly on the number of leaf nodes of the tree.
The BART framework introduced by Chipman et al. (2010) instead models the regression function as a sum of trees, f(x) = Σ_{t=1}^{T} g(x; 𝒯_t, ℳ_t), where g(x; 𝒯_t, ℳ_t) denotes the prediction at x of a regression tree with topology 𝒯_t and leaf parameters ℳ_t.
The BART algorithm applies most naturally to the semiparametric regression problem; repeated experience has shown that it typically outperforms BCART in this setting, and is competitive with state-of-the-art methods. Additionally, it is straightforward to extend BART to classification using the data augmentation strategy of Albert and Chib (1993). Much further work has gone into extending BART to various other problems, including survival analysis (Sparapani et al., 2016).
BCART is more flexible as a framework than BART in the sense that it generalizes more easily to non-regression settings. Whereas BCART allows for essentially any type of object in the leaf nodes (scalars, probability vectors, survival functions, etc.), one must be able to add together the parameters in the leaf nodes of BART, limiting us to leaf parameters in ℝ (or possibly ℝ^d).
Because the estimates of BART correspond to sums of step functions, BART produces estimates which are not smooth. To address this issue, Linero and Yang (2017) proposed SBART, a variant of BART in which the hard decision rules are replaced with smooth (probabilistic) rules, so that realizations from the prior are smooth functions.
Bayesian tree-based models ostensibly conduct model selection through which variables are used to construct splits. As noted by Linero (2016), however, Bayesian tree-based methods are not immediately applicable for variable selection due to the tendency for spurious predictors to be selected once the number of branches becomes sufficiently large. This issue is most obvious in the case of BART, which naturally produces ensembles with many branches, but is also expected to occur with BCART when sufficiently deep trees are required. One attractive option to conduct variable selection is to place a sparsity-inducing Dirichlet prior on the vector of splitting proportions s = (s_1, …, s_P), where s_j is the prior probability that a given branch splits on predictor j.
Linero (2016) shows that, conditional on the number of branches the ensemble includes, this prior induces sparsity in the number of predictors used to construct splits, allowing the posterior to adapt to sparse truths.
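The sparsity mechanism can be illustrated with a short simulation; the helper below is a sketch (hypothetical function name, illustrative hyperparameters) that draws splitting proportions from a symmetric Dirichlet(a/P, …, a/P) via normalized Gamma variates and then samples the predictor used at each branch.

```python
import random

def sample_branch_predictors(n_branches, P, a=1.0, rng=random):
    """Sketch of the sparsity-inducing prior: draw splitting proportions
    s ~ Dirichlet(a/P, ..., a/P) using normalized Gamma variates, then
    sample the predictor used at each branch from Categorical(s)."""
    g = [rng.gammavariate(a / P, 1.0) for _ in range(P)]
    total = sum(g)
    s = [x / total for x in g]
    return [rng.choices(range(P), weights=s)[0] for _ in range(n_branches)]

random.seed(1)
used = sample_branch_predictors(n_branches=50, P=100, a=1.0)
# For small a the mass of s concentrates on a few coordinates, so only a
# handful of the 100 predictors appear among the 50 branches.
n_distinct = len(set(used))
```

Repeating the draw shows that the number of distinct predictors used stays far below P, which is the qualitative behavior the Dirichlet prior is designed to induce.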
Constructing efficient algorithms for exploring the posterior is perhaps the biggest standing problem in the widespread adoption of Bayesian methods for decision trees. While Markov chain Monte Carlo (MCMC) is the most frequently used approach, we show in Section 4 that the widely available implementations of BCART generally perform poorly compared to recent particle filtering algorithms (Lakshminarayanan et al., 2013).
Consider first BCART under a prior respecting the factorized form described in the previous section.
Efficient inference requires that the marginal likelihood of a tree can be computed efficiently; fortunately, the normalizing constant of the posterior is not required for further computations.
where n_ℓ denotes the number of observations assigned to leaf ℓ and s_ℓ denotes the sum of their responses.
where, using the same notation as in the Poisson setting, the marginal likelihood again factors over the leaves of 𝒯.
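As a concrete sketch of such conjugate computations (assuming a Gaussian likelihood with a conjugate normal prior on each leaf mean; the hyperparameters here are illustrative, not those of any particular implementation), the leaf-level log marginal likelihood can be evaluated in closed form, and the tree-level marginal likelihood is the product over leaves:

```python
import math

def log_marginal_leaf(y, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of the responses in a single leaf under the
    conjugate model Y_i ~ N(mu, sigma2), mu ~ N(0, tau2), with mu
    integrated out analytically."""
    n = len(y)
    s = sum(y)                        # sufficient statistic: sum of responses
    ss = sum(v * v for v in y)        # sufficient statistic: sum of squares
    v = 1.0 / (n / sigma2 + 1.0 / tau2)   # posterior variance of mu
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            + 0.5 * math.log(v / tau2)
            - 0.5 * ss / sigma2
            + 0.5 * v * (s / sigma2) ** 2)

# The tree-level marginal likelihood factors over leaves:
log_m_tree = log_marginal_leaf([1.2, 0.8]) + log_marginal_leaf([-0.5])
```

For a single observation y = 0 this reduces to the density of N(0, σ² + τ²) at zero, a useful sanity check on the formula.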
With the marginal likelihood in hand, one then typically samples approximately from the posterior π(𝒯 ∣ 𝒟) using the Metropolis-Hastings algorithm (Hastings, 1970), which uses a transition kernel q(𝒯 → ·): given the current tree 𝒯, we propose 𝒯′ ∼ q(𝒯 → ·) and set 𝒯 ← 𝒯′ with probability

a(𝒯 → 𝒯′) = min{ 1, [π(𝒯′) m(𝒯′) q(𝒯′ → 𝒯)] / [π(𝒯) m(𝒯) q(𝒯 → 𝒯′)] },

where m(𝒯) denotes the marginal likelihood of 𝒯. The difficulty lies in constructing useful proposals q(𝒯 → ·). This is because there are typically a large number of trees with non-negligible posterior probability which are structurally dissimilar, separated by decision trees with low posterior probability. Hence, if q places its mass on trees which are structurally similar to 𝒯, one expects the sampler to get stuck in a local mode of π(𝒯 ∣ 𝒟). Works addressing the construction of proposal distributions to use with Metropolis-Hastings include Chipman et al. (1998) and Wu et al. (2007).
The transition function q(𝒯 → ·) is constructed as a mixture distribution over a collection of possible modifications to 𝒯. The following modifications were proposed by Chipman et al. (1998):
Grow: Randomly select a leaf node and turn it into a branch node with two leaf-node children, randomly sampling the predictor and cutpoint for the new splitting rule from the prior.
Prune: Select a branch node whose children are both leaves, and collapse it back into a single leaf node.
Change: Randomly select a branch node b and change its decision rule [x_j ≤ C_b], sampling the new predictor and cutpoint from the prior.
To complement these local moves, Wu et al. (2007) proposed more global restructure moves, which can make large changes to the topology of 𝒯 in a single step.
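To illustrate the accept/reject mechanics (and only the mechanics, not any particular implementation), the following sketch runs Metropolis-Hastings on a toy one-dimensional summary of a tree, namely its number of leaves, with hypothetical grow/prune proposals; here `log_post` stands in for log[π(𝒯) m(𝒯)].

```python
import math
import random

def mh_step(tree, log_post, propose, rng=random):
    """One Metropolis-Hastings step: propose a modification and accept it
    with probability min(1, posterior ratio x proposal-density ratio)."""
    new, log_q_fwd, log_q_bwd = propose(tree, rng)
    log_alpha = (log_post(new) - log_post(tree)) + (log_q_bwd - log_q_fwd)
    return new if math.log(rng.random()) < min(0.0, log_alpha) else tree

def grow_prune(tree, rng):
    """Toy grow/prune proposal on the number of leaves (must stay >= 1)."""
    if tree == 1 or rng.random() < 0.5:
        # Grow: forced when tree == 1, otherwise chosen with probability 1/2;
        # the reverse move is a prune, which has probability 1/2.
        return tree + 1, math.log(1.0 if tree == 1 else 0.5), math.log(0.5)
    # Prune: the reverse grow is forced when the proposed state equals 1.
    return tree - 1, math.log(0.5), math.log(1.0 if tree == 2 else 0.5)

random.seed(2)
state = 1
for _ in range(500):
    state = mh_step(state, lambda t: -0.5 * t, grow_prune, random)
```

The proposal-density correction (the log_q_bwd − log_q_fwd term) is what keeps the chain valid at the boundary where pruning is impossible, a detail that carries over directly to samplers on actual tree structures.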
Several authors have proposed sequential Monte Carlo (SMC) algorithms to approximate the posterior distribution, bypassing MCMC. By taking advantage of particle systems which evolve (almost) in parallel, they are also able to explore different modes of the posterior. Additionally, SMC methods may be easier to implement in practice than MCMC methods. The strategies described here are based on a generic SMC template for state-space models.
To apply SMC it is helpful to introduce auxiliary variables with a joint distribution constructed so that we can view the model as a state-space model. Lakshminarayanan et al. (2013), for example, build trees sequentially, expanding the leaves of each particle one stage at a time.
where δ_x denotes a point-mass distribution at x. To prevent weight degeneracy, one may resample particles from the current approximation of the posterior and set the importance weights to be uniform.
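The resampling step can be sketched as follows (a generic multinomial resampler, with illustrative names; practical SMC implementations often prefer stratified or systematic variants):

```python
import math
import random

def resample(particles, log_weights, rng=random):
    """Multinomial resampling: draw N particles with probability
    proportional to their weights, then reset to uniform weights. This
    prunes particles whose weights have degenerated toward zero."""
    n = len(particles)
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]  # numerically stabilized
    new = rng.choices(particles, weights=w, k=n)
    return new, [1.0 / n] * n

random.seed(3)
parts, w = resample(["T1", "T2", "T3"], [0.0, -2.0, -4.0])
```

Subtracting the maximum log-weight before exponentiating avoids underflow when the weights span many orders of magnitude.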
Taddy et al. (2011) developed a related particle learning approach in their dynamic trees model.
BART is fit by combining the Metropolis-Hastings schemes in Section 3.1 with a so-called Bayesian backfitting algorithm (Hastie and Tibshirani, 2000). The use of shallow trees by BART has a large computational payoff, as the “local moves” described in Section 3.1 are much more effective at exploring the posterior of shallow trees. Consider the regression setting, with f given by a sum of T trees.
After conditioning on the other trees {(𝒯_t, ℳ_t) : t ≠ j}, this reduces to the BCART problem for regression, with the pseudo-response R_i = Y_i − Σ_{t ≠ j} g(X_i; 𝒯_t, ℳ_t) in place of Y_i.
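The pseudo-response computation is a one-liner; in the sketch below (hypothetical helper names), `tree_fits[t][i]` stands for g(X_i; 𝒯_t, ℳ_t):

```python
def backfit_residuals(y, tree_fits, j):
    """Residuals used to update tree j in Bayesian backfitting:
    R_i = Y_i - sum over t != j of g(X_i; tree t)."""
    return [yi - sum(fit[i] for t, fit in enumerate(tree_fits) if t != j)
            for i, yi in enumerate(y)]

y = [3.0, 5.0]
tree_fits = [[1.0, 2.0], [0.5, 0.5]]  # fitted values of two trees at X_1, X_2
R = backfit_residuals(y, tree_fits, j=0)
# Only tree 1's fit is subtracted when updating tree 0, so R == [2.5, 4.5].
```

A full backfitting sweep simply cycles j over the T trees, re-sampling each tree and its leaf parameters against the current residuals.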
We now compare and contrast the various methods presented here on both simulated and real datasets. First we consider the
From Figure 3, we see that BCART and dynamic trees induce the least amount of smoothing, and essentially produce step functions. For comparison, the random forests algorithm produces a smoother estimate. BART smooths the data more than BCART, and produces a fit which is very similar to gradient boosting and random forests. The smoothest fit is given by SBART, which takes advantage of an assumed continuity of the regression curve.
Our experience is that the additional smoothing obtained by random forests, BART, gradient boosting, and SBART almost always leads to improved performance. Moreover, it is not clear if the lack of smoothness in BCART and dynamic trees is due to fundamental properties of the prior or is merely a symptom of inaccuracies in the Monte Carlo approximations. We have found dynamic trees to generally perform better than BCART, and speculate that this is due to the use of particle filtering rather than MCMC to fit the model.
To see more clearly the potential benefits of smoothing, we consider
where
On (
Finally, we consider the performance of these methods on the Wisconsin breast-cancer detection dataset. Given
We consider the
As restaurants are generally quite heterogeneous, one expects the
We set
where
where
where
Figure 7 displays the fit of BCART to the
Tree-based Bayesian methods possess a number of interesting connections to their algorithmic cousins, and in this section we describe several points of contact with conventional machine learning methods. In addition to the obvious connections between BCART and random forests and between BART and boosting, we provide less well-known connections between BART and kernel-based methods, focusing in particular on Gaussian process regression.
BCART is similar operationally to random forests. For regression tasks, both methods make predictions of the form f̂(x) = 𝔼[g(x; 𝒯, ℳ)], where 𝔼 is the expectation operator associated to (𝒯, ℳ) ~ ℙ for some probability distribution ℙ. Further, both approximate this expectation by Monte Carlo, averaging the predictions of trees sampled from ℙ.
BCART and random forests differ in the distribution ℙ that they aim to sample from. A large advantage random forests have over BCART is that one can sample from its ℙ exactly and independently, by fitting a greedy tree to a bootstrap sample of the data, whereas sampling trees from the BCART posterior requires MCMC.
BART has an interesting connection to kernel-based estimators. Note that the BART model can equivalently be written as a linear combination of leaf indicator functions, f(x) = Σ_{t=1}^{T} Σ_{ℓ} μ_{tℓ} 1(x ∈ ℓ), where the inner sum ranges over the leaves ℓ of tree 𝒯_t.
The random variables {μ_{tℓ}} are a priori iid N(0, σ_μ²/T), so that, conditional on the tree topologies, f is a mean-zero Gaussian process; as T → ∞ its covariance kernel converges to k(x, x′) = σ_μ² Pr(x and x′ fall in the same leaf of a tree sampled from the prior).
“Theorem” 1 is heuristic because a formally correct statement requires technical conditions on the prior. Theorem 1 is intuitively implied by the convergence of the finite-dimensional joint distributions noted above, but giving a formal statement and proof is an exercise in empirical process theory (Van der Vaart and Wellner, 1996) that we omit.
This connection to Gaussian processes is interesting from several standpoints. First, it provides a link between BART and kernel-based methods such as Gaussian process regression and support vector machines. Second, it formalizes several known connections between decision tree ensembling methods and kernel-based methods. For example, Scornet (2016) provides connections between random forests and kernel methods, which were known also to Breiman (2000). Remarkably, the kernels corresponding to BART, random forests, and Mondrian forests are very similar and often coincide; in particular, the kernels are similar to the exponential kernel described in the following proposition.
A very similar result is obtained by Balog et al. (2016) for Mondrian forests.
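The kernel point of view is easy to probe numerically: under the branching-process topology prior with splits drawn uniformly within the current cell (an assumed, illustrative specification, with hypothetical helper names), the prior kernel is the probability that two points land in the same leaf, which can be estimated by Monte Carlo.

```python
import random

def same_leaf_prob(x, xp, n_trees=2000, gamma=0.95, beta=2.0, rng=random):
    """Monte Carlo estimate of the prior tree kernel: the probability that
    x and xp fall in the same leaf of a random tree on [0, 1]^p, assuming
    cutpoints drawn uniformly within the current cell."""
    def together(lo, hi, depth):
        if rng.random() >= gamma * (1.0 + depth) ** (-beta):
            return True                   # reached a leaf with both points
        j = rng.randrange(len(lo))        # coordinate to split on
        c = rng.uniform(lo[j], hi[j])     # cutpoint inside the current cell
        if (x[j] <= c) != (xp[j] <= c):
            return False                  # the split separates the points
        if x[j] <= c:
            hi = hi[:]; hi[j] = c         # both points fall left
        else:
            lo = lo[:]; lo[j] = c         # both points fall right
        return together(lo, hi, depth + 1)
    p = len(x)
    hits = sum(together([0.0] * p, [1.0] * p, 0) for _ in range(n_trees))
    return hits / n_trees

random.seed(4)
k_near = same_leaf_prob([0.50], [0.51])
k_far = same_leaf_prob([0.10], [0.90])
# Nearby points are rarely separated by a split, so k_near exceeds k_far.
```

The estimated kernel decays with the distance between the two points, mirroring the behavior of the exponential kernel discussed above.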
A natural question one might ask is “given this connection, why not use Gaussian process regression directly?” First, Gaussian process regression does not scale favorably with the sample size; without making any approximations, training a Gaussian process regression model requires Θ(n³) computation, owing to the need to factorize an n × n covariance matrix.
The original motivation for the BART algorithm given by Chipman et al. (2010) was as a Bayesian analog of boosting, which similarly builds an ensemble of trees in which each tree explains a small portion of the variation in the response.
BART differs from boosting in a number of ways. First, boosting algorithms are greedy, and do not update trees once they are included in the model. By contrast, BART makes use of a Bayesian backfitting algorithm which continuously updates all trees in the model. Second, boosting algorithms run the risk of overfitting the data as more trees are added, while BART is shielded from this risk due to its link with Gaussian processes noted in Section 5.2. Third, the Bayesian paradigm allows for automatic tuning of hyperparameters through the use of priors, while boosting is typically tuned by cross-validation.
Beyond consistency, one may be interested in the convergence rate of the posterior. A natural target to aim for is the minimax rate for a particular estimation problem. For the semiparametric regression problem, when the true regression function is α-Hölder smooth in d relevant predictors, the minimax rate is n^{−α/(2α+d)}, possibly up to logarithmic factors.
The limitation
In view of Rockova and van der Pas (2017), there is no difference in convergence rate between BART and BCART for the problem of estimating a sparse function.
In this paper we have reviewed Bayesian tree-based models, focusing on recent developments in computation and modeling. We also outlined connections to more traditional machine learning algorithms and reviewed some exciting results on posterior consistency which fill a previously long-standing gap between theory and practice. Much work remains to be done; in particular, further improvements in computation would make these methods more feasible in structured regression problems where boosting algorithms currently dominate (Nielsen, 2016).
Also important to the dissemination of these methods is the development of software, and currently there are many options. The fits in this paper were obtained using the
This research was partially supported by NSF grant DMS-1712870 and DOD grant SOT-FSU-FATS-16-06.