A review of tree-based Bayesian methods

Antonio R. Linero

Department of Statistics, Florida State University, 600 W. College Avenue, Tallahassee, FL 32306, USA. E-mail: arlinero@stat.fsu.edu
Received October 7, 2017; Revised November 17, 2017; Accepted November 17, 2017.
Abstract

Tree-based regression and classification ensembles form a standard part of the data-science toolkit. Many commonly used methods take an algorithmic view, proposing greedy methods for constructing decision trees; examples include the classification and regression trees algorithm, boosted decision trees, and random forests. Recent history has seen a surge of interest in Bayesian techniques for constructing decision tree ensembles, with these methods frequently outperforming their algorithmic counterparts. The goal of this article is to survey the landscape surrounding Bayesian decision tree methods, and to discuss recent modeling and computational developments. We provide connections between Bayesian tree-based methods and existing machine learning techniques, and outline several recent theoretical developments establishing frequentist consistency and rates of convergence for the posterior distribution. The methodology we present is applicable for a wide variety of statistical tasks including regression, classification, modeling of count data, and many others. We illustrate the methodology on both simulated and real datasets.

Keywords: Bayesian additive regression trees, boosting, random forests, semiparametric Bayes
1. Introduction

Tree-based regression and classification ensembles form a standard part of the data-science toolkit. Consider the generic problem of estimating the distribution f (y | x) of some response Y conditional on a predictor X. Tree-based methods recursively partition the predictor space to obtain a decision tree T such that the information in X about Y is contained entirely in which leaf node X falls in; this process is illustrated in Figure 1 when X ∈ [0, 1]². Once the tree is constructed, we associate to each leaf node η a parameter θη and set f (y | x) = f (y | θη(x)), where η(x) denotes the leaf node associated to x. For example, we might obtain a semiparametric regression model by setting $Y~Normal(μ_{η(X)},σ_{η(X)}^2)$.

Decision trees have attracted the interest of practitioners due to their strong empirical performance, and many algorithmic methods for constructing decision trees exist (Breiman et al., 1984; Kass, 1980; Quinlan, 1993). An essential fact about decision trees is that a given dataset can typically be described well by many different tree structures. This causes greedily-constructed trees to be very unstable, with small perturbations of the data leading to vastly different tree topologies. This motivates methods based on ensembles of decision trees, such as bagging (Breiman, 1996) and random forests (Breiman, 2001), which increase estimation stability by averaging predictions over the bootstrap distribution of the decision tree (Efron and Gong, 1983).

Bayesian approaches fundamentally differ from these algorithmic approaches in that they posit a full probability model for the data, combining a prior distribution for the tree T and leaf parameters Θ_T with a likelihood for the data D = {(Xi, Yi) : 1 ≤ i ≤ N}. This approach falls in the framework of Bayesian nonparametrics/semiparametrics (see MacEachern, 2016 for a review). This viewpoint allows for natural uncertainty quantification using the posterior distribution π(T, Θ_T ∣ D). Furthermore, with no additional effort, the Bayesian approach naturally has many of the features of state-of-the-art tree ensembling methods. For example, the process of averaging predictions across samples from the posterior distribution is similar to the averaging which occurs in the random forests algorithm, naturally leading to stability in inferences.

Bayesian decision tree methods have experienced rapid development in recent years, and the purpose of this article is to offer a timely review. We first provide a basic review of the Bayesian classification and regression trees (BCART) and Bayesian additive regression trees (BART) methods, and provide an overview of the techniques used to fit them. We then make connections to other machine learning techniques, including random forests, boosting, and Gaussian process regression. Additionally, several recent works have provided theoretical justification for these methods by establishing optimal posterior convergence rates for variants of BCART (Rockova and van der Pas, 2017) and BART (Linero and Yang, 2017; Rockova and van der Pas, 2017); we review some of these developments.

2. Bayesian classification and regression trees

Our notation follows Chipman et al. (2013), who also provide a review of tree-based methods. Associated to a tree T we have a collection of internal (branch) nodes equipped with decision rules and a collection ℒ_T of leaf nodes. We let η(x) denote the unique leaf node associated to x. Associated to each leaf η ∈ ℒ_T is a parameter θη, and we write Θ_T = {θη : η ∈ ℒ_T}.

BCART was introduced by Chipman et al. (1998) and Denison et al. (1998), and provides the basis for most of the subsequent developments. To simulate from the BCART prior, one first draws a tree structure T from a prior π_T (see Section 2.1) and next samples, for each leaf η ∈ ℒ_T, a parameter vector θη. The response is then modeled as (Yi | Xi = x, T, Θ_T) ~ f (y | θ_{η(x)}) for some family of densities f (y | θ).

Example 1. [Semiparametric regression] Suppose Yi is continuous. We specify the model

$Y_i=μ(X_i)+ɛ_i, \qquad ɛ_i~Normal(0,σ(X_i)^2),$

where μ(x) = μ_{η(x)} and $σ^2(x)=σ_{η(x)}^2$. Additional flexibility can be obtained by considering a locally linear model Yi = α(Xi) + β(Xi)Xi + εi; similarly, Gramacy and Lee (2008) allow for the within-leaf models to be Gaussian process regressions (Rasmussen and Williams, 2006). The fits of these models to the lidar data described in Section 4.1 are given in

Example 2. [Poisson regression for count data] Suppose Yi is a count. As an alternative to a Poisson log-linear model, we set Yi ~ Poisson(λη(Xi)). For an application of this type of model to overdispersed count data, see Section 4.2.

Example 3. [Survival analysis with non-proportional hazards] Suppose Yi is a survival time, potentially subject to right-censoring. We set Yi ~ S_{η(Xi)}, where the survival function S_η is modeled nonparametrically given, e.g., a Pólya tree prior (Muliere and Walker, 1997).

The above approach extends naturally to other semiparametric models; for example, one obtains a natural extension to a generic exponential family f (y | θ) = exp {yθ − B(θ) + C(y)}. The computational algorithms described in Section 3 are applicable whenever θη is given a prior for which the marginal likelihood can be efficiently computed. Chipman et al. (1998) and Denison et al. (1998) consider, in particular, the classification problem in which f (y | x) is categorical with Y ~ Categorical(θ_{η(X)}), where θη ~ Dirichlet(α1, . . . , αC) is a leaf-specific distribution over categories. These extensions can be viewed as semiparametric competitors to classical generalized linear models (McCullagh, 1984) or generalized additive models (Hastie and Tibshirani, 1987). An attractive feature of these models is that they naturally allow for the inclusion of complex interactions between predictors.

### 2.1. Priors on tree structures

We consider priors on (T, Θ_T) of the form

$π(T,Θ_T)=π_T(T)·∏_{η∈ℒ_T}π_θ(θ_η).$

That is, the θη’s are conditionally independent given the tree structure T.

We require a prior distribution for the tree structure T. A common choice is to use a branching process prior, described initially by Chipman et al. (1998). For each node η in the tree, we split η with prior probability q(η) = γ(1 + D(η))^{−β}, where D(η) denotes the depth of η (the root of the tree has depth D(η) = 0). The parameters (γ, β) control the shape of the trees: γ > 0 controls the prior probability that the tree is empty (and hence is typically large), while β > 0 penalizes trees which are too deep.

Once the tree topology is generated, we associate to each branch η a splitting rule of the form [x_{j(η)} ≤ Cη]. We set $j(η)~indep Categorical(s)$, where s = (s1, . . . , sP) is a probability vector and P is the dimension of the predictor Xi; most implementations set sj = P^{−1}, although this choice is not ideal when the predictors do not have equal importance (Linero, 2016). Conditional on j(η) = j, we take Cη ~ G̃j, where G̃j is the distribution Gj restricted to the set of values which do not lead to logically empty nodes when variable j is split on (if no such split exists, another predictor is selected according to s). In most publicly available software, Gj is the empirical distribution of (Xij : 1 ≤ i ≤ N).
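As a concrete sketch of this prior, the snippet below (our own illustration, not any package's implementation; the Uniform(0, 1) cutpoint distribution and function names are assumptions made for simplicity) samples tree structures from the branching process with split probability q(η) = γ(1 + D(η))^(−β):

```python
import random

def sample_tree(depth=0, gamma=0.95, beta=2.0, P=5):
    """Sample a tree topology from the branching-process prior: a node at
    depth d splits with probability gamma * (1 + d) ** (-beta); on a split
    we draw a predictor j ~ Categorical(s) (uniform here) and a cutpoint
    C ~ Uniform(0, 1), then recurse on the two children."""
    if random.random() < gamma * (1 + depth) ** (-beta):
        return {"j": random.randrange(P),       # split variable, s_j = 1/P
                "C": random.random(),           # cutpoint, G_j = Uniform(0, 1)
                "left": sample_tree(depth + 1, gamma, beta, P),
                "right": sample_tree(depth + 1, gamma, beta, P)}
    return {"leaf": True}                       # theta_eta would be drawn here

def n_leaves(tree):
    if "leaf" in tree:
        return 1
    return n_leaves(tree["left"]) + n_leaves(tree["right"])

random.seed(1)
trees = [sample_tree() for _ in range(2000)]
# beta > 0 penalizes depth, so the sampled trees are shallow on average.
avg_leaves = sum(n_leaves(t) for t in trees) / len(trees)
```

With these default (γ, β) values the prior concentrates on trees with only a handful of leaves, which is exactly the regime BART exploits in Section 2.2.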

There are several alternatives in the literature to the prior described above, differing in how the topology of T is generated. Denison et al. (1998) suggest a hierarchical prior in which a prior is placed on the number of leaf nodes, L = |ℒ_T| ~ 1 + Poisson(λ), and, conditional on L, a uniform distribution is placed on the tree topologies with L leaves. Wu et al. (2007) construct a “pinball” prior on the topology of T, which gives finer control over the shape of the resulting trees. Finally, Roy and Teh (2009) introduced the Mondrian process, which gives an elegant joint prior on all components of T (more precisely, the Mondrian process is a stochastic process {T_t : t > 0}, with larger values of t corresponding to finer partitions). Lakshminarayanan et al. (2014) leverage the self-consistency properties of the Mondrian process to construct probabilistic alternatives to random forests which allow online implementations.

### 2.2. Bayesian additive regression trees

The BART framework introduced by Chipman et al. (2010) replaces the single tree T with a sum of trees

$μ(x)=∑_{t=1}^{T}g(x;T_t,Θ_t),$

where g(x; T_t, Θ_t) = μ_{η(t,x)}, η(t, x) = η if x is associated to leaf node η in tree T_t, and Θt = {μη : η ∈ ℒ_{T_t}}. The tree structures are then given independent priors as described in Section 2.1. The leaf parameters μη are given iid $Normal(0,σ_μ^2/T)$ priors, with the scaling factor T chosen so that $Var{μ(x)}=σ_μ^2$ irrespective of T. The BART model was initially motivated by analogy with boosting algorithms (Freund et al., 1999), which combine many so-called “weak learners” to construct predictions; the weak learners of choice in many boosting applications are shallow decision trees. By taking β large in the branching process prior, the BART model represents a Bayesian version of this idea with μ(x) represented as a sum of shallow decision trees.
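The variance-stabilizing role of the σμ²/T scaling is easy to check by simulation. The sketch below (our own illustration with σμ = 1, not taken from any BART implementation) draws μ(x) at a fixed x under the prior for several ensemble sizes:

```python
import random, math

def bart_prior_draw(T, sigma_mu=1.0):
    """One prior draw of mu(x) at a fixed x: x falls in one leaf of each of
    the T trees, and each leaf value is Normal(0, sigma_mu^2 / T)."""
    return sum(random.gauss(0.0, sigma_mu / math.sqrt(T)) for _ in range(T))

random.seed(0)
prior_var = {}
for T in (1, 10, 200):
    draws = [bart_prior_draw(T) for _ in range(20000)]
    prior_var[T] = sum(d * d for d in draws) / len(draws)
# Var{mu(x)} stays (approximately) sigma_mu^2 = 1 for every T.
```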

The BART algorithm applies most naturally to the semiparametric regression problem; repeated experience has shown that it typically outperforms BCART in this setting, and is typically competitive with state-of-the-art methods. Additionally, it is straightforward to extend BART to classification using the data augmentation strategy of Albert and Chib (1993). Much further work has gone into extending BART to various other problems, including survival analysis (Sparapani et al., 2016) and as an alternative to log-linear models (Murray, 2017). The strong predictive performance of BART has also led to its wide adoption for estimating average causal effects in causal inference problems (Hill, 2011).

BCART is more flexible as a framework than BART in the sense that it generalizes more easily to non-regression settings. Whereas BCART allows for essentially any type of object in the leaf nodes (scalars, probability vectors, survival functions, etc.), one must be able to add together the parameters in the leaf nodes of BART, limiting us to leaf parameters in ℝ (or possibly ℝD for some D). As suggested in Section 5.2, the BART model is perhaps best thought of as a high-quality drop-in replacement for Gaussian process priors (see Rasmussen and Williams, 2006, for a textbook level treatment).

Because the estimates of BART correspond to sums of step functions, BART produces estimates which are not smooth. To address this issue, Linero and Yang (2017) proposed smoothed Bayesian additive regression trees (SBART) which use so-called “soft” decision trees (Irsoy et al., 2012) in constructing the ensemble; rather than going left or right down the tree, the tree assigns differing weights to the left and right paths according to how close the observation is to the cutpoint. As we show in illustrations here, this additional smoothing usually produces better estimates when the underlying regression function is smooth. The cost of this additional flexibility is an increase in computational effort.

### 2.3. Variable selection and automatic relevance determination

Bayesian tree-based models ostensibly conduct model selection through which variables are used to construct splits. As noted by Linero (2016), however, Bayesian tree-based methods are not immediately applicable for variable selection due to the tendency for spurious predictors to be selected once the number of branches becomes sufficiently large. This issue is most obvious in the case of BART, which naturally produces ensembles with many branches, but is also expected to occur with BCART when sufficiently deep trees are required. One attractive option to conduct variable selection is to place a sparsity-inducing Dirichlet prior on the vector of splitting proportions

$s~Dirichlet(α/P,…,α/P).$

Linero (2016) shows that, given that the ensemble includes B branch nodes, the number of predictors Q included in the model has an approximate 1 + Poisson(θ) distribution, where $θ=∑i=1Bα/(α+i)$. In addition to providing variable selection, the Dirichlet prior also functions as an automatic relevance determination prior (Neal, 1995), allowing the model to adaptively learn the relative importance of each predictor.
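This approximation is easy to probe numerically. The sketch below (our own illustration using only Python's standard library; the Dirichlet draw is built from normalized Gamma variates) simulates the number of distinct predictors used by B branches and compares its average to 1 + θ:

```python
import random
from itertools import accumulate

def n_predictors_used(B, P, alpha, rng):
    """Draw s ~ Dirichlet(alpha/P, ..., alpha/P) via normalized Gamma
    variates, assign each of B branches a predictor j ~ Categorical(s),
    and count the distinct predictors that appear."""
    g = [rng.gammavariate(alpha / P, 1.0) for _ in range(P)]
    cw = list(accumulate(g))  # unnormalized cumulative weights
    picks = rng.choices(range(P), cum_weights=cw, k=B)
    return len(set(picks))

rng = random.Random(42)
B, P, alpha = 50, 1000, 1.0
sims = [n_predictors_used(B, P, alpha, rng) for _ in range(2000)]
mean_Q = sum(sims) / len(sims)
# Per the approximation above, Q is roughly 1 + Poisson(theta), so
# E[Q] should be close to 1 + theta.
theta = sum(alpha / (alpha + i) for i in range(1, B + 1))
```

The sparsity is visible directly: even with P = 1000 candidate predictors and 50 branches, only a handful of distinct predictors are typically used.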

3. Computational details

Constructing efficient algorithms for exploring the posterior is perhaps the biggest standing problem in the widespread adoption of Bayesian methods for decision trees. While Markov chain Monte Carlo (MCMC) is the most frequently used approach, we show in Section 4 that the widely available implementations of BCART generally perform poorly compared to recent particle filtering algorithms (Lakshminarayanan et al., 2013; Taddy et al., 2011). We review the MCMC methods used to fit BCART before discussing more recent methods based on sequential Monte Carlo. We then discuss approaches for fitting BART.

### 3.1. Markov chain Monte Carlo for Bayesian classification and regression trees

Consider first BCART under a prior respecting (2.2). The posterior can be written

$π_T(T∣D)=\frac{π_T(T)}{Z(D)}∏_{η∈ℒ_T}∫∏_{i:η(X_i)=η}f(Y_i∣θ_η)\,dπ_θ(θ_η)=\frac{π_T(T)\,m(D∣T)}{Z(D)}.$

Efficient inference requires that the marginal likelihood m(D ∣ T) can be computed efficiently; the normalizing constant Z(D) is not required for further computations.

Example 4. [Poisson regression] Suppose (Yi | Xi = x, T, Θ_T) ~ Poisson(λ_{η(x)}), and that the leaf parameters are given iid priors (λη ∣ T) ~ Gam(a, b). Then

$m(D∣T)=\frac{b^{aL}}{Γ(a)^{L}∏_{i=1}^{N}Y_i!}∏_{η∈ℒ_T}\frac{Γ(a+S_η)}{(b+n_η)^{a+S_η}},$

where L = |ℒ_T|, nη = |{i : η(Xi) = η}|, and Sη = ∑_{i:η(Xi)=η} Yi.
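This closed form is easy to verify numerically. The sketch below (our own check, not from any package) compares the single-leaf contribution to a brute-force midpoint-rule integration over λ:

```python
import math

def leaf_log_marginal(y, a, b):
    """Closed-form log contribution of one leaf to m(D|T) under
    Y_i ~ Poisson(lambda), lambda ~ Gam(a, b) (rate parameterization)."""
    n, S = len(y), sum(y)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + S) - (a + S) * math.log(b + n)
            - sum(math.lgamma(yi + 1) for yi in y))

def leaf_log_marginal_grid(y, a, b, upper=40.0, step=1e-3):
    """Brute-force check: integrate the Poisson likelihood against the
    Gamma(a, b) prior on a fine grid."""
    n, S = len(y), sum(y)
    log_const = -sum(math.lgamma(yi + 1) for yi in y)
    total, lam = 0.0, step / 2
    while lam < upper:
        log_lik = S * math.log(lam) - n * lam
        log_prior = (a * math.log(b) - math.lgamma(a)
                     + (a - 1) * math.log(lam) - b * lam)
        total += math.exp(log_lik + log_prior) * step
        lam += step
    return log_const + math.log(total)

y, a, b = [3, 1, 2], 2.0, 1.0
lm_closed = leaf_log_marginal(y, a, b)
lm_numeric = leaf_log_marginal_grid(y, a, b)
```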

Example 5. [Gaussian homoskedastic regression] Suppose Yi = μ(Xi) + εi where μ(x) = μ_{η(x)} and $ɛ_i~iid Normal(0,σ^2)$. Suppose that $(μ_η∣T)~iid Normal(m,σ_μ^2)$. Then

$m(D∣T)=∏_{η∈ℒ_T}(2πσ^2)^{-n_η/2}\sqrt{\frac{σ^2}{σ^2+n_ησ_μ^2}}\exp\left\{-\frac{\mathrm{SSE}(η)}{2σ^2}-\frac{n_η(\bar{Y}_η-m)^2}{2(n_ησ_μ^2+σ^2)}\right\},$

where, using the same notation as in the Poisson setting, Ȳη = Sη/nη and SSE(η) = ∑_{i:η(Xi)=η}(Yi − Ȳη)².
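As in the Poisson case, this marginal likelihood can be sanity-checked against direct numerical integration over μ (again our own illustration, with arbitrary example values):

```python
import math

def leaf_log_marginal(y, sigma2, m, sigma_mu2):
    """Closed-form log contribution of one leaf: Y_i ~ N(mu, sigma2),
    mu ~ N(m, sigma_mu2) integrated out."""
    n = len(y)
    ybar = sum(y) / n
    sse = sum((yi - ybar) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            + 0.5 * math.log(sigma2 / (sigma2 + n * sigma_mu2))
            - sse / (2 * sigma2)
            - 0.5 * n * (ybar - m) ** 2 / (n * sigma_mu2 + sigma2))

def leaf_log_marginal_grid(y, sigma2, m, sigma_mu2,
                           lo=-20.0, hi=20.0, step=1e-3):
    """Brute-force midpoint-rule integration of the same quantity over mu."""
    total, mu = 0.0, lo + step / 2
    while mu < hi:
        log_lik = sum(-0.5 * math.log(2 * math.pi * sigma2)
                      - (yi - mu) ** 2 / (2 * sigma2) for yi in y)
        log_prior = (-0.5 * math.log(2 * math.pi * sigma_mu2)
                     - (mu - m) ** 2 / (2 * sigma_mu2))
        total += math.exp(log_lik + log_prior) * step
        mu += step
    return math.log(total)

y = [1.2, 0.8, 1.5, 0.9]
log_m_exact = leaf_log_marginal(y, 1.0, 0.0, 4.0)
log_m_grid = leaf_log_marginal_grid(y, 1.0, 0.0, 4.0)
```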

With the marginal likelihood in hand, one then typically samples approximately from π_T(T ∣ D) using the Metropolis-Hastings algorithm (Hastings, 1970), which uses a proposal distribution q(T → T′) to construct a Markov chain T^(1), T^(2), . . . with invariant distribution π_T(T ∣ D). If T^(m) is the state of the chain at iteration m, we (i) sample T′ ~ q(T^(m) → ·) and (ii) set T^(m+1) = T′ with probability

$a(T^{(m)}→T′)=\min\left\{\frac{π(T′∣D)\,q(T′→T^{(m)})}{π(T^{(m)}∣D)\,q(T^{(m)}→T′)},\,1\right\},$

and set T^(m+1) = T^(m) otherwise. The difficulty lies in constructing useful proposal distributions q(T → T′). This is because there are typically a large number of structurally dissimilar trees with non-negligible posterior probability, separated by decision trees with low posterior probability. Hence, if q places its mass on trees which are structurally similar to the current state, one expects the sampler to get stuck in a local mode of π_T(T ∣ D). Works addressing the construction of proposal distributions to use with Metropolis-Hastings include Chipman et al. (1998), Pratola (2016), Wu et al. (2007), and Lakshminarayanan et al. (2015).

The proposal distribution q(T → T′) is constructed as a mixture over a collection of possible modifications to T. The following modifications, proposed by Chipman et al. (1998), are widely used in implementations; they all suffer from the same problem of being “local” modifications to T.

• Grow: Randomly select a leaf η ∈ ℒ_T and turn it into a branch node with two leaf children, sampling the predictor and cutpoint for the new splitting rule from the prior.

• Prune: Randomly select a branch ν whose children are both leaves (i.e., a branch without grandchildren) and make ν a leaf by deleting its children.

• Change: Randomly select a branch η and change its decision rule [xj ≤ C] by sampling (j, C) according to the prior.

To complement these local moves, Wu et al. (2007) propose a non-local radical restructure move, which proposes a tree T′ whose induced partition of the data is the same as that of the current tree; hence, one moves to a very different tree structure while simultaneously preserving the likelihood of the data. Pratola (2016) develops global Metropolis-Hastings proposals by constructing tree-rotation operators on the space of trees, and further develops a generalization of the “change” rule. These additional moves can provide massive improvements in mixing of the Markov chain, but unfortunately we are unaware of any publicly available software implementations.
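The accept/reject mechanics themselves are generic. The following toy sketch (our own illustration on a five-state space standing in for a small set of trees; it is not a tree sampler) shows the Metropolis-Hastings recipe recovering a target distribution using only a local, symmetric proposal:

```python
import math, random

def metropolis_hastings(log_post, propose, log_q, x0, n_iter, rng):
    """Generic Metropolis-Hastings: propose x' ~ q(x -> .), accept with
    probability min{1, pi(x') q(x' -> x) / (pi(x) q(x -> x'))}."""
    x, chain = x0, []
    for _ in range(n_iter):
        xp = propose(x, rng)
        log_a = (log_post(xp) + log_q(xp, x)) - (log_post(x) + log_q(x, xp))
        if rng.random() < math.exp(min(0.0, log_a)):
            x = xp
        chain.append(x)
    return chain

# Toy posterior over five "trees" (a stand-in for pi(T | D)).
weights = [1.0, 2.0, 4.0, 2.0, 1.0]
log_post = lambda x: math.log(weights[x])
propose = lambda x, rng: (x + rng.choice([-1, 1])) % 5  # local, symmetric move
log_q = lambda a, b: 0.0                                # symmetric: q cancels

rng = random.Random(7)
chain = metropolis_hastings(log_post, propose, log_q, 2, 100000, rng)
freq = [chain.count(k) / len(chain) for k in range(5)]
target = [w / sum(weights) for w in weights]
```

On a real tree space the proposal would mix grow, prune, and change moves, and the log-posterior would be log π_T(T) + log m(D ∣ T).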

### 3.2. Sequential Monte Carlo

Several authors have proposed sequential Monte Carlo (SMC) algorithms to approximate the posterior distribution, bypassing MCMC. By taking advantage of particle systems which evolve (almost) in parallel, they are also able to explore different modes of the posterior. Additionally, SMC methods may be easier to implement in practice than MCMC methods. The strategies described here are based on a generic particle filtering approach; see Cappé et al. (2007) for a review.

To apply SMC it is helpful to introduce auxiliary variables so that we can view the model as a state-space model. Lakshminarayanan et al. (2013) do this by considering T^(t) to be the state of the tree under the branching process prior described in Section 2.1, truncated to t steps. At each stage of the particle filter the density proportional to π(T^(t)) m(D ∣ T^(t)) is targeted, where m(D ∣ T^(t)) represents the marginal likelihood of the data given the partial tree T^(t). Given a proposal kernel q(T^(t−1) → T^(t)) (possibly depending on D), this leads to the sequential approximation (with $w_m^{(0)}≡1$)

$h(T^{(t)})≈\frac{1}{∑_{m=1}^{M}w_m^{(t)}}∑_{m=1}^{M}w_m^{(t)}δ_{T_m^{(t)}}, \qquad w_m^{(t)}∝w_m^{(t-1)}\frac{π(T^{(t-1)}→T^{(t)})\,m(D∣T^{(t)})}{q(T^{(t-1)}→T^{(t)})\,m(D∣T^{(t-1)})},$

where δ_T denotes a point-mass distribution at T. To prevent weight degeneracy, one may resample M particles from this approximation and set $w_m^{(t)}←∑_m w_m^{(t)}/M$. There are several possibilities for choosing the proposal q, but Lakshminarayanan et al. (2013) recommend simply using the prior transition π(T^(t−1) → T^(t)) and choosing M large.

Taddy et al. (2011) take a different approach, directly using a dynamic model for observations i = 1, . . . , N in which an evolving predictive distribution f (Yi | Xi, D^(i−1), T^(i−1)) is specified, where $D^{(n)}=\{(X_i,Y_i)\}_{i=1}^n$. The trees are assumed to evolve according to a Markov process (growing, pruning, or staying the same). This algorithm leads to weights of the form $w_m^{(i)}=w_m^{(i-1)}f(Y_i∣X_i,D^{(i-1)},T^{(i-1)})$ in (3.1), and we note that Taddy et al. (2011) recommend resampling steps after every update.
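The weight-then-resample cycle underlying both approaches can be sketched in a few lines. The example below (our own minimal illustration with a Gaussian toy model, not the tree-valued samplers of either paper) reweights prior draws by a likelihood and then resamples:

```python
import math, random

def resample(particles, weights, rng):
    """Multinomial resampling: draw M particles with probabilities
    proportional to the weights, then reset every weight to 1/M."""
    M = len(particles)
    total = sum(weights)
    new = rng.choices(particles, weights=[w / total for w in weights], k=M)
    return new, [1.0 / M] * M

rng = random.Random(3)
M = 50000
# "Prior" draws play the role of particles; one weighting step uses the
# likelihood of a single observation y under N(y | x, 1).
particles = [rng.gauss(0.0, 1.0) for _ in range(M)]
y = 1.0
weights = [math.exp(-0.5 * (y - x) ** 2) for x in particles]
particles, weights = resample(particles, weights, rng)
post_mean = sum(particles) / M  # the exact posterior here is N(0.5, 0.5)
```

In the tree setting the particles are partial trees, and the incremental weight is the marginal-likelihood ratio displayed in (3.1).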

### 3.3. Bayesian additive regression trees and Bayesian backfitting

BART is fit by combining the Metropolis-Hastings schemes in Section 3.1 with a so-called Bayesian backfitting algorithm (Hastie and Tibshirani, 2000). The use of shallow trees by BART has a large computational payoff, as the “local moves” described in Section 3.1 are much more effective at exploring the posterior of shallow trees. Consider the regression setting with $Y_i=∑_{t=1}^{T}g(X_i;T_t,Θ_t)+ɛ_i$. The Bayesian backfitting algorithm proceeds by Gibbs sampling, updating (T_t, Θ_t) conditional on (T_{−t}, Θ_{−t}), where T_{−t} = (T_j : j ≠ t) and Θ_{−t} = (Θ_j : j ≠ t). The term “backfitting” comes from the observation that one can write

$R_{it}=g(X_i;T_t,Θ_t)+ɛ_i, \qquad \text{where } R_{it}=Y_i-∑_{j≠t}g(X_i;T_j,Θ_j).$

After conditioning on (T_{−t}, Θ_{−t}), this reduces to the BCART problem for regression, with the pseudo-response Rit playing the role of Yi; hence, the MCMC strategies in Section 3.1 can be used with Rit in place of Yi.
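One Gibbs sweep of this scheme can be sketched as follows (our own illustration: `fit_stump`, which simply fits the mean of the residuals, is a hypothetical stand-in for a full BCART Metropolis-Hastings update):

```python
import random

def backfit_sweep(X, Y, trees, fit_tree, rng):
    """One Gibbs sweep of Bayesian backfitting: each tree t is refit to the
    partial residuals R_it = Y_i - sum_{j != t} g(X_i; T_j, Theta_j)."""
    T = len(trees)
    preds = [[trees[j](x) for j in range(T)] for x in X]
    for t in range(T):
        R = [Y[i] - (sum(preds[i]) - preds[i][t]) for i in range(len(Y))]
        trees[t] = fit_tree(X, R, rng)  # stand-in for a BCART update on R
        for i, x in enumerate(X):
            preds[i][t] = trees[t](x)
    return trees

def fit_stump(X, R, rng):
    """Hypothetical stand-in learner: a 'tree' that predicts mean(R)."""
    m = sum(R) / len(R)
    return lambda x, m=m: m

rng = random.Random(0)
X = [[i / 20.0] for i in range(20)]
Y = [3.0 + 0.1 * rng.gauss(0.0, 1.0) for _ in X]
trees = [lambda x: 0.0 for _ in range(5)]
for _ in range(10):
    trees = backfit_sweep(X, Y, trees, fit_stump, rng)
pred = sum(tree(X[0]) for tree in trees)  # ensemble prediction at X[0]
```

With the mean-fitting stand-in the ensemble converges immediately to the sample mean of Y; in BART each `fit_tree` call is instead a stochastic BCART update using the Gaussian marginal likelihood of Example 5.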

4. Illustrations

### 4.1. Nonparametric regression and classification

We now compare and contrast the various methods presented here on both simulated and real datasets. First we consider the lidar dataset, which is available in the package SemiPar in R (Wand, 2014) and consists of measurements from a light detection and ranging experiment; the goal is to use the distance light travels before being reflected to predict the ratio of received light from two laser sources. This data is amenable to the semiparametric regression model (2.1). Because the predictor is one-dimensional, it allows for easy visual comparisons of methods. We consider the BCART model given in (2.1), the BART model (2.3), as well as the dynamic regression trees model of Taddy et al. (2011). We also consider the smoothed variant SBART of BART described in Section 2.2, the random forests algorithm of Breiman (2001) as implemented in the grf package, and the tree boosting algorithm in the package xgboost. Fits to the lidar dataset are given in Figure 3.

From Figure 3, we see that BCART and dynamic trees induce the least amount of smoothing, and essentially produce step functions. For comparison, the random forests algorithm produces a smoother estimate. BART smooths the data more than BCART, and produces a fit which is very similar to gradient boosting and random forests. The smoothest fit is given by SBART, which takes advantage of an assumed continuity of the regression curve.

Our experience is that the additional smoothing obtained by random forests, BART, gradient boosting, and SBART almost always leads to improved performance. Moreover, it is not clear if the lack of smoothness in BCART and dynamic trees is due to fundamental properties of the prior or is merely a symptom of inaccuracies in the Monte Carlo approximations. We have found dynamic trees to generally perform better than BCART, and speculate that this is due to the use of particle filtering rather than MCMC to fit the model.

To see more clearly the potential benefits of smoothing, we consider

$Y_i=μ(X_i)+ɛ_i, \qquad μ(x)=10\sin(πx_1x_2)+20(x_3-0.5)^2+10x_4+5x_5,$

where Xi ~ Uniform([0, 1]^{10}) and $ɛ_i~iid Normal(0,σ^2)$. This function was introduced by Friedman (1991) to illustrate the use of multivariate adaptive regression splines. We consider σ² = 1 and a sample size of N = 250, and evaluate methods by integrated root mean squared error, RMSE = {∫(μ(x) − μ̂(x))² dx}^{1/2}, where μ̂(x) is an estimate of μ(x) produced by each method. Results on 30 replications are given in Figure 4. In addition to the methods used on the lidar data, we also consider the treed Gaussian process described in Example 1.

For the Friedman function (4.1), Figure 4 demonstrates that smoother methods typically do better. The best performing method is SBART, followed by treed Gaussian processes and BART. BCART performs the worst; however, the fact that it is outperformed substantially by dynamic trees suggests that this may be due to poor mixing of the MCMC scheme.

Finally, we consider the performance of these methods on the Wisconsin breast-cancer detection dataset. Given P = 10 features, the goal is to classify a tumor as benign or malignant. This dataset is available at the UCI machine learning repository (Lichman, 2013). We performed 10-fold cross validation on this dataset, and evaluated methods according to the 0–1 loss L(Y, Ŷ) = I(Y ≠ Ŷ). We did not consider SBART or treed Gaussian processes on this dataset due to the lack of software available to fit these methods on classification data. Results on the breast cancer dataset are given in Figure 5, and the trends are essentially the same as in the semiparametric regression setting. The primary difference is that random forests outperforms the Bayesian methods, with the out-of-fold average loss being 0.026, compared to 0.032 for BART and 0.037 for dynamic trees.

### 4.2. Overdispersed count data with the Yelp dataset

We consider the Yelp dataset, which consists of the reviews for restaurants in the city of Phoenix. This data is available publicly at https://www.yelp.com/dataset/challenge. We examine the number of reviews Yi which are recorded for restaurant i. Intuitively, one might expect reviews for restaurant i to arrive roughly according to a Poisson process, leading to Yi ~ Poisson(λi).

As restaurants are generally quite heterogeneous, one expects the λi’s to vary considerably. One mode of variability is the spatial location of the restaurant, with restaurants situated in more populated areas expected to accrue more reviews. In addition, unmeasured restaurant-specific variables might influence the restaurant-specific rate λi. To get a sense of this, Figure 6 displays a histogram of the Yi’s, and we note that the data is highly over-dispersed.

We set λi = δi θ_{η(Xi)}, where Xi consists of the latitude and longitude of restaurant i. The model δi ~ Gam(a, b) leads to a negative-binomial model within each leaf, with likelihood

$∏_{i:η(X_i)=η}\frac{Γ(a+Y_i)}{Y_i!\,Γ(a)}ν_η^{Y_i}(1-ν_η)^{a}=\frac{∏_{i:η(X_i)=η}Γ(a+Y_i)}{Γ(a)^{n_η}∏_{i:η(X_i)=η}Y_i!}ν_η^{S_η}(1-ν_η)^{an_η},$

where νη = θη/(θη + b) is the success probability of a negative binomial distribution with shape parameter a. This likelihood is conjugate to the Beta(α, β) distribution for νη; using this prior, we obtain the marginal likelihood

$m(D∣T)=∏_{η∈ℒ_T}\frac{∏_{i:η(X_i)=η}Γ(a+Y_i)}{Γ(a)^{n_η}∏_{i:η(X_i)=η}Y_i!}·\frac{Γ(α+β)}{Γ(α)Γ(β)}·\frac{Γ(α+S_η)Γ(β+an_η)}{Γ(α+β+S_η+an_η)},$

where Sη = ∑_{i:η(Xi)=η} Yi and nη = |{i : η(Xi) = η}|. The right panel of Figure 6 displays the fit of this model to the data when each observation is assigned its own leaf, with α = 0.65 and (β, a) chosen by solving the moment-matching equations

$μ=\frac{aα}{β-1}, \qquad σ^2=\frac{μ^2+μ}{β-2},$

where σ2 is the empirical variance and μ is the empirical mean of the data (leading to a = 168 and β = 2.4). We see that the negative binomial model is capable of accurately modeling the marginal distribution of Y. Accordingly, we set (a, α, β) = (168, 0.65, 2.4) when applying BCART.

Figure 7 displays the fit of BCART to the Yelp data. The BCART model effectively de-noises the raw data and reveals areas which are more likely to have a high number of reviews.

5. Connections to machine learning algorithms

Tree-based Bayesian methods possess a number of interesting connections to their algorithmic cousins. We describe points of contact between Bayesian methods and conventional machine learning methods. In addition to the obvious connections between BCART and random forests and connections between BART and boosting, we provide less well-known connections between BART and kernel-based methods, focusing in particular on Gaussian process regression.

### 5.1. Connections to random forests

BCART is similar operationally to random forests. For regression tasks, both methods make predictions of the form

$\hat{μ}(x)=E\{g(x;T,Θ_T)\},$

where E is the expectation operator associated to (T, Θ_T) ~ ℙ for some probability distribution ℙ. Further, both approximate the expectation in (5.1) through Monte Carlo, using $\hat{μ}(x)=M^{-1}∑_{m=1}^{M}g(x;T^{(m)},Θ^{(m)})$ to produce predictions, where the (T^(m), Θ^(m)) are (ideally) sampled from ℙ. The random forests algorithm samples (T^(m), Θ^(m)) iid from ℙ, while BCART instead samples from a Markov chain with invariant distribution ℙ. Consequently, BCART and random forests have similar operating properties; for example, random forests performs better as the number of sampled trees M → ∞ for the same reason that longer Markov chains are preferred with BCART (adding more samples reduces the Monte Carlo error).

BCART and random forests differ in the distribution ℙ that they aim to sample from. A large advantage random forests have over BCART is that one can sample exactly from ℙ; random forests grow trees in a greedy fashion, with randomness coming from (i) bootstrap sampling the data used to construct the tree and (ii) random sub-sampling of features when constructing splits. BCART, by contrast, takes ℙ to be the posterior distribution , which cannot be sampled from exactly.

### 5.2. Connections to kernel-based methods

BART has an interesting connection to kernel-based estimators. Note that the BART model can equivalently be written

$g(x)=T^{-1/2}∑_{t=1}^{T}g(x;T_t,Θ_t), \qquad μ_η~indep Normal(0,σ_μ^2).$

The random variables {g(x; T_t, Θ_t) : t ≥ 1} are iid $Normal(0,σ_μ^2)$ random variables, implying $g(x)~Normal(0,σ_μ^2)$. More generally, the multivariate central limit theorem implies that, for any collection of design points (x1, . . . , xM), the vector z = (g(x1), . . . , g(xM)) has a limiting multivariate Gaussian distribution

$z→Normal(0,Σ), Σij=σμ2 Pr(xi~xj), (as T→∞),$

where [xi ~ xj] is the event that xi and xj are associated to the same terminal node in (the equality $Cov(g(xi),g(xj))=σμ2Pr(xi~xj)$ is easily checked). This suggests the following result.

Theorem 1. (Heuristic)

Consider the BART specification (2.3) with prior as described in Section 2.2, with a fixed prior on the tree structures (which is not assumed to be a branching process prior). Then, as T → ∞, {μ(x) : x ∈ 𝒳} converges to a Gaussian process with mean function m(x) = 0 and covariance function $Σ(x,x′)=σ_μ^2\Pr(x~x′)$.

“Theorem” 1 is heuristic because a formally correct statement requires technical conditions on the prior. The result is intuitively implied by the convergence of the finite-dimensional joint distributions noted above, but giving a formal statement and proof is an exercise in empirical process theory (Van der Vaart and Wellner, 1996) that we omit.

This connection to Gaussian processes is interesting from several standpoints. First, it provides a link between BART and kernel-based methods such as Gaussian process regression and support vector machines. Second, it formalizes several known connections between decision tree ensembling methods and kernel methods. For example, Scornet (2016) provides connections between random forests and kernel methods, which were known also to Breiman (2000). Remarkably, the kernels corresponding to BART, random forests, and Mondrian forests are very similar and often coincide; in particular, the kernels are similar to the exponential kernel described in the following proposition.

Proposition 1

Consider the BART model (2.3) constructed so that (i) for any η ∈ ℒt, the depth of η satisfies D(η) ~ Poisson(λ); (ii) for any η ∈ ℐt, j(η) ~ Categorical(s); (iii) given j(η), Cη ~ Uniform(0, 1) (allowing for logically empty nodes); (iv) given Tt, $μ_η~iid Normal(0,σ_μ^2)$. Then the associated kernel function $Σ(x,x′)=σ_μ^2\Pr(x~x′)$ is given by

$Σ(x,x′)=σ_μ^2\exp\left(-λ∑_{j=1}^{P}s_j|x_j-x_j'|\right).$

A very similar result is obtained by Balog et al. (2016) in the context of their Mondrian forests algorithm, giving some evidence that all of these methods are fundamentally making use of the same information about the dissimilarity of the points (x, x′) in Euclidean space. Balog et al. (2016) use this connection to approximate inference under the “ideal” kernel (5.2). We note that samples from the BART prior could also be used for such a purpose.
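In the one-predictor case the exponential kernel can be checked by direct simulation. The sketch below (our own illustration under the assumptions of Proposition 1, with P = 1 and σμ² = 1) estimates Pr(x ~ x′) by Monte Carlo and compares it to exp(−λ|x − x′|):

```python
import math, random

def same_leaf_prob(x, xp, lam, n_sim, rng):
    """Monte Carlo estimate of Pr(x ~ x') under the Proposition 1
    construction with P = 1: the leaf containing x sits at depth
    D ~ Poisson(lam), and x, x' share the leaf iff none of the D
    independent Uniform(0, 1) cutpoints falls between them."""
    lo, hi = min(x, xp), max(x, xp)
    hits = 0
    for _ in range(n_sim):
        # Sample D ~ Poisson(lam) by multiplying uniforms (Knuth's method).
        D, acc, thresh = 0, rng.random(), math.exp(-lam)
        while acc > thresh:
            D += 1
            acc *= rng.random()
        if all(not (lo < rng.random() <= hi) for _ in range(D)):
            hits += 1
    return hits / n_sim

rng = random.Random(11)
lam, x, xp = 2.0, 0.3, 0.7
est = same_leaf_prob(x, xp, lam, 100000, rng)
exact = math.exp(-lam * abs(x - xp))  # exponential kernel with sigma_mu^2 = 1
```

The agreement follows from E[(1 − |x − x′|)^D] = exp(−λ|x − x′|) when D ~ Poisson(λ), which is exactly the mechanism behind Proposition 1.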

A natural question one might ask is “given this connection, why not use Gaussian process regression directly?” First, Gaussian process regression does not scale favorably with the sample size; without making any approximations, training a Gaussian process regression model requires Θ(n³) computations. Furthermore, our experience is that the empirical performance of a minimally-tuned implementation of BART is frequently better than Gaussian process regression using the equivalent kernel Σ(x, x′). That is, BART provides better performance using less computation. We conjecture that the reason for BART outperforming Gaussian process regression is that limiting the number of trees in the ensemble allows one to learn a data-adaptive notion of distance between points.

### 5.3. Connections to boosting

The original motivation for the BART algorithm given by Chipman et al. (2010) was to provide a Bayesian analog to boosted decision trees (Freund et al., 1999). Boosted decision tree methods take essentially the same form as BART, setting $\hat{\mu}(x) = \sum_{t=1}^{T} g(x; \mathcal{T}_t, \Theta_t)$ and, like BART, a preference is given for the individual trees to be weak learners. Unlike random forests, ensembles of decision trees obtained from boosting are typically such that (i) the individual decision trees are shallow and (ii) no single decision tree is sufficient to provide an adequate fit to the data on its own.
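The additive weak-learner form can be made concrete with a toy greedy boosting loop over depth-one trees (stumps): each stump is fit to the current residuals and shrunk by a learning rate. This is a minimal sketch of the generic boosting recipe, not any particular production algorithm; the helper names and settings are invented for the example.

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-one regression tree (a stump) to residuals r by
    exhaustive search over axis-aligned splits."""
    best = None
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= c
            mu_l, mu_r = r[left].mean(), r[~left].mean()
            sse = ((r[left] - mu_l) ** 2).sum() + ((r[~left] - mu_r) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, c, mu_l, mu_r)
    return best[1:]

def predict_stump(stump, X):
    j, c, mu_l, mu_r = stump
    return np.where(X[:, j] <= c, mu_l, mu_r)

def boost(X, y, T=100, nu=0.1):
    """Greedy boosting: mu_hat(x) = sum_t nu * g_t(x), with each weak
    learner g_t fit to the residuals left by its predecessors."""
    stumps, r = [], y.copy()
    for _ in range(T):
        s = fit_stump(X, r)
        r = r - nu * predict_stump(s, X)
        stumps.append(s)
    return stumps

def predict(stumps, X, nu=0.1):
    return nu * sum(predict_stump(s, X) for s in stumps)

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 2))
y = X[:, 0] + rng.normal(scale=0.05, size=100)
ensemble = boost(X, y)
```

No single stump fits the data well here; it is the shrunken sum of many weak learners that produces an adequate fit, mirroring property (ii) above.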

BART differs from boosting in a number of ways. First, boosting algorithms are greedy, and do not update trees once they are included in the model. By contrast, BART makes use of a Bayesian backfitting algorithm which continuously updates all trees in the model. Second, boosting algorithms run the risk of overfitting the data as more trees are added, while BART is shielded from overfitting as more trees are added due to its link with Gaussian processes noted in Section 5.2. Third, the Bayesian paradigm also allows for automatic tuning of hyperparameters through the use of priors, while boosting is typically tuned by cross-validation.
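The backfitting idea can be illustrated with a deliberately simplified Gibbs sampler: the tree partitions are held fixed (real BART also proposes changes to tree structure via Metropolis-Hastings moves) and only the leaf means are resampled, each tree being updated against the partial residuals of the others. All settings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 200, 10
X = rng.uniform(size=n)
y = np.sin(6 * X) + rng.normal(scale=0.2, size=n)

# Simplification: each "tree" is a fixed single split of [0, 1], so only
# the two leaf means per tree are sampled; real BART also updates trees.
cuts = rng.uniform(0.2, 0.8, size=T)
leaf = np.stack([(X > c).astype(int) for c in cuts])  # (T, n) leaf indices

sigma2, sigma_mu2 = 0.2 ** 2, 1.0 / T
mu = np.zeros((T, 2))    # leaf means, two leaves per tree
fits = np.zeros((T, n))  # current fit of each tree

for it in range(500):
    for t in range(T):  # backfitting sweep: revisit every tree
        r = y - fits.sum(axis=0) + fits[t]  # partial residual for tree t
        for l in (0, 1):
            idx = leaf[t] == l
            m = idx.sum()
            var = 1.0 / (m / sigma2 + 1.0 / sigma_mu2)  # conjugate normal
            mean = var * r[idx].sum() / sigma2
            mu[t, l] = mean + np.sqrt(var) * rng.normal()
        fits[t] = mu[t, leaf[t]]

mu_hat = fits.sum(axis=0)  # one posterior draw of the ensemble fit
```

The key contrast with boosting is visible in the sweep: every tree is revisited at every iteration and refit against the residuals of the current ensemble, rather than being frozen once added.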

6. Frequentist properties of trees and tree ensembles

Beyond consistency, one may be interested in the convergence rate of the posterior. A natural target to aim for is the minimax rate for a particular estimation problem. For the semiparametric regression problem $Y_i = \mu_0(X_i) + \epsilon_i$ with $\epsilon_i \overset{\text{iid}}{\sim} \operatorname{Normal}(0, \sigma_0^2)$, the optimal rate of convergence is $\varepsilon_n = n^{-d/(2d + Q_0)}$ whenever $\mu_0(x)$ is d-times differentiable and depends on $Q_0 \le P$ predictors (more precisely, the assumption is that $\mu_0(x)$ is a d-Hölder function; see, e.g., Yang and Tokdar, 2015) where, for simplicity, P and Q0 are regarded as fixed. This problem was considered by Linero and Yang (2017), whose results imply that the BART posterior concentrates at the rate εn (up to logarithmic factors) whenever d ∈ (0, 1]. Independently, Rockova and van der Pas (2017), using a specifically chosen prior, established contraction of the BART and BCART posteriors at the rate εn (up to logarithmic factors) under essentially the same assumptions.

The limitation d ∈ (0, 1] described above is informative. When d = 1, μ0(x) is a Lipschitz-continuous function; d < 1 corresponds to fractionally smooth μ0(x); and d > 1 corresponds to functions which are smoother than Lipschitz-continuous. This suggests that BART and BCART can adapt to rough functions (d ≤ 1) but cannot take advantage of high-order smoothness in μ0(x). Linero and Yang (2017) show that this barrier can be overcome by smoothing the decision trees in BART, leading to the SBART model.

In view of Rockova and van der Pas (2017), there is no difference in convergence rate between BART and BCART for the problem of estimating a sparse function μ0(x); hence, the results described above do not fully account for the improved performance of BART relative to BCART. There are several practical reasons for expecting BART to outperform BCART. In particular, BART may outperform BCART for the following reasons: (i) BART uses shallow trees, so the mixing of the associated Markov chain is much better; (ii) realizations of BART are smoother, so that BART is better able to borrow information from neighboring points; and (iii) functions μ0(x) which arise in practice are more structured than merely being sparse, and BART takes advantage of this structure. We focus on (iii) here. Yang and Tokdar (2015) establish additional minimax rates of convergence when μ0(x) admits a sparse additive structure $\mu_0(x) = \sum_{v=1}^{V} \mu_{0v}(x)$. In particular, if $\mu_{0v}(x)$ is $\alpha_v$-smooth and depends on $Q_{0v}$ predictors, the minimax rate of convergence is obtained by summing the squared component-specific convergence rates, i.e., the rate is $\varepsilon_n^\star = \bigl\{ \sum_{v=1}^{V} n^{-2\alpha_v/(2\alpha_v + Q_{0v})} \bigr\}^{1/2}$. Comparing $\varepsilon_n^\star$ to $\varepsilon_n$, note that $\varepsilon_n^\star$ can be substantially smaller than $\varepsilon_n$ as long as the number of interactions between the predictors is small. If the true model is additive, for example, the rate $\varepsilon_n^\star$ is limited only by the number V of additive components. Linero and Yang (2017) and Rockova and van der Pas (2017) independently show that the BART posterior concentrates at the rate $\varepsilon_n^\star$ up to logarithmic factors (assuming αv ∈ (0, 1] for BART and αv > 0 for SBART). This result hinges crucially on the fact that BART decomposes additively; hence, the BCART posterior is not thought to concentrate at the rate $\varepsilon_n^\star$.
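As a concrete illustration of the gap between the two rates (the numbers here are chosen for the example, not taken from the cited papers), consider an additive truth with V = 5 univariate Lipschitz components:

```latex
% Additive truth with V = 5 components, each alpha_v = 1, Q_{0v} = 1:
\varepsilon_n^\star
  = \Bigl\{ \sum_{v=1}^{5} n^{-2 \cdot 1 / (2 \cdot 1 + 1)} \Bigr\}^{1/2}
  = \sqrt{5}\, n^{-1/3}.
% Ignoring additivity and treating mu_0 as a generic Lipschitz function
% of Q_0 = 5 predictors instead gives the much slower rate
\varepsilon_n = n^{-1/(2 \cdot 1 + 5)} = n^{-1/7}.
```

The polynomial improvement from $n^{-1/7}$ to $n^{-1/3}$ is exactly the benefit of exploiting additive structure that an additively-decomposing ensemble like BART can realize.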

7. Discussion

In this paper we have reviewed Bayesian tree-based models, focusing on recent developments in computation and modeling. We also outlined connections to more traditional machine learning algorithms and reviewed some exciting results on posterior consistency which fill a previously long-standing gap between theory and practice. Much work remains to be done; in particular, further improvements in computation would make these methods more feasible in structured regression problems where boosting algorithms currently dominate (Nielsen, 2016).

Also important to the dissemination of these methods is the development of software, and currently there are many options. The fits in this paper were obtained using the CRAN packages tgp for treed Gaussian processes, dbarts for BART, and dynaTree for dynamic trees, and the non-CRAN package SoftBart available at www.github.com/theodds/SoftBART. Other packages of note include bartMachine and BART for fitting BART models.

Acknowledgements

This research was partially supported by NSF grant DMS-1712870 and DOD grant SOT-FSU-FATS-16-06.

Figures
Fig. 1. Schematic depicting the construction of a decision tree (bottom) and the induced recursive partition associated with the construction of the tree (top). After the decision tree is constructed, each leaf node η is assigned a mean parameter μη.
Fig. 2. Fits of the Bayesian classification and regression trees model in which the within-leaf response mean is (left) a constant μη, (middle) a Gaussian process μη(x), and (right) a linear model μη(x) = αη + xβη.
Fig. 3. Fits of various methods to the dataset. BART = Bayesian additive regression trees; BCART = Bayesian classification and regression trees; SBART = smoothed Bayesian additive regression trees.
Fig. 4. Integrated RMSE under (4.1) across 30 independent replications, with n = 250,σ2 = 1. SBART = smoothed Bayesian additive regression trees; BART = Bayesian additive regression trees; BCART = Bayesian classification and regression trees; RMSE = root mean squared error.
Fig. 5. Error rates for the competing methods on the Wisconsin breast cancer dataset for each fold in the cross-validation experiment. BART = Bayesian additive regression trees; BCART = Bayesian classification and regression trees.
Fig. 6. Left: histogram of the raw data, where Y denotes the number of reviews a given restaurant has; the estimated mean μ and variance σ2 are also given. Right: kernel density estimator of the distribution of log(Y) (solid) overlaid with the distribution of log(Y) under the negative binomial model with α = 0.65 (dashed).
Fig. 7. Plots of the data. On the left, points are colored by the log of their mean response value according to the model, while on the right the points are colored by the log of their realized value. Figure created using the ggmap package (Kahle and Wickham, 2013).
References
1. Albert, JH, and Chib, S (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 88, 669-679.
2. Balog, M, Lakshminarayanan, B, Ghahramani, Z, Roy, DM, and Teh, YW 2016. The Mondrian kernel., Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, New York, pp.32-41.
3. Breiman, L (1996). Bagging predictors. Machine Learning. 24, 123-140.
4. Breiman, L (2000). Some infinity theory for predictor ensembles. Berkeley: Department of Statistics, University of California
5. Breiman, L (2001). Random forests. Machine Learning. 45, 5-32.
6. Breiman, L, Friedman, J, Stone, CJ, and Olshen, RA (1984). Classification and Regression Trees. Belmont: Wadsworth International Group
7. Cappé, O, Godsill, SJ, and Moulines, E (2007). An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE. 95, 899-924.
8. Chipman, H, George, EI, Gramacy, RB, and McCulloch, R (2013). Bayesian treed response surface models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 3, 298-305.
9. Chipman, HA, George, EI, and McCulloch, RE (1998). Bayesian CART model search. Journal of the American Statistical Association. 93, 935-948.
10. Chipman, HA, George, EI, and McCulloch, RE (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics. 4, 266-298.
11. Denison, DG, Mallick, BK, and Smith, AF (1998). A Bayesian CART algorithm. Biometrika. 85, 363-377.
12. Efron, B, and Gong, G (1983). A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician. 37, 36-48.
13. Freund, Y, Schapire, R, and Abe, N (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence. 14, 771-780.
14. Friedman, JH (1991). Multivariate adaptive regression splines. The Annals of Statistics. 19, 1-67.
15. Gramacy, RB, and Lee, HKH (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association. 103, 1119-1130.
16. Hastie, T, and Tibshirani, R (1987). Generalized additive models: some applications. Journal of the American Statistical Association. 82, 371-386.
17. Hastie, T, and Tibshirani, R (2000). Bayesian backfitting (with comments and a rejoinder by the authors). Statistical Science. 15, 196-223.
18. Hastings, WK (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 57, 97-109.
19. Hill, JL (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics. 20, 217-240.
20. Irsoy, O, Yildiz, OT, and Alpaydin, E 2012. Soft decision trees., Proceedings of the 21st International Conference on Pattern Recognition, Tsukuba, Japan, pp.1819-1822.
21. Kahle, D, and Wickham, H (2013). ggmap: spatial visualization with ggplot2. The R Journal. 5, 144-161.
22. Kass, GV (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics. 29, 119-127.
23. Lakshminarayanan, B, Roy, DM, and Teh, YW 2013. Top-down particle filtering for Bayesian decision trees., Proceedings of the International Conference on Machine Learning, Atlanta, GA, pp.280-288.
24. Lakshminarayanan, B, Roy, DM, and Teh, YW 2014. Mondrian forests: efficient online random forests., Advances in Neural Information Processing Systems, Montreal, Canada, pp.3140-3148.
25. Lakshminarayanan, B, Roy, DM, Teh, YW, and Unit, G 2015. Particle Gibbs for Bayesian additive regression trees., Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, pp.553-561.
26. Lichman, M (2013). UCI Machine Learning Repository. Retrieved November 10, 2017.
27. Linero, AR (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association.https://doi.org/10.1080/01621459.2016.1264957
28. Linero, AR, and Yang, Y (2017). Bayesian regression tree ensembles that adapt to smoothness and sparsity.
29. Ma, L (2017). Recursive partitioning and multi-scale modeling on conditional densities. Electronic Journal of Statistics. 11, 1297-1325.
30. MacEachern, SN (2016). Nonparametric Bayesian methods: a gentle introduction and overview. Communications for Statistical Applications and Methods. 23, 445-466.
31. McCullagh, P (1984). Generalized linear models. European Journal of Operational Research. 16, 285-292.
32. Muliere, P, and Walker, S (1997). A Bayesian non-parametric approach to survival analysis using Pólya trees. Scandinavian Journal of Statistics. 24, 331-340.
33. Murray, JS (2017). Log-linear Bayesian additive regression trees for categorical and count responses.
34. Neal, RM 1995. . Bayesian learning for neural networks (Doctoral dissertation). University of Toronto. Toronto.
35. Nielsen, D 2016. . Tree boosting with XGBoost: why does XGBoost win “every” machine learning competition? (Master’s thesis). Norwegian University of Science and Technology. Trondheim.
36. Pratola, MT (2016). Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis. 11, 885-911.
37. Quinlan, JR (1993). C4 5: Programs for Machine Learning. San Mateo: Morgan Kaufmann
38. Rasmussen, CE, and Williams, CKI (2006). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). Cambridge: MIT Press
39. Rockova, V, and van der Pas, S (2017). Posterior concentration for Bayesian regression trees and their ensembles.
40. Roy, DM, and Teh, YW (2009). The Mondrian process. Advances in Neural Information Processing Systems. 21, 1377-1381.
41. Scornet, E (2016). Random forests and kernel methods. IEEE Transactions on Information Theory. 62, 1485-1500.
42. Sparapani, RA, Logan, BR, McCulloch, RE, and Laud, PW (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in Medicine. 35, 2741-2753.
43. Taddy, MA, Gramacy, RB, and Polson, NG (2011). Dynamic trees for learning and design. Journal of the American Statistical Association. 106, 109-123.
44. Van der Vaart, AW, and Wellner, JA (1996). Weak Convergence and Empirical Processes: with Applications to Statistics. New York: Springer
45. Wand, M (2014). SemiPar: Semiparametric Regression. R package (Version 1.0-4.1)
46. Wu, Y, Tjelmeland, H, and West, M (2007). Bayesian CART: prior specification and posterior simulation. Journal of Computational and Graphical Statistics. 16, 44-66.
47. Yang, Y, and Tokdar, ST (2015). Minimax-optimal nonparametric regression in high dimensions. The Annals of Statistics. 43, 652-674.