Tilted beta regression and beta-binomial regression models: Mean and variance modeling
Communications for Statistical Applications and Methods 2024;31:263-277
Published online May 31, 2024
© 2024 Korean Statistical Society.

Edilberto Cepeda-Cuervo1,a

aDepartment of Statistics, Universidad Nacional de Colombia, Colombia
Correspondence to: 1 Department of Statistics, Universidad Nacional de Colombia, Bogotá-Colombia, Carrera 45 No 26-85, Bogotá D.C. 111321, Colombia. E-mail: ecepedac@unal.edu.co
Received February 19, 2023; Revised July 4, 2023; Accepted December 16, 2023.
 Abstract
This paper proposes new parameterizations of the tilted beta binomial distribution, obtained from the combination of the binomial distribution and the tilted beta distribution, where the beta component of the mixture is parameterized as a function of its mean and variance. These new parameterized distributions include as particular cases the beta rectangular binomial and the beta binomial distributions. After that, we propose new linear regression models to deal with overdispersed binomial datasets. These new models are defined from the proposed new parameterization of the tilted beta binomial distribution, and assume regression structures for the mean and variance parameters. These new linear regression models are fitted by applying Bayesian methods and using the OpenBUGS software. The proposed regression models are fitted to a school absenteeism dataset and to a seed germination dataset, where the germination rate is modeled according to the type of seed and root.
Keywords : count data, overdispersion, tilted beta distribution, binomial distribution, tilted beta binomial distribution, Bayesian approach
1. Introduction

The beta distribution is usually used to study continuous variables X that take values in an open interval (a, b), given that the random variable Y = (X − a)/(b − a) takes values in the open interval (0, 1) and can be assumed to have a beta distribution, B(p, q). The beta distribution appears in many applications, such as the analysis of population growth, interest rates, disease incidence, and unemployment rates. In different fields, there is often a need to model continuous random variables that are assumed to follow the beta distribution as a function of a set of explanatory variables. For this type of analysis, Cepeda-Cuervo (2001) proposed the beta regression models, where the mean, μ = p/(p + q), and dispersion, ν = p + q, parameters follow regression structures. These beta regression models were extended by Simas et al. (2010), who assumed regression structures that are nonlinear in the mean and in the dispersion parameters, and whose work has provided valuable insights for the development of new research in recent years, especially using frequentist methods. Taking into account the conditional interpretation of ν, the mean and variance beta regression models are proposed by assuming that appropriate functions of the mean and variance parameters of the beta distribution follow linear regression structures (Cepeda-Cuervo, 2015, 2023). These models improve the regression parameter inferences and interpretations.

In order to admit heavier tails in the beta distribution, Hahn (2008) proposed the beta rectangular distribution as a new distribution that, like the beta distribution, has the open interval (0, 1) as domain. The beta rectangular distribution consists of a convex combination of the beta distribution and the uniform distribution U(0, 1). Subsequently, Hahn and López Martín (2015) proposed the tilted beta distribution, which consists of a mixture of the beta distribution and the tilted distribution, and which has as particular cases the beta rectangular distribution and the beta distribution. In this paper, starting from the mean and variance beta distribution, a new parameterization of the tilted beta distribution is proposed. The mean and variance beta and beta rectangular distributions are used to improve the above proposal by including the interpretive advantages of the mean and variance beta distributions.

In count datasets, it is often found that the variance of the response variable Y exceeds the theoretical variance of the binomial distribution. This phenomenon, known as extra-binomial variation, can lead to underestimation errors, lower efficiency of the estimates and underestimation of the variance, which in turn can generate incorrect inferences about the regression parameters or their credible intervals (Collet, 1991; Cox, 1983; Williams, 1982). Combining the beta distribution with the binomial distribution leads to the beta-binomial distribution. This distribution is normally used to model the number of successes obtained in a finite number of experiments, and to study overdispersed datasets.

There are several approaches to studying overdispersed binomial datasets. Hinde and Demetrio (1998) categorized the majority of overdispersed binomial models into two classes: (1) those in which a more general shape for the variance function is assumed by adding additional parameters; and (2) models in which it is assumed that the parameter p of the binomial distribution bin(m, p) is itself a random variable. In the first class, the double exponential family of distributions allows researchers to obtain double binomial models. This enables the inclusion of a second parameter which, independently of the mean, controls the variance of the response variable and can be modeled from a subset of the explanatory variables (Efron, 1986). In the second class, the beta binomial distribution assumes that the response variable follows a binomial distribution while the probability parameter follows a beta distribution, B(a, b). When the beta distribution is parameterized in terms of its mean and dispersion parameters (Cepeda-Cuervo, 2001), the beta binomial distribution is presented in terms of the mean and dispersion parameters, as described by Cepeda-Cuervo and Cifuentes-Amado (2017).

To obtain a model more flexible than the beta binomial distribution, in which the beta distribution assigns small probability to tail values of p, Cepeda-Cuervo and Cifuentes-Amado (2017) proposed the tilted beta binomial distribution by assuming that p in the binomial distribution bin(m, p) follows a tilted beta distribution, specifically a tilted mean and dispersion beta distribution. As a particular case of this distribution, these authors defined the beta rectangular binomial distribution by assuming that the parameter p of the binomial distribution has a beta rectangular distribution. This distribution allows for defining more general overdispersion regression models than those defined from the beta binomial distribution, which produces better estimates of the regression parameters, credible or confidence intervals, and statistical inferences in the analysis of overdispersed binomial datasets. The tilted beta-binomial distribution was applied by Hahn (2022) in the analysis of overdispersed data, using maximum likelihood and Bayesian methods. He applied this distribution to the analysis of the population dataset from the 2010 US Census and found that the tilted beta-binomial distribution provided a better fit than the beta-binomial distribution. As he reported, the tilted beta-binomial distribution generalizes the beta-binomial distribution and is capable of modeling datasets with greater overdispersion than the beta-binomial distribution.

We take into account the restricted interpretation of the dispersion parameter ν, which, for a constant mean, increases when the variance decreases and decreases when the variance increases. In this paper, we propose to improve the binomial and the tilted binomial distributions, assuming that p in the binomial distribution follows the mean and variance beta distribution proposed in Cepeda-Cuervo (2023). One advantage of this alternative over the mean and dispersion regression models is that, in the latter, the interpretation of the dispersion parameter is only possible for fixed values of the mean (ν can be interpreted as a precision parameter in the sense that, for fixed values of μ, the variance decreases when ν increases), while in the proposed models, changes in the variance can be explained directly through changes in the explanatory variables of the variance regression structure. Thus, this paper proposes tilted beta binomial regression models where the mean and variance of the beta distribution, the mean of the tilted distribution, and the mixture parameter follow regression structures.

This paper is organized as follows. After the introduction, in Section 2, the mean tilted distribution is presented. In Section 3, three parameterizations of the beta distribution are considered. In Section 4, the μσ2-tilted beta distribution is introduced and the μσ2-rectangular beta distribution is presented as a particular case. In Section 5, the μσ2-tilted beta binomial distributions are defined. Section 6 presents a summary of the mean and variance beta regression models proposed by Cepeda-Cuervo (2023). In Section 7, the tilted beta binomial regression models are defined. Finally, Section 8 reports the results of two applications. Section 8.1 contains the results of analyzing a school absenteeism dataset by applying μσ2-beta binomial regression models, and Section 8.2 describes the influence of the type of seed and root on the proportion of germinated seeds in each of 21 dishes by fitting a tilted beta binomial linear regression using the OpenBUGS software. The proposed model’s performance is compared with the binomial and beta binomial regression models.

2. Tilted distribution

The tilted distribution was proposed by Hahn and López Martín (2015), and the following alternative definition was proposed by Cepeda-Cuervo and Cifuentes-Amado (2017): a random variable Y follows a tilted distribution with parameter ν if its density function is given by:

$$c(y \mid \nu) = \left[2\nu - 2(2\nu - 1)y\right] I_{(0,1)}(y), \qquad 0 \le \nu \le 1. \tag{2.1}$$

The mean of Y, denoted by μt := E(Y|ν), is μt = (2 − ν)/3. Thus, by parameterizing the density function (2.1) in terms of μt, this density function is given by:

$$c(y \mid \mu_t) = \left[3(2\mu_t - 1)(2y - 1) + 1\right] I_{(0,1)}(y), \tag{2.2}$$

where 1/3 ≤ μt ≤ 2/3 . The variance of a random variable Y that follows the density function (2.2) is given by Vt(Y) = μt(1 − μt) − 1/6. According to (2.2), it is clear that this distribution is equal to the uniform distribution when μt = 0.5, is leaning to the right when μt is smaller than 0.5, and leaning to the left when μt is bigger than 0.5.
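The moments stated above can be checked numerically. The following is a minimal sketch in plain Python, assuming a simple midpoint rule is accurate enough for this polynomial density; the function names `tilted_pdf`, `midpoint_integral` and `tilted_moments` are illustrative and do not come from the paper.

```python
def tilted_pdf(y, mu_t):
    """Tilted density (2.2) on (0, 1); mu_t must lie in [1/3, 2/3]."""
    if not (1.0 / 3.0 <= mu_t <= 2.0 / 3.0):
        raise ValueError("mu_t must lie in [1/3, 2/3]")
    if not (0.0 < y < 1.0):
        return 0.0
    return 3.0 * (2.0 * mu_t - 1.0) * (2.0 * y - 1.0) + 1.0


def midpoint_integral(g, n=20000):
    """Midpoint-rule approximation of the integral of g over (0, 1)."""
    h = 1.0 / n
    return h * sum(g((i + 0.5) * h) for i in range(n))


def tilted_moments(mu_t):
    """Return (total mass, mean, variance) of the tilted density, numerically."""
    mass = midpoint_integral(lambda y: tilted_pdf(y, mu_t))
    mean = midpoint_integral(lambda y: y * tilted_pdf(y, mu_t))
    second = midpoint_integral(lambda y: y * y * tilted_pdf(y, mu_t))
    return mass, mean, second - mean ** 2


if __name__ == "__main__":
    mu_t = 0.4
    mass, mean, var = tilted_moments(mu_t)
    # Closed-form variance from the text: mu_t (1 - mu_t) - 1/6
    print(mass, mean, var, mu_t * (1.0 - mu_t) - 1.0 / 6.0)
```

For μt = 0.5 the density is constant and the variance reduces to 1/12, the variance of the uniform distribution, which agrees with the remark that the uniform distribution is recovered in that case.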

3. μσ2-beta distribution

In this section, three parameterizations of the beta distributions are presented. A random variable Y follows a beta distribution if its density function is given by:

$$f_B(y \mid p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)}\, y^{p-1}(1-y)^{q-1} I_{(0,1)}(y), \tag{3.1}$$

where p > 0, q > 0 and Γ(·) denotes the gamma function. The mean and variance of Y, μb = E(Y) and σb2=Var(Y), are respectively given by μb = p/(p + q) and

$$\sigma_b^2 = \frac{pq}{(p+q)^2(p+q+1)}. \tag{3.2}$$

From the beta density function (3.1), the mean (μb) and dispersion (ν) beta distribution (3.3) is defined, where ν = p + q. A random variable Y follows a μbν-beta distribution if its density function is given by:

$$f_B(y \mid \mu_b, \nu) = \frac{\Gamma(\nu)}{\Gamma(\mu_b\nu)\,\Gamma(\nu(1-\mu_b))}\, y^{\mu_b\nu-1}(1-y)^{\nu(1-\mu_b)-1} I_{(0,1)}(y). \tag{3.3}$$

This parameterization of the beta distribution, presented in Ferrari and Cribari-Neto (2004), had already been proposed in the literature, for example by Jorgensen (1997) and Cepeda-Cuervo (2001), p. 63. In this parameterization of the beta distribution, the variance of Y is given by σ2 = μ(1 − μ)/(1 + ν). Thus, ν = p + q has the interpretation that, for a fixed mean, the variance of Y increases when ν decreases and decreases when ν increases.

Finally, assuming the mean and variance parameterizations of the beta distribution, proposed in Cepeda-Cuervo (2015) and Cepeda-Cuervo (2023), the beta density function is given by (3.4), where φ = 1/σb2 and

$$K = \frac{\Gamma\left(\mu_b(1-\mu_b)\phi - 1\right)}{\Gamma\left(\mu_b^2(1-\mu_b)\phi - \mu_b\right)\,\Gamma\left(\mu_b(1-\mu_b)^2\phi - (1-\mu_b)\right)}.$$

This formulation of the beta density function is proposed in Cepeda-Cuervo (2023).

$$f_B(y \mid \mu_b, \sigma_b^2) = K\, y^{\mu_b^2(1-\mu_b)\phi - \mu_b - 1}(1-y)^{\mu_b(1-\mu_b)^2\phi - (1-\mu_b) - 1} I_{(0,1)}(y). \tag{3.4}$$

The advantage of the density function (3.4) arises from the limited interpretation of the dispersion parameter ν = p + q in (3.3), while in equation (3.4), the precision parameter is given by φ = 1/σ2, where σ2 = Var(Y).
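The relation between the three parameterizations can be made concrete with a short conversion routine. This is a sketch only: the function names are hypothetical, and the validity check σ2 < μ(1 − μ) is added here as an assumption that follows from requiring p > 0 and q > 0.

```python
def beta_shapes_from_mean_variance(mu, sigma2):
    """Map (mu, sigma^2) of parameterization (3.4) to the shapes (p, q) of (3.1)."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie in (0, 1)")
    if not 0.0 < sigma2 < mu * (1.0 - mu):
        # Assumption made explicit: p, q > 0 requires sigma^2 < mu (1 - mu).
        raise ValueError("sigma2 must lie in (0, mu * (1 - mu))")
    phi = 1.0 / sigma2                    # precision
    nu = mu * (1.0 - mu) * phi - 1.0      # dispersion nu = p + q of (3.3)
    return mu * nu, (1.0 - mu) * nu


def beta_mean_variance(p, q):
    """Mean and variance (3.2) of the beta distribution B(p, q)."""
    mean = p / (p + q)
    var = p * q / ((p + q) ** 2 * (p + q + 1.0))
    return mean, var


if __name__ == "__main__":
    p, q = beta_shapes_from_mean_variance(0.3, 0.02)
    print(p, q, beta_mean_variance(p, q))  # round trip back to (0.3, 0.02)
```

The round trip illustrates that μ and σ2 can be set independently of each other, within the stated constraint, whereas in (3.3) the variance changes whenever μ changes for a fixed ν.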

4. Tilted beta distributions

This section presents a new parameterization of the tilted beta distribution proposed by Hahn and López Martín (2015), in terms of the mean (μb) and the variance (σ2) parameters of the beta distribution and the mean of the tilted distribution μt. This new parameterization of the tilted beta distribution is obtained from the convex combination of the μt-tilted distribution proposed by Cepeda-Cuervo and Cifuentes-Amado (2017) and the μbσ2-beta distribution proposed by Cepeda-Cuervo (2023).

  • Tilted μν-beta distribution.


    The tilted beta distribution was introduced by Hahn and López Martín (2015), as a convex combination of the tilted and the beta distributions. In the (μt, μb, ν, θ) parameterized form, this distribution was proposed in Cepeda-Cuervo and Cifuentes-Amado (2020) from a combination of the mean tilted distribution (2.2) and the mean and dispersion beta distribution (3.3). Thus, the (μt, μb, ν, θ) density function of this distribution is given by:


    $$f(y \mid \mu_t, \mu_b, \nu, \theta) = \theta\, c(y \mid \mu_t) + (1-\theta)\, f_B(y \mid \mu_b, \nu), \tag{4.1}$$

    where 0 < y < 1 and 0 ≤ θ ≤ 1. The notation Y ~ TB(μt, μb, ν, θ) is used to denote that Y follows this tilted beta parameterized distribution, with the mean and variance given by:


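    $$E(Y \mid \mu_t, \mu_b, \nu, \theta) = \theta\mu_t + (1-\theta)\mu_b, \tag{4.2}$$

    $$\begin{aligned} V(Y \mid \mu_t, \mu_b, \nu, \theta) &= E(Y^2 \mid \mu_t, \mu_b, \nu, \theta) - E^2(Y \mid \mu_t, \mu_b, \nu, \theta) \\ &= \left[\theta E_t(Y^2) + (1-\theta)E_b(Y^2)\right] - \left[\theta\mu_t + (1-\theta)\mu_b\right]^2 \\ &= \theta V_t(Y) + (1-\theta)V_b(Y) + \theta(1-\theta)(\mu_t - \mu_b)^2, \end{aligned} \tag{4.3}$$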

    where Et(Y2) and Vt(Y) denote the expectation of Y2 and the variance of Y, by assuming that Y follows the tilted distribution, and Eb(Y2) and Vb(Y) denote the expectation of Y2 and the variance of Y, by assuming that Y follows the beta distribution (3.3).

    The rectangular beta distribution is a particular case of (4.1) when μt = 0.5 (the slope of the tilted distribution is zero). Thus, the rectangular beta density function is given by:


    $$f(y \mid \mu, \nu, \theta) = \theta + (1-\theta)\, f_B(y \mid \mu, \nu), \tag{4.4}$$

    where 0 < y < 1.


    The tilted beta distributions are appropriate to analyze datasets that have larger variance than the beta distribution allows, or larger density values at the ends of the (0, 1) interval. For example, when μt = 0.5, the beta rectangular distribution (4.4) is obtained. For other values of μt, the tilted component of the mixture allocates more density to one side of the open (0, 1) interval and less to the other side. The tilted beta distribution has been studied and applied in the project management context by García Perez et al. (2016) and Udoumoh et al. (2017), among others.


  • Tilted mean and variance beta distribution.
    The mean and variance tilted beta distribution is introduced as the convex combination of the tilted distribution (2.2) and μbσ2-beta distribution (3.4). The density function of a random variable Y that follows this distribution is given by:


    $$f(y \mid \mu_t, \mu_b, \sigma_b^2, \theta) = \theta\, c(y \mid \mu_t) + (1-\theta)\, f_B(y \mid \mu_b, \sigma_b^2), \tag{4.5}$$

    where 0 < y < 1 and 0 ≤ θ ≤ 1. The notation Y ~ TB(μt, μb, σ2, θ) is used to denote that Y follows this distribution. The mean and the variance of Y are E(Y) = θμt + (1 − θ)μb and


    $$V(Y) = \theta\sigma_t^2 + (1-\theta)\sigma_b^2 + \theta(1-\theta)(\mu_t - \mu_b)^2,$$

    where σt2=Vt(Y) is the variance of the tilted distribution and σb2=Vb(Y) is the variance of the beta distribution.


  • Tilted mean and precision beta distribution.
    Given that precision is the inverse of variance, the tilted mean and precision beta distributions can be defined from (4.5) by writing σb2 as 1/φb. The mean and variance (or precision) beta rectangular distribution can be defined as a particular case.

The tilted mean and variance (or precision) beta distributions are appropriate to analyze datasets with larger variance than the μν-beta distributions, and they have the advantage of clearer and simpler parameter interpretations.
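The mean and variance of the tilted mean and variance beta mixture (4.5) can be verified numerically. The sketch below compares the closed-form mixture moments, with the between-component term written as θ(1 − θ)(μt − μb)² as it follows from expanding E(Y²) − E²(Y), against a midpoint-rule integration of the density. All function names are illustrative, and the parameter values in the usage line are arbitrary.

```python
import math


def tilted_pdf(y, mu_t):
    """Tilted density (2.2) evaluated at y in (0, 1)."""
    return 3.0 * (2.0 * mu_t - 1.0) * (2.0 * y - 1.0) + 1.0


def beta_pdf_mean_variance(y, mu_b, sigma2_b):
    """Beta density (3.4), parameterized by its mean and variance."""
    nu = mu_b * (1.0 - mu_b) / sigma2_b - 1.0
    p, q = mu_b * nu, (1.0 - mu_b) * nu
    log_k = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_k + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))


def tilted_beta_pdf(y, mu_t, mu_b, sigma2_b, theta):
    """Mixture density (4.5): theta weights the tilted component."""
    return theta * tilted_pdf(y, mu_t) + (1.0 - theta) * beta_pdf_mean_variance(y, mu_b, sigma2_b)


def closed_form_moments(mu_t, mu_b, sigma2_b, theta):
    """Mixture mean and variance from the component moments."""
    var_t = mu_t * (1.0 - mu_t) - 1.0 / 6.0
    mean = theta * mu_t + (1.0 - theta) * mu_b
    var = theta * var_t + (1.0 - theta) * sigma2_b + theta * (1.0 - theta) * (mu_t - mu_b) ** 2
    return mean, var


def numerical_moments(mu_t, mu_b, sigma2_b, theta, n=20000):
    """Mean and variance by midpoint-rule integration over (0, 1)."""
    h = 1.0 / n
    ys = [(i + 0.5) * h for i in range(n)]
    dens = [tilted_beta_pdf(y, mu_t, mu_b, sigma2_b, theta) for y in ys]
    mean = h * sum(y * d for y, d in zip(ys, dens))
    second = h * sum(y * y * d for y, d in zip(ys, dens))
    return mean, second - mean ** 2


if __name__ == "__main__":
    args = (0.6, 0.3, 0.02, 0.25)  # mu_t, mu_b, sigma2_b, theta (arbitrary values)
    print(closed_form_moments(*args), numerical_moments(*args))
```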

5. Tilted beta binomial distributions

At the beginning of this section, in Subsection 5.1, the mean and the “dispersion” (ν = a+b) tilted beta binomial distribution is presented, following its definition proposed by Cepeda-Cuervo and Cifuentes-Amado (2017). After that, the mean and variance parametrization of this distribution is proposed in Subsection 5.2. Finally, in Subsection 5.3, following Cepeda-Cuervo (2023), we present the mean and variance beta binomial density function.

5.1. Tilted μν-beta binomial distributions

Let Y|p ~ bin(m, p) be a random variable that follows the binomial distribution, where p follows the tilted beta distribution, p ~ TB(μt, μb, ν, θ). Then Y follows a tilted beta binomial distribution with parameters μt, μb, ν and θ, denoted by Y ~ TBB(μt, μb, ν, θ), and its probability function is given by:

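$$\begin{aligned} f(y \mid \mu_t, \mu_b, \nu, \theta) &= \int_0^1 f_{Bin}(y \mid m, p)\left[\theta\, c(p \mid \mu_t) + (1-\theta)\, f_{Beta}(p \mid \mu_b, \nu)\right] dp \\ &= 2\theta \binom{m}{y}\left[\frac{y(6\mu_t - 3) + m(2 - 3\mu_t) + 1}{m+2}\right] B(y+1, m-y+1) + (1-\theta)\, f_{BB(\mu_b,\nu)}(y), \end{aligned} \tag{5.1}$$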

where y = 0, 1, . . . ,m; B(·, ·) denotes the beta function, and fBB(μb,ν)(·) denotes the density function of the beta binomial distribution, which is parameterized in terms of the mean and the dispersion parameters (Cepeda-Cuervo and Cifuentes-Amado, 2020).

The mean and variance of a random variable Y that follows the (μt, μb, ν, θ)-tilted beta binomial probability function are given by: E(Y) = E(E(Y|p)) = mE(p) = m [θμt + (1 − θ)μb] and

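$$\begin{aligned} V(Y) &= V(E(Y \mid p)) + E(V(Y \mid p)) \\ &= m^2 V(p) + m\left[E(p) - E(p^2)\right] \\ &= m\left\{(m-1)V(p) + E(p)(1 - E(p))\right\} \\ &= m\left\{(m-1)\left[\theta V_t + (1-\theta)V_b + \theta(1-\theta)(\mu_t - \mu_b)^2\right] + \left[\theta\mu_t + (1-\theta)\mu_b\right]\left[1 - \theta\mu_t - (1-\theta)\mu_b\right]\right\}, \end{aligned}$$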

where μb and Vb denote the mean and variance of the beta distribution, respectively, and μt and Vt denote the mean and variance of the tilted distribution. The behavior of the (μt, μb, ν, θ)-tilted beta binomial probability function is illustrated in Cepeda-Cuervo and Cifuentes-Amado (2020), for different vectors of parameter values.
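As an illustration, the probability function (5.1) can be evaluated directly from log-gamma functions. The following is a minimal sketch under the assumption that the beta binomial component has the usual form with shape parameters μbν and (1 − μb)ν; the helper names are hypothetical. The check at the end confirms that the probabilities sum to one and that the mean equals m[θμt + (1 − θ)μb].

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_binom(m, y):
    """Logarithm of the binomial coefficient C(m, y)."""
    return math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)


def beta_binomial_pmf(y, m, mu_b, nu):
    """Beta binomial probability with beta shapes mu_b * nu and (1 - mu_b) * nu."""
    a, b = mu_b * nu, (1.0 - mu_b) * nu
    return math.exp(log_binom(m, y) + log_beta(y + a, m - y + b) - log_beta(a, b))


def tilted_binomial_pmf(y, m, mu_t):
    """Binomial probability integrated against the tilted density (2.2)."""
    weight = 2.0 * (y * (6.0 * mu_t - 3.0) + m * (2.0 - 3.0 * mu_t) + 1.0) / (m + 2.0)
    return weight * math.exp(log_binom(m, y) + log_beta(y + 1.0, m - y + 1.0))


def tbb_pmf(y, m, mu_t, mu_b, nu, theta):
    """Tilted beta binomial probability function (5.1)."""
    return theta * tilted_binomial_pmf(y, m, mu_t) + (1.0 - theta) * beta_binomial_pmf(y, m, mu_b, nu)


if __name__ == "__main__":
    m, mu_t, mu_b, nu, theta = 10, 0.6, 0.3, 8.0, 0.25  # arbitrary values
    probs = [tbb_pmf(y, m, mu_t, mu_b, nu, theta) for y in range(m + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    print(sum(probs), mean, m * (theta * mu_t + (1.0 - theta) * mu_b))
```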

A particular case of this distribution is the Tilted (μb,ν,θ)-beta rectangular binomial distribution. Y follows this distribution if Y|p follows a binomial distribution, Y|p ~ bin(m, p), where p follows the beta rectangular distribution (4.4). This density function of Y can be obtained as a particular case of the tilted beta binomial distribution (5.1) by replacing μt with 0.5:

$$f(y \mid \mu_b, \nu, \theta) = \binom{m}{y}\theta\, B(y+1, m-y+1) + (1-\theta)\, f_{BB(\mu_b,\nu)}(y),$$

where y = 0, 1, . . . ,m. From the equations of the mean (4.2) and variance (4.3) of the tilted beta distribution, with μt = 0.5, the mean and variance of the beta rectangular binomial distribution are given by:

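$$E(Y) = m\left[\frac{\theta}{2} + (1-\theta)\mu\right],$$

$$V(Y) = (m^2 - m)\left[\frac{\mu(1-\mu)}{1+\nu}(1-\theta)\left(1 - \theta(1+\nu)\right) + \frac{\theta}{12}(4 - 3\theta)\right] + m\left[\frac{\theta}{2} + (1-\theta)\mu\right]\left[\frac{2-\theta}{2} - (1-\theta)\mu\right].$$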

5.2. Tilted μσ2-beta binomial distributions

Let Y|p ~ bin(m, p) be a random variable that follows a binomial distribution, where p follows the tilted beta distribution, p ~ TB(μt, μb, σ2, θ). Then Y follows the tilted beta binomial distribution with parameters μt, μb, σ2 and θ, denoted by Y ~ TBB(μt, μb, σ2, θ). The probability function of this distribution is given by:

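$$f(y \mid \mu_t, \mu_b, \sigma^2, \theta) = 2\theta \binom{m}{y}\left[\frac{y(6\mu_t - 3) + m(2 - 3\mu_t) + 1}{m+2}\right] B(y+1, m-y+1) + (1-\theta)\, f_{BB(\mu_b,\sigma^2)}(y), \tag{5.3}$$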

which for μt = 0.5 is the beta rectangular binomial distribution.

A particular case of the distribution (5.3) is the (μb, σ2, θ)- tilted beta rectangular binomial distribution. In this case, Y follows the (μb,σ2,θ)-beta rectangular binomial distribution if Y|p ~ bin(m, p) is a random variable that follows a binomial distribution, and where p follows the beta rectangular distribution. The density function of this distribution can be obtained as a particular case of the tilted beta binomial distribution (5.3), by replacing μt with 1/2:

$$f(y \mid \mu_b, \sigma^2, \theta) = \binom{m}{y}\theta\, B(y+1, m-y+1) + (1-\theta)\, f_{BB(\mu_b,\sigma^2)}(y),$$

where y = 0, 1, . . . ,m.

5.3. μσ2-beta binomial distribution

The μσ2-beta binomial distribution, defined in Cepeda-Cuervo (2023), is obtained by assuming that a random variable Y follows a binomial distribution B(m, p), where p follows the μσ2-beta distribution (3.4). The μσ2-beta binomial probability function is given by:

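$$f(y \mid \mu, \sigma^2) = \binom{m}{y}\frac{B\left(y + \mu(\mu(1-\mu)\phi - 1),\; m - y + (\mu(1-\mu)\phi - 1)(1-\mu)\right)}{B\left(\mu(\mu(1-\mu)\phi - 1),\; (\mu(1-\mu)\phi - 1)(1-\mu)\right)},$$

where y = 0, 1, . . . ,m, 0 < μ < 1, φ = 1/σ2 and 0 < σ2 < 1/4. Here φ is the precision parameter of the beta distribution.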
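To see how σ2 enters the count distribution, the μσ2-beta binomial probability function can be evaluated and its moments compared with those of a plain binomial. This is a hedged sketch: the function names are illustrative, the parameter values are arbitrary, and the variance identity m{(m − 1)σ2 + μ(1 − μ)} used in the check follows from the law of total variance rather than from a formula stated in this section.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def mean_variance_beta_binomial_pmf(y, m, mu, sigma2):
    """mu-sigma^2 beta binomial probability of y successes in m trials."""
    phi = 1.0 / sigma2
    nu = mu * (1.0 - mu) * phi - 1.0        # must be positive
    if nu <= 0.0:
        raise ValueError("sigma2 must be smaller than mu * (1 - mu)")
    a, b = mu * nu, (1.0 - mu) * nu
    log_c = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    return math.exp(log_c + log_beta(y + a, m - y + b) - log_beta(a, b))


def pmf_moments(m, mu, sigma2):
    """Mean and variance of Y computed by summing over y = 0, ..., m."""
    probs = [mean_variance_beta_binomial_pmf(y, m, mu, sigma2) for y in range(m + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum(y * y * p for y, p in enumerate(probs)) - mean ** 2
    return mean, var


if __name__ == "__main__":
    m, mu, sigma2 = 12, 0.3, 0.02  # arbitrary values
    mean, var = pmf_moments(m, mu, sigma2)
    binomial_var = m * mu * (1.0 - mu)
    # The beta binomial variance exceeds the binomial one whenever m > 1.
    print(mean, var, binomial_var)
```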

6. Mean and variance beta regression models

The beta regression model was proposed in Cepeda-Cuervo (2001), under a Bayesian framework by assuming that the mean (μ) and the dispersion (ν = a+b) parameters follow linear regression structures given by:

$$h(\mu_i) = x_i^{t}\beta, \tag{6.1}$$

$$g(\nu_i) = z_i^{t}\gamma, \tag{6.2}$$

where h is the logit function; g is the logarithmic function; and β = (β0, β1, . . . , βk)t and γ = (γ0, γ1, . . . , γp)t are the vectors of the mean and dispersion regression parameters, respectively; xi = (xi1, . . . , xik)t is the vector of the mean explanatory variables; and zi = (zi1, . . . , zip)t is the vector of the dispersion explanatory variables at the ith observation. A frequentist approach to the beta regression models was presented by Ferrari and Cribari-Neto (2004), assuming that h is an appropriate real valued function, strictly monotonic and twice differentiable, defined on the interval (0, 1), and ν is a constant dispersion parameter. These authors presented a wide range of applications where the practitioner needs to assume regression structures to explain the behavior of the variables of interest. Although many variations of mean and dispersion beta regression models have been developed in recent years, these proposals have at least two drawbacks. The first is the interpretability of the dispersion parameter ν: it can be considered a precision parameter only for a constant mean, in the sense that, for fixed μ, the variance decreases when ν increases. The second is the lack of an explicit regression structure for the variance, which impairs the quality of the posterior regression parameter inferences.

A first approach to the mean and variance beta regression models was proposed in Cepeda-Cuervo (2015) and a general definition was formulated in Cepeda-Cuervo (2023). In Cepeda-Cuervo (2023), the mean regression structure is given by (6.3) and the variance (or precision) regression structure is given by (6.4), where h(·) and g(·) are real functions defined in the open interval (0, 1), like the logit, probit, log-log and complementary log-log functions.

$$h(\mu_i) = x_i^{t}\beta, \tag{6.3}$$

$$g(4\sigma_i^2) = z_i^{t}\gamma. \tag{6.4}$$

If, as in Cepeda-Cuervo (2023), for example, the mean and variance of the beta regression model are given by logit(μ) = xtβ and logit(4σ2) = ztγ, then the parameter estimates of the mean and variance regression structures are easily interpretable.

  • If X1 is an explanatory variable associated with parameter β1 where β1 > 0, increasing behavior of X1 is associated with an increasing mean, and where β1 < 0, increasing behavior of X1 is associated with a decreasing mean.

  • If Z1 is an explanatory variable associated with parameter γ1 where γ1 > 0, increasing behavior of Z1 is associated with increasing variance, and where γ1 < 0, increasing behavior of Z1 is associated with decreasing variance.

The mean and precision (φ = 1/σ2) beta regression model can be defined by the mean regression structure (6.3) and by g(φ − 4) = ztγ, where g(·) is the logarithmic function or some other appropriate real function from the positive real numbers to the real numbers.
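The logit structures logit(μ) = xtβ and logit(4σ2) = ztγ can be inverted to recover μ and σ2 from the linear predictors. The sketch below uses hypothetical coefficient values and helper names. It also flags whether the resulting pair is a valid beta mean and variance, under the assumption (which follows from requiring positive shape parameters) that σ2 must be smaller than μ(1 − μ).

```python
import math


def expit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def mean_and_variance(x, beta, z, gamma):
    """Invert logit(mu) = x'beta and logit(4 sigma^2) = z'gamma."""
    mu = expit(dot(x, beta))
    sigma2 = expit(dot(z, gamma)) / 4.0   # lies in (0, 1/4)
    return mu, sigma2


def is_valid_beta(mu, sigma2):
    """A beta distribution with these moments exists only if sigma^2 < mu (1 - mu)."""
    return 0.0 < sigma2 < mu * (1.0 - mu)


if __name__ == "__main__":
    beta, gamma = [0.2, 0.8], [-2.0, 0.5]  # hypothetical coefficients
    for z1 in (0.0, 1.0, 2.0):
        mu, sigma2 = mean_and_variance([1.0, 0.5], beta, [1.0, z1], gamma)
        print(z1, mu, sigma2, is_valid_beta(mu, sigma2))
```

With a positive γ1, the printed variance increases with the covariate value, in line with the interpretation of γ1 given above.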

The results of the statistical analysis of the dyslexic dataset presented in Cepeda-Cuervo (2023), and obtained by applying μσ2-beta regression models, reveal the good performance of this model and the easy interpretation of the posterior parameter inferences compared with that obtained from fitting the μν-beta regression model to this dataset. In the μσ2-beta regression model, the variance of the variable of interest is interpreted according to items 1 and 2 of this section, which is unconditional to the mean values. Thus, the mean and variance beta regression models, defined in (6.3) and (6.4), have a substantial interpretative advantage compared with the mean and “dispersion” models, defined by (6.1) and (6.2).

Additionally, in the results of simulation processes, the mean and variance models outperform the mean and dispersion models, which can be established by statistical methods. In these simulations, the explanatory variables can be generated from uniform distributions, the mean and dispersion parameters are obtained from their respective mean and dispersion structures, and the observations of the variable of interest are generated from the beta distributions. Finally, the beta regression models were fitted to the resulting dataset, and the model with the best fit was the mean and variance beta regression model, which had the smallest residuals.

With the new parameterization of the beta distributions proposed by Cepeda-Cuervo (2023), the tilted mean and variance beta regression model is defined from the mixture distribution (4.5), a convex combination of the tilted distribution (2.2) and the μbσ2-beta density function (3.4), where appropriate functions of their parameters follow linear regression structures:

Let Yi ~ TB(μti, μbi, σbi2, θi), i = 1, 2, . . . , n, be independent random variables with tilted mean and variance beta distribution. Let xi = (xi1, . . . , xip)t, zi = (zi1, . . . , zik)t, wi = (wi1, . . . ,wil)t and x̃i = (x̃i1, . . . , x̃is)t be the covariate vectors of the μbi, σbi2, θi and μti regression structures, and let β = (β1, . . . , βp)t, γ = (γ1, . . . , γk)t, δ = (δ1, . . . , δl)t and α = (α1, . . . , αs)t be the respective regression parameter vectors. Thus, the tilted mean and variance regression models are defined from the mean and variance tilted beta distribution (4.5) by assuming the following regression structures.

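$$\operatorname{logit}(\mu_{bi}) = x_i^{t}\beta, \qquad \log(4\sigma_{bi}^2) = z_i^{t}\gamma, \qquad \operatorname{logit}(\theta_i) = w_i^{t}\delta, \qquad \operatorname{logit}(3\mu_{ti} - 1) = \tilde{x}_i^{t}\alpha.$$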

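This parameterization of the tilted beta distribution has some interpretive advantages relative to other parameterizations of this distribution. In this parameterization, if Yi follows a tilted beta distribution, then, from E(Y) = θμt + (1 − θ)μb,

$$E(Y_i) = \theta_i\left(\frac{\exp(\tilde{x}_i^{t}\alpha)}{3 + 3\exp(\tilde{x}_i^{t}\alpha)} + \frac{1}{3}\right) + (1-\theta_i)\left(\frac{\exp(x_i^{t}\beta)}{1 + \exp(x_i^{t}\beta)}\right).$$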

Thus, the contribution of an explanatory variable to the mean behavior of the variable of interest Y can be easily established. A similar argument can be established to explain the contribution of the explanatory variables to the behavior of the variance.

Many extensions of the mean and variance beta regression models can be proposed, for example by assuming nonlinear regression structures for the mean and variance, following the valuable insights provided by Simas et al. (2010).

7. Tilted beta binomial regression models

In item 1 of this section, the tilted beta binomial regression models defined in Cepeda-Cuervo and Cifuentes-Amado (2020), where μb, ν and θ follow regression structures, are extended to include a regression structure for the mean of the tilted component of the mixture. Additionally, in item 2, considering the reduced interpretation of the “dispersion” parameter in the beta density function (3.3) and in the tilted beta binomial distribution (5.1), we propose the μtμbσ2θ-tilted beta binomial regression models, where μt and σ2 also follow regression structures. Finally, in item 3, as a particular case of item 2, the μbσ2-beta binomial regression models, proposed in Cepeda-Cuervo (2023), are presented.

  • Tilted μtμbνθ-beta binomial regression models: Let Yi ~ TBB(μti, μbi, νi, θi), i = 1, 2, . . . , n, be independent random variables with tilted beta binomial distribution. Let xi = (xi1, . . . , xip)t, zi = (zi1, . . . , zik)t, wi = (wi1, . . . ,wil)t and x̃i = (x̃i1, . . . , x̃is)t be the covariate vectors of the μbi, νi, θi and μti regression structures, and let β = (β1, . . . , βp)t, γ = (γ1, . . . , γk)t, δ = (δ1, . . . , δl)t and α = (α1, . . . , αs)t be the respective regression parameter vectors, such that:


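    $$\operatorname{logit}(\mu_{bi}) = x_i^{t}\beta, \tag{7.1}$$

    $$\log(\nu_i) = z_i^{t}\gamma, \tag{7.2}$$

    $$\operatorname{logit}(\theta_i) = w_i^{t}\delta, \tag{7.3}$$

    $$\operatorname{logit}(3\mu_{ti} - 1) = \tilde{x}_i^{t}\alpha. \tag{7.4}$$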

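    Thus, the likelihood function of the TBB(μti, μbi, νi, θi)-regression model is given by

    $$L = \prod_{i=1}^{n} f(y_i \mid \mu_{ti}, \mu_{bi}, \nu_i, \theta_i),$$

    where f(·|μti, μbi, νi, θi) is given by (7.5).

    $$f(y_i \mid \mu_{ti}, \mu_{bi}, \nu_i, \theta_i) = 2\theta_i \binom{m_i}{y_i}\left[\frac{y_i(6\mu_{ti} - 3) + m_i(2 - 3\mu_{ti}) + 1}{m_i + 2}\right] B(y_i + 1, m_i - y_i + 1) + (1-\theta_i)\, f_{BB(\mu_{bi},\nu_i)}(y_i). \tag{7.5}$$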

  • μtμbσ2θ-tilted beta binomial regression models. These models are defined from the μtμbσ2θ-tilted beta binomial distribution (5.3) by assuming the following regression structures: (7.1) for μb, (7.3) for θi, (7.4) for μt, and logit(4σi2)=zitγ for σi2.

  • μbσ2-beta binomial regression models. These models are defined from the μbσ2-beta binomial distribution given in Section 5.3 by assuming the following regression structures: (7.1) for μb and logit(4σi2)=zitγ for σi2, as proposed in Cepeda-Cuervo (2023).
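The likelihood of the μtμbνθ-tilted beta binomial regression model in item 1 can be assembled from the link functions (7.1) to (7.4) and the probability function (7.5). The following is a minimal sketch, not the OpenBUGS implementation used in the paper: the data layout (a list of tuples) and all function names are assumptions made for illustration.

```python
import math


def expit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def tbb_pmf(y, m, mu_t, mu_b, nu, theta):
    """Tilted beta binomial probability function (7.5)."""
    log_c = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    weight = 2.0 * (y * (6.0 * mu_t - 3.0) + m * (2.0 - 3.0 * mu_t) + 1.0) / (m + 2.0)
    tilted = weight * math.exp(log_c + log_beta(y + 1.0, m - y + 1.0))
    a, b = mu_b * nu, (1.0 - mu_b) * nu
    beta_bin = math.exp(log_c + log_beta(y + a, m - y + b) - log_beta(a, b))
    return theta * tilted + (1.0 - theta) * beta_bin


def tbb_log_likelihood(data, beta, gamma, delta, alpha):
    """Sum of log f(y_i | ...) over observations.

    data: iterable of (y, m, x, z, w, x_tilde) tuples (hypothetical layout).
    """
    total = 0.0
    for y, m, x, z, w, x_tilde in data:
        mu_b = expit(dot(x, beta))                        # (7.1)
        nu = math.exp(dot(z, gamma))                      # (7.2)
        theta = expit(dot(w, delta))                      # (7.3)
        mu_t = (1.0 + expit(dot(x_tilde, alpha))) / 3.0   # (7.4) inverted
        total += math.log(tbb_pmf(y, m, mu_t, mu_b, nu, theta))
    return total


if __name__ == "__main__":
    # Two made-up observations with intercept-only structures.
    data = [(3, 10, [1.0], [1.0], [1.0], [1.0]), (7, 12, [1.0], [1.0], [1.0], [1.0])]
    print(tbb_log_likelihood(data, [0.1], [1.5], [-1.0], [0.3]))
```

In a Bayesian analysis this log-likelihood would be combined with the log-prior densities of β, γ, δ and α; replacing (7.2) with logit(4σi2) = zitγ and converting (μb, σ2) to ν gives the item 2 model.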

Hahn (2022), in his applications and simulations, established that the performance of the tilted beta-binomial distribution is better than the beta-binomial model, including a first application to big data. The author found evidence for the existence of the beta-binomial and tilted binomial components in applications of demographic datasets.

8. Applications

This section includes posterior parameter inferences that are obtained by applying the μtμbσ2θ-tilted beta binomial regression models to analyze the school absenteeism dataset in Section 8.1 and the seed germination dataset in Section 8.2. In both cases, in order to define the Bayesian tilted beta binomial regression model, the following a priori distributions are assumed for the regression parameters: β ~ N(0, B), γ ~ N(0,G), δ ~ N(0, D) and α ~ N(0, D). If there are no explanatory variables for μt, then (7.4) reduces to logit(3μti − 1) = α0, and α0 ~ N(0, 10^k), where k is a positive real number, can be assumed as the prior distribution. Alternatively, given that 1/3 ≤ μt ≤ 2/3, the uniform distribution μt ~ U(1/3, 2/3) can be assumed as the prior distribution of μt.
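The prior that a normal distribution on α0 induces on μt can be explored by simulation. The sketch below assumes that the second argument of N(0, 10^k) is a variance (in OpenBUGS, `dnorm` is parameterized by a precision, so the corresponding model code would use the reciprocal); the function names and the choice of k are illustrative.

```python
import math
import random


def expit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0.0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def mu_t_from_alpha0(alpha0):
    """Invert logit(3 mu_t - 1) = alpha0, so that mu_t lies in [1/3, 2/3]."""
    return (1.0 + expit(alpha0)) / 3.0


def sample_mu_t_prior(k, n_draws, seed=0):
    """Draw mu_t from the prior induced by alpha0 ~ N(0, 10^k), 10^k read as a variance."""
    rng = random.Random(seed)
    sd = math.sqrt(10.0 ** k)
    return [mu_t_from_alpha0(rng.gauss(0.0, sd)) for _ in range(n_draws)]


if __name__ == "__main__":
    draws = sample_mu_t_prior(k=2, n_draws=5000)
    near_edges = sum(1 for d in draws if d < 0.34 or d > 0.66) / len(draws)
    # A very diffuse prior on alpha0 piles mass near 1/3 and 2/3 on the mu_t scale.
    print(min(draws), max(draws), near_edges)
```

This is one reason the uniform prior μt ~ U(1/3, 2/3) mentioned above may be a convenient alternative when no covariates enter the μt structure.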

8.1. School absenteeism dataset

The first dataset analyzed in this paper was originally presented in Quine (1975) and comes from a sociological study of Australian Aboriginal and White children from Walgett, New South Wales, with nearly equal numbers of the two sexes and equal numbers from the two cultural groups. Children were classified by culture, age, sex, and learner status, and the number of days absent from school in a particular school year was recorded. In this dataset, the response variable of interest is the number of days that a child was absent during the school year (days absent: Y). The explanatory variables are the following factors with two levels:

  • Cultural or ethnic background (CB): Aboriginal (0) and White (1).

  • Learning ability (LA): Slow learner (0) and Average learner (1).

Since the variable Days Absent, Y, counts the number of events that occurred during a year, this dataset was analyzed by Cepeda-Cuervo and Cifuentes-Amado (2017) assuming a negative binomial model NB(μ, α), where the mean and the shape parameters follow linear regression structures. In this paper, assuming that a school year has 200 days, we analyze this dataset by applying the μtμbσ2θ-tilted beta binomial regression model, and assume the following linear regression structures:

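$$\operatorname{logit}(\mu_{bi}) = \beta_0 + \beta_3 CB_i + \beta_4 LA_i, \tag{8.1}$$

$$\operatorname{logit}(4\sigma_i^2) = \gamma_0 + \gamma_2 LA_i, \tag{8.2}$$

$$\operatorname{logit}(\theta) = g_0, \tag{8.3}$$

$$\operatorname{logit}(3(\mu_t - 1/3)) = d_0. \tag{8.4}$$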

This tilted beta binomial regression model was fitted to the dataset by applying Bayesian methods and using the OpenBUGS software. Assuming the regression structures given by (8.1) to (8.4), the posterior parameter estimates, standard deviations and 95% credible intervals are given in Table 1 (Model 1). From the posterior samples of g0 and d0, and the structures (8.3) and (8.4), posterior samples of θ and μt were obtained, with the posterior parameter estimates and the respective standard deviations (in parentheses) being θ = 0.0013368 (0.0005343) and μt = 0.4789 (0.1520). From these estimates, it is clear that the parameter estimate of θ is close to zero. For this reason, a μσ2-beta binomial regression model, defined by (8.1) and (8.2), was fitted. Its parameter estimates are reported in the same table (Model 2).

The parameter estimates of the mean and variance regression structures of Model 2 agree with those of Model 1. The deviance information criterion (DIC) values and the sums of squared errors (SSE) are similar, although the DIC value of Model 1 is slightly smaller than that of Model 2. In both models, the estimates of γ2 are negative, which shows that the variance is larger for slow learners (LA = 0) than for average learners (LA = 1).
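For reference, under the usual definition DIC = Dbar + pD with pD = Dbar − Dhat (equivalently DIC = Dhat + 2pD), the effective number of parameters pD can be recovered from the DIC and Dhat values reported in Table 1. The short sketch below assumes that definition, which is the one used by the DIC tool in OpenBUGS; the function name is hypothetical.

```python
def effective_parameters(dic, dhat):
    """Recover pD and Dbar from DIC = Dhat + 2 * pD and pD = Dbar - Dhat."""
    p_d = (dic - dhat) / 2.0
    d_bar = dhat + p_d
    return p_d, d_bar


# DIC and Dhat as reported in Table 1.
table_1 = {"Model 1": (-372.7, -383.2), "Model 2": (-372.3, -383.1)}

for name, (dic, dhat) in table_1.items():
    p_d, d_bar = effective_parameters(dic, dhat)
    print(f"{name}: pD = {p_d:.2f}, Dbar = {d_bar:.2f}")
```

With the reported values, pD comes out at about 5.25 for Model 1 and 5.40 for Model 2, so the two models have nearly the same effective complexity, in line with their similar DIC values.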

The 95% posterior credible interval for a regression parameter is given by real numbers LI < LS such that the posterior probability that the parameter lies between LI and LS is 95%. These limits were obtained from the posterior sample by leaving 2.5% of the sampled values in each tail.
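The construction of LI and LS from a posterior sample can be sketched as follows. This is a minimal Python illustration using empirical quantiles; the function name and the simulated draws are hypothetical and do not come from the fitted models.

```python
import numpy as np


def equal_tailed_interval(draws, level=0.95):
    """Return (LI, LS), cutting (1 - level) / 2 of the draws from each tail."""
    tail = (1.0 - level) / 2.0
    li, ls = np.quantile(np.asarray(draws, dtype=float), [tail, 1.0 - tail])
    return float(li), float(ls)


# Hypothetical posterior draws of a regression parameter.
rng = np.random.default_rng(1)
draws = rng.normal(loc=-0.5, scale=0.1, size=20000)

li, ls = equal_tailed_interval(draws)
inside = np.mean((draws >= li) & (draws <= ls))
print(li, ls, inside)  # `inside` is close to 0.95 by construction
```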

8.2. Seed germination dataset

The dataset analyzed in this section is available in Spiegelhalter et al. (2003) and corresponds to the number of seeds that germinated from an initial quantity arranged in each of 21 dishes organized according to a 2 by 2 factorial design (2 seed types and 2 root types). These data were initially reported by Crowder (1978). The variables involved in the experiment are described as follows:

  • Y: number of seeds germinated in each dish.

  • n: number of seeds initially arranged in each dish.

  • X1: seed type (0) if it is O. aegyptiaca 73 and (1) if it is O. aegyptiaca 75.

  • X2: root type (0) if it is a bean and (1) if it is a cucumber.

In this experiment, there are 21 observations (21 dishes). Since the variable Y represents the number of germinated seeds in each dish, it can be assumed to follow the TBB(μt, μb, σ2, θ) distribution. The seed germination dataset can thus be analyzed by applying the TBB linear regression model defined by the regression structures given in equations (7.1) to (7.4), starting with all the explanatory variables in each of the regression structures. After eliminating explanatory variables, the best model (smallest DIC value) has the following regression structures:

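\begin{align}
\operatorname{logit}(\mu_{ib}) &= \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}, \tag{8.5}\\
\operatorname{logit}(4\sigma_i^2) &= \gamma_0 + \gamma_1 x_{2i}, \tag{8.6}\\
\operatorname{logit}(\theta_i) &= c_0 + c_1 x_{2i}, \tag{8.7}
\end{align}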

with constant tilted mean μt. Thus, assuming normal prior distributions N(0, 10^k), with k = 5, for the regression parameters (the β, γ and c coefficients) and a uniform prior distribution μt ~ U(1/3, 2/3) for the mean of the tilted distribution, the TBB(μt, μb, σ2, θ) model was fitted to this dataset using OpenBUGS, which is a free program for fitting Bayesian models by Gibbs sampling (Spiegelhalter et al., 2003). The posterior parameter inferences, obtained from a sample of size 100000 with a burn-in of 10000 and keeping one draw in every 10 to reduce autocorrelation, are summarized in Table 2 (Model 3).
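The effect of the burn-in and thinning settings can be illustrated with a short Python sketch. It assumes that 100000 is the total number of iterations, which the text does not state explicitly, and it uses a simulated AR(1) sequence as a stand-in for the raw sampler output; the function names are hypothetical.

```python
import numpy as np


def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
    chain = np.asarray(chain)
    return chain[burn_in::thin]


def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a one-dimensional sequence."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


# Hypothetical autocorrelated chain (an AR(1) process), standing in for
# the raw output of a Gibbs sampler.
rng = np.random.default_rng(2)
n_iter, rho = 100_000, 0.9
noise = rng.normal(size=n_iter)
chain = np.empty(n_iter)
chain[0] = 0.0
for t in range(1, n_iter):
    chain[t] = rho * chain[t - 1] + noise[t]

kept = burn_and_thin(chain, burn_in=10_000, thin=10)
print(len(kept))
print(lag1_autocorrelation(chain), lag1_autocorrelation(kept))
```

Under this reading, 9000 draws are retained for inference, and the retained draws show a visibly lower lag-1 autocorrelation than the raw chain.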

Given that 0 belongs to the 95% credible intervals of γ1 and c1, a tilted μσ2-beta binomial model with the regression structures (8.5) to (8.7), but without x2 in the variance and mixture regression structures, was fitted to this dataset. Its posterior parameter inferences are reported in Table 2 (Model 4). From this table, it is possible to conclude that Model 4 is the best (smallest DIC value, smallest SSE and all the null hypotheses of the regression parameters rejected).

To compare the performance of the μtμbσ2θ- and the μtμbνθ-tilted beta binomial regression models in the analysis of the seed germination dataset, Table 3 presents the posterior parameter estimates obtained when the μtμbνθ-tilted beta binomial regression model is fitted to this dataset, assuming the regression structures given by:

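\begin{align}
\operatorname{logit}(\mu_{ib}) &= \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}, \tag{8.8}\\
\log(\nu_i) &= \gamma_0 + \gamma_1 x_{2i}, \tag{8.9}\\
\operatorname{logit}(\theta_i) &= c_0 + c_1 x_{2i}, \tag{8.10}
\end{align}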

with constant mean μt of the tilted distribution and the same prior distributions as in the first application. The posterior parameter estimates of the tilted beta binomial model with regression structures given by (8.8), (8.9) and (8.10) are reported in Table 3 (Model 5). The parameter estimates of the reduced tilted μν-beta binomial regression model are given in the same table (Model 6).

From the results reported in these tables, it is possible to conclude that the estimates of the mean and mixture regression parameters agree between the μtμbνθ- and μtμbσ2θ-tilted beta binomial regression models. The parameter estimates of the ν and σ2 regression structures are congruent with their parameter definitions. The DIC and SSE values are the smallest for Model 4 among the μtμbσ2θ-tilted beta binomial regression models. Thus, in this application, Model 4 is assumed to be the best.
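The congruence between the ν and σ2 structures can be checked by hand, since for a beta distribution with mean μ and ν = p + q the variance is σ2 = μ(1 − μ)/(1 + ν). The Python sketch below uses posterior means from Tables 2 and 3 (Models 4 and 6) as plug-in values for the baseline group. This is only a rough check, because a function of posterior means is not the posterior mean of that function, and the function names are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def variance_from_nu(mu, nu):
    """Beta variance in terms of the mean and nu = p + q."""
    return mu * (1.0 - mu) / (1.0 + nu)


def nu_from_variance(mu, sigma2):
    """Invert the relation above; requires 0 < sigma2 < mu * (1 - mu)."""
    if not 0.0 < sigma2 < mu * (1.0 - mu):
        raise ValueError("sigma2 must lie in (0, mu * (1 - mu))")
    return mu * (1.0 - mu) / sigma2 - 1.0


# Plug-in posterior means for the baseline group (x1 = x2 = 0).
mu_b = expit(-0.774)            # beta0, Model 4 (Table 2)
sigma2 = expit(-3.864) / 4.0    # gamma0, Model 4: logit(4 * sigma2) = gamma0
nu = math.exp(3.690)            # gamma0, Model 6 (Table 3): log(nu) = gamma0

# Each line prints one fitted value next to its counterpart implied by
# the other parameterization.
print(sigma2, variance_from_nu(mu_b, nu))
print(nu, nu_from_variance(mu_b, sigma2))
```

The two values printed on each line come out close to one another. The same relation explains why the γ coefficients change sign between Tables 2 and 3: a larger ν corresponds to a smaller variance.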

9. Conclusions

This paper proposes mean and variance parameterizations of the tilted beta binomial distribution, which include as particular cases the mean and variance beta rectangular binomial and the mean and variance beta binomial distributions. These parameterizations improve on the parameter interpretation of the distributions defined from the mean and “dispersion” (ν = p + q) parameterization of the beta distribution. From the new parameterized distributions, new linear regression models for overdispersed binomial datasets are proposed, in which the mean and variance of the beta distribution, the mean of the tilted distribution and the mixture parameter follow regression structures. These models were fitted to the school absenteeism dataset and to the seed germination dataset, in which the germination rate depends on the seed and root types, by applying Bayesian methods and using the OpenBUGS software. The models show good performance and a clear interpretation of their regression parameters.

Many extensions of these models can be proposed. One possibility is to use maximum likelihood methods to fit the proposed models. Additionally, following Simas et al. (2010), the tilted beta and beta-binomial nonlinear regression models can be formulated by assuming nonlinear regression structures for the mean and variance of the beta distribution component in the mixture.

According to Hahn (2022), the tilted beta-binomial distribution is clearly more appropriate than the beta-binomial distribution to analyze datasets with overdispersion. Thus, taking into account the better interpretability of the mean and variance, the tilted beta regression models can be proposed as good options for analyzing these types of overdispersed success/failure count datasets.

In many areas of knowledge there is a wide range of applications with random variables of interest of the success/failure type, where the tilted mean and variance beta regression models can be applied. Such applications are also possible extensions of this work, of interest to researchers and students of statistics, as is the development of statistical packages for fitting the proposed models.

TABLES

Table 1

Parameter estimates of the μtμbσ2θ-tilted beta binomial model (Model 1) and the μσ2-beta binomial model (Model 2) in the analysis of the school absenteeism dataset

              Model 1                                    Model 2
Param.        Mean (S.D.)        95% Cred. Int.          Mean (S.D.)        95% Cred. Int.
β0            −1.985 (0.149)     (−2.25, −1.683)         −1.976 (0.142)     (−2.250, −1.693)
β3            −0.531 (0.111)     (−0.736, −0.335)        −0.533 (0.107)     (−0.743, −0.323)
β4            −0.490 (0.198)     (−0.821, −0.117)        −0.501 (0.192)     (−0.869, −0.121)
γ0            −3.363 (0.261)     (−3.836, −2.76)         −3.344 (0.264)     (−3.833, −2.811)
γ2            −0.984 (0.402)     (−1.654, −0.325)        −1.007 (0.396)     (−1.771, −0.207)
g0            −11.450 (5.430)    (−23.840, −4.152)       - - -              - - -
d0            −0.951 (10.480)    (−21.460, 19.720)       - - -              - - -

DIC (Dhat)    −372.7 (−383.2)                            −372.3 (−383.1)
SSE           0.6264                                     0.6259

Table 2

Parameter estimates of the μtμbσ2θ-tilted beta binomial regression models in the analysis of the seed germination dataset

              Model 3                                    Model 4
Param.        Mean (S.D.)        95% Cred. Int.          Mean (S.D.)        95% Cred. Int.
β0            −0.822 (0.266)     (−1.404, −0.336)        −0.774 (0.264)     (−8.003, −0.960)
β1            0.466 (0.238)      (−0.001, 0.935)         0.374 (0.264)      (−0.102, 0.922)
β2            1.040 (0.247)      (0.559, 1.540)          1.021 (0.248)      (0.528, 1.515)
γ0            −3.436 (1.014)     (−5.891, −1.715)        −3.864 (0.969)     (−6.219, −2.422)
γ1            −1.832 (1.847)     (−5.930, 1.358)         - - -              - - -
c0            −3.552 (1.979)     (−7.780, −0.058)        −3.903 (1.841)     (−8.003, −0.960)
c1            −1.600 (2.596)     (−7.243, −1.358)        - - -              - - -
μt            0.498 (0.092)      (0.005, 0.651)          0.4922 (0.092)     (0.347, 0.652)

DIC           122.4                                      122.3
SSE           420.205                                    405.124

Table 3

Parameter estimates of the tilted μν-beta binomial regression models in the analysis of the seed germination dataset

              Model 5                                    Model 6
Param.        Mean (S.D.)        95% Cred. Int.          Mean (S.D.)        95% Cred. Int.
β0            −0.814 (0.392)     (−1.429, −0.268)        −0.759 (0.252)     (−1.292, −0.307)
β1            0.444 (0.253)      (−0.053, 0.934)         0.384 (0.258)      (−0.097, 0.923)
β2            1.057 (0.3632)     (0.5149, 1.61)          1.016 (0.2484)     (0.5147, 1.49)
γ0            3.388 (1.112)      (1.561, 6.10)           3.690 (0.894)      (2.331, 5.865)
γ1            1.742 (2.041)      (−1.745, 6.507)         - - -              - - -
c0            −3.399 (2.180)     (−8.018, 0.547)         −3.803 (1.818)     (−7.811, −0.9436)
c1            −1.663 (2.533)     (−6.800, 3.169)         - - -              - - -
μt            0.485 (0.092)      (0.345, 0.482)          0.491 (0.094)      (0.346, 0.651)

DIC           123.2                                      123.1
SSE           411.233                                    408.215

References
  1. Cepeda-Cuervo E (2001). Modelagem da Variabilidade em Modelos Lineares Generalizados (Unpublished Ph.D. thesis), Mathematics Institute, Universidade Federal do Rio de Janeiro.
  2. Cepeda-Cuervo E (2015). Beta regression models: Joint mean and variance modeling. Journal of Statistical Theory and Practice, 9, 134-145.
  3. Cepeda-Cuervo E (2023). μσ2-beta and μσ2-beta binomial regression models. Revista Colombiana de Estadística, 46, 63-79.
  4. Cepeda-Cuervo E and Cifuentes-Amado MV (2017). Double generalized beta-binomial and negative binomial regression models. Revista Colombiana de Estadística, 40, 141-163.
  5. Cepeda-Cuervo E and Cifuentes-Amado MV (2020). Tilted beta binomial linear regression model: A Bayesian approach. Journal of Mathematics and Statistics, 16, 1-8.
  6. Collett D (1991). Modelling Binary Data, Chapman & Hall, London.
  7. Cox D (1983). Some remarks on overdispersion. Biometrika, 70, 269-274.
  8. Crowder MJ (1978). Beta-binomial ANOVA for proportions. Applied Statistics, 27, 34-37.
  9. Efron B (1986). Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association, 81, 709-721.
  10. Ferrari S and Cribari-Neto F (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799-815.
  11. García Pérez J, López Martín MM, García García C, and Sánchez Granero MÁ (2016). Project management under uncertainty beyond beta: The generalized bicubic distribution. Operations Research Perspectives, 3, 67-76.
  12. Hahn E (2008). Mixture densities for project management activity times: A robust approach to PERT. European Journal of Operational Research, 188, 450-459.
  13. Hahn E (2022). The tilted beta-binomial distribution in overdispersed data: Maximum likelihood and Bayesian estimation. Journal of Statistical Theory and Practice, 16, 43.
  14. Hahn E and López Martín M (2015). Robust project management with the tilted beta distribution. SORT, 39, 253-272.
  15. Hinde J and Demetrio C (1998). Overdispersion: Models and estimation. Computational Statistics & Data Analysis, 27, 151-170.
  16. Jorgensen B (1997). Proper dispersion models. Brazilian Journal of Probability and Statistics, 11, 89-128.
  17. Quine S (1975). Achievement orientation of Aboriginal and White Australian adolescents (Ph.D. thesis), Australian National University, Australia.
  18. Simas AB, Barreto-Souza W, and Rocha AV (2010). Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis, 54, 348-366.
  19. Spiegelhalter DJ, Thomas A, Best N, and Lunn D (2003). WinBUGS Version 1.4, MRC Biostatistics Unit, Cambridge, UK.
  20. Udoumoh EF, Ebong DW, and Iwok IA (2017). Simulation of project completion time with Burr XII activity distribution. Asian Research Journal of Mathematics, 6, 1-14.
  21. Williams D (1982). Extra-binomial variation in logistic linear models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31, 144-148.