TEXT SIZE

search for



CrossRef (0)
The skew- censored regression model: parameter estimation via an EM-type algorithm
Communications for Statistical Applications and Methods 2022;29:333-351
Published online May 31, 2022
© 2022 Korean Statistical Society.

Victor H. Lachos1,a, Jorge L. Bazánb, Luis M. Castrocde, Jiwon Parka

aDepartment of Statistics, University of Connecticut, USA;
bDepartment of Applied Mathematics and Statistics, Universidade of São Paulo, Brazil;
cDepartment of Statistics, Pontificia Universidad Católica de Chile, Santiago, Chile;
dMillennium Nucleus Center for the Discovery of Structures in Complex Data, Santiago, Chile;
eCentro de Riesgos y Seguros UC, Pontificia Universidad Cat´ olica de Chile, Santiago, Chile
Correspondence to: 1 Department of Statistics, University of Connecticut, 215 Glenbrook Rd. U-4120, Storrs CT-06269, USA. E-mail: hlachos@uconn.edu
Received October 19, 2021; Revised February 24, 2022; Accepted April 19, 2022.
 Abstract
The skew-t distribution is an attractive family of asymmetrical heavy-tailed densities that includes the normal, skew-normal and Student’s-t distributions as special cases. In this work, we propose an EM-type algorithm for computing the maximum likelihood estimates for skew-t linear regression models with censored response. In contrast with previous proposals, this algorithm uses analytical expressions at the E-step, as opposed to Monte Carlo simulations. These expressions rely on formulas for the mean and variance of a truncated skew-t distribution, and can be computed using the R library MomTrunc. The standard errors, the prediction of unobserved values of the response and the log-likelihood function are obtained as a by-product. The proposed methodology is illustrated through the analyses of simulated and a real data application on Letter-Name Fluency test in Peruvian students.
Keywords : censored regression, EM-type algorithm, kurtosis, truncated moments, skewness, skew-t distribution
1. Introduction

Estimation of the censored regression (CR) model, or the Tobit model, has become quite common in the literature. However, as the Tobit model is based on normally distributed errors (N-CR), the maximum likelihood (ML) estimator is inconsistent if the underlying errors are not normally distributed. This inconsistency in the Tobit model led to the development of less sensitive estimators to the assumption of normality. Several authors have studied CR models involving response variables with heavier tails than the normal distribution in recent years. For instance, Arellano-Valle et al. (2012) and Massuia et al. (2015) have studied CR models based on the Student’s-t distribution (T-CR). They demonstrated the robustness aspects of the T-CR model against outliers through extensive simulations by using the Expectation-Maximization (EM) algorithm, which is based on the first two moments of the truncated Student’s-t distribution. However, the T-CR model is not appropriate when the data simultaneously present skewness and heavy tails.

Recently, Massuia et al. (2017) have established a new link between the CR model and asymmetrical heavy tails distributions by using the class of scale mixtures of skew-normal (SMSN) distributions (SMSN-CR), which allows capturing, simultaneously, skewness and kurtosis and contains, as special cases, the normal (N), Student’s-t (T), skew-t (ST) and skew-normal (SN) distributions. Under the Bayesian paradigm, an efficient Markov chain Monte Carlo (MCMC) is introduced to carry out posterior inference. The method is implemented in the R package BayesCR (Garay et al., 2017b). It is important to stress that the Massuia et al. (2017b)’s approach does not need the computation of the truncated moments of the SMSN family of distributions for implementing the estimation algorithm. The proposed MCMC procedure only needs to sample from a truncated normal distribution, conditional on the latent variables considered for each member of the SMSN class.

More recently and from the likelihood-based inference viewpoint, Mattos et al. (2018) proposed an efficient Monte Carlo EM (MCEM) algorithm to compute the ML estimates of the SMSN-CR model based on the stochastic approximation of the EM (SAEM) algorithm. However, by its nature, the SAEM algorithm is an expensive proposal, due to the combination of Monte Carlo simulations and other iterative procedures, which make this algorithm challenging to assess the convergence and time consuming. For this reason, our paper proposes an alternative analytically simple EM-type algorithm for computing ML estimates of the skew-t censored regression (ST-CR) model. We show that the E-step reduces to computing the first two moments of a certain truncated skew-t (TST) distribution. The general formulas for these moments were recently derived by Lachos et al. (2020), for which we will use the MomTrunc R package (Galarza et al., 2020). The likelihood function is easily computed as a byproduct of the E-step and is used for monitoring convergence and for model selection. Furthermore, we consider a general information-based method for obtaining the asymptotic covariance matrix of the ML estimate. Our proposal has two advantages over the existing ones (i.e., MCEM and SAEM algorithms). The first is that our EM-type algorithm is exact and does not require approximations at the E and M steps. The second one is that our approach is less time-consuming with the same precision (in terms of point estimation) related to its competitors.

The rest of the paper is organized as follows. In Section 2 we introduce some notation and outline the main results related to the SN, ST and TST distributions. In Section 3, the ST censored regression model (ST-CR) and related likelihood-based inference are presented, including the implementation of an EM-type algorithm called the Expectation/Conditional Maximization Either (ECME) algorithm (Liu and Rubin, 1994) for obtaining ML estimates of the parameters. Section 4 presents a simulation study to illustrate the performance of the proposed method. Section 5 discusses an application using a real data application on Letter-Name Fluency (LNF) test in Peruvian students, which is a standardized, individually administered test that provides a measure of Letter-Name Knowledge (LNK) and spelling abilities. Finally, Section 6 concludes with some discussion and possible directions for future research.

2. Notation and background

Throughout this paper, Np(μ, ) denotes the p-variate normal distribution with mean vector μ and covariance matrix ; and φp(· | μ, ) and Φp(· | μ, ) denote its probability density function (pdf) and cumulative distribution function (cdf), respectively. When p = 1 we drop the index p. In this case, if μ = 0 and σ2 = 1, we write φ(·) for the pdf and Φ(·) for the cdf. In the same way, Tp(μ, , ν) denotes the p-variate Student’s-t distribution with mean vector μ ∈ ℛp, scale matrix (a positive definite matrix) and degrees of freedom ν ∈ (0, ∞); and tp (· | μ, , ν) and Tp(· | μ, , ν) denote its pdf and cdf, respectively. Again, when p = 1 we drop the index p. In this case, if μ = 0 and σ2 = 1, we write t(· | ν) for the pdf and T(· | ν) for the cdf. Finally, Gamma(ν/2, ν/2) denotes the Gamma distribution with scale and shape parameters equal to ν/2.

We start defining the skew-normal (SN) distribution and then we introduce some useful properties. As defined by Azzalini (1985), a random variable Z has a skew-normal distribution with location parameter μ ∈ ℛ, scale parameter σ2 ∈ (0, ∞) and skewness parameter λ ∈ ℛ, denoted by Z ~ SN(μ, σ2, λ), if its pdf is given by φSN(z | μ, σ2, λ) = 2φ(z | μ, σ2)Φ(λ(zμ)). We denote the cdf of Z by ΦSN(· |μ, σ2, λ). If μ = 0 and σ2 = 1, we use φSN(· | λ) and ΦSN(· | λ) for the pdf and cdf, respectively.

As proved by Azzalini and Dalla Valle (1996, Eqn. 2.11), the cdf of a skew-normal random variable is given by

ΦSN(zμ,σ2,λ)=2Φ2(z-μσe1|0,Σ),

where e1 = (1, 0) and

Σ=(1-δ-δ1),         with   δ=λ1+λ2.

If Z ~ SN(μ, σ2, λ), then a convenient stochastic representation is given by

Z=μ+ΔT+Γ12T1,

where Δ = σδ, Γ = (1 − δ2)σ2, T = |T0|, and T0 and T1 are independent standard normal random variables. Here, | · | denotes the absolute value. It is important to note that this stochastic representation is useful to generate random samples, obtaining moments as well as to derive other interesting properties.

The next definition introduces the stochastic representation of a skew-t random variable.

Definition 1

Let Z ~ SN(0, σ2, λ) and U ~ Gamma(ν/2, ν/2) assuming that Z and U are independent. We say that the distribution of Y = μ + U−1/2Z is a skew-t distribution with location parameter μ ∈ ℛ, scale parameter σ2 ∈ (0, ∞), shape parameter λ ∈ ℛ and degrees of freedom ν ∈ (0, ∞). We use the notation Y ~ ST(μ, σ2, λ, ν).

From Definition 1, the density of Y takes the following form (Azzalini and Capitanio, 2003)

φST(yμ,σ2,λ,ν)=2t(yμ,σ2,ν)T((v+1d+ν)12A|ν+1),

where A = λ(yμ) and d = (yμ)22. Some particular cases of the skew-t distribution are the skew-Cauchy distribution (ν = 1) and the Student’s-t distribution (λ = 0). Also, when ν → ∞, the skew-normal distribution arises as a limit case. Moreover, from Definition 1, the conditional distribution of Y given U is

YU=u~SN(μ,u-1σ2,λ),         U~Gamma(ν2,ν2).

Thus, from (2.1), we obtain the following expression for the cdf of a skew-t random variable:

ΦST(yμ,σ2,λ,ν)=2E[Φ2(U12y-μσe1)|0,Σ]=2E [P(XU12y-μσe1|U)]=2P(XU12y-μσe1)=2T2(y-μσe1|0,Σ,ν),

where e1 = (1, 0), X ~ N2(0, ) and is given in (2.2).

In addition, from Basso et al. (2010), we have the following useful proposition related to a skew-t random variable.

Proposition 1

Suppose Y ~ ST(μ, σ2, λ, ν). Then, for r = 1, 2, …

  • Y | T = t, U = u ~ N (μ + Δt, u−1Γ), T | U = u ~ TN⌊0, +∞⌋ (0, u−1), U~Gamma(ν2,ν2),

  • T | Y = y, U=u~TN0,+(μT,u-1MT2),

  • γr=E[UrY=y]=2rΓ(ν+1+2r2)(v+d)-rΓ(ν+12)T((ν+1+2rν+d)12Aν+1+2r)T((v+1d+ν)12Aν+1),

  • τr=E[Ur2φ(U12A)Φ(U12A)Y=y]=2r-12Γ(ν+1+r2)π12Γ(ν+12)(ν+d+A2)ν+1+r2(ν+d)ν+12T((ν+1d+ν)12Aν+1),

  • E[U T | Y = y] = γ1μT + MT τ1and E[UT2Y=y]=γ1μT2+MT2+μTMTτ1,

where MT2=Γ/(Γ+Δ2), μT = {Δ/(Γ+Δ2)}(yμ), A = λ(yμ)/σ, d = (yμ)2/σ2and TNa,b(μ, σ2) denotes the truncated normal distribution in the intervala, b. Herea, bmeans that each extreme of the interval can be either open or closed.

Now, we introduce a key concept in our development, namely the truncated skew-t (TST) distribution.

Definition 2

Let Y ~ ST(μ, σ2, λ, ν), with P(a < Y < b) > 0 for some fixed a < b. A random variable X has a TST distribution in the intervala, b⌋, denoted by X ~ TSTa,b(μ, σ2, λ, ν), if it has the same distribution as Y|Y ∈ ⌊a, b⌋.

As an obvious consequence of Definition 2, we have that the pdf of X ~ TSTa,b(μ, σ2, λ, ν) is given by:

φTST(xμ,σ2,λ,ν;a,b)=φST(xμ,σ2,λ,ν)ΦST(bμ,σ2,λ,ν)-ΦST(aμ,σ2,λ,ν)Ia,b(x),

where denotes the indicator function, that is, if yB and otherwise.

An interesting property of the TST distribution is that it is a location-scale family. Indeed, let X ~ TSTα,β(0, 1, λ, ν), then Y = μ + σX has a TSTa,b(μ, σ2, λ, ν) distribution, where a = μ + σα and b = μ + σβ. Consequently, for computing moments of Y, it is enough to compute the moments of X. Thus, the nth moment of Y is given by

E [Yn]=k=0nn!(n-k)!k!σkμn-kE [Xk],         for   n=1,2,3....

Lachos et al. (2020) provide explicit expressions for the two first moments of the TST distribution. Let X ~ TSTa,b(0, 1, λ, ν), with a < b. Then,

E[X]=τ(a,b)[L(1){Φ(-0.5,bλ)-Φ(-0.5,aλ)}-{φSN(-0.5,b)-φSN(-0.5,a)}],E[X2]=τ(a,b)[{ΦSN(-1,b)-ΦSN(-1,a)}-L(2){Eφ(-1,bλ)-φ(-1,aλ)}-{bφSN(-0.5,b)-aφSN(-0.5,a)}],

where τ (a, b) = {ΦST (b|λ, ν) − ΦST (a|λ, ν)}−1, aλ = a(1 + λ2)1/2, bλ = b(1 + λ2)1/2, L(s) = (2/π)1/2λ/(1 + λ2)s/2, and

φSN(r,q)=2r+1νν2Γ(ν+2r2)2πΓ(ν2)(q2+ν)ν+2r2T((2r+νq2+ν)12λq|2r+ν),ΦSN(r,q)=2r+1Γ(ν+2r2)Γ(ν2)νrT2((2r+νν)12qe1|0,Σ,2r+ν),Φ(r,q)=Γ(ν+2r2)Γ(ν2)(2ν)rT((2r+νν)12q|2r+ν),φ(r,q)=Γ(ν+2r2)Γ(ν2)2π(ν2)ν2(q2+ν2)-ν+2r2,

with given in (2.2). The first two moments of Y = μ + σX are available through the MomTrunc R package (Galarza et al., 2020). So far, this is the unique method to compute the moments of the TST, among others truncated skewed (multivariate) distributions.

3. The skew-t censored linear regression model

Let us consider a linear regression model where the responses are observed with errors which are independent and identically distributed (iid) according to some ST distribution, as follows:

Yi=xiβ+σɛi,         ɛiiidST(0,1,λ,ν),   i=1,,n,

where Yi, i = 1, …, n are the responses, β = (β1, …, βp) is a vector of regression parameters and xi=(xi1,,xip) is a vector of covariates, such that xi j is the value of the jth explanatory variable for subject i. Under this setup, we have that YiindST(xiβ,σ2,λ,ν), i = 1, …, n. To facilitate the mathematical derivations, we consider the case where left-censored observations can occur, that is, the observations are of the form:

Yobsi={κi,if Yiκi,Yi,if Yi>κi,

i = 1, …, n, for some threshold point κi. The model defined in (3.1) and (3.2) is called the skew-t linear censored regression (ST-CR) model (Massuia et al., 2017; Mattos et al., 2018), for further details. Note that the right censored problem, as defined in the Introduction section, can be represented by a left censored problem by transforming the response Yobsi to −Yobsi.

3.1. Parameter estimation via an EM-type algorithm

In what follows, we use the traditional convention denoting a random variable by an upper case letter and its realization by the corresponding lower case. Let θ = (β, σ2, λ, ν) be the vector with all parameters of the ST-CR model. Supposing that are m censored values of the characteristic of interest, we can partition the observed sample yobs in two subsamples of m censored and nm uncensored values, such that yobs = {κ1, …, κm, ym+1, …, yn}. Then, the log-likelihood (log-like) function is given by

(θyobs)=log (i=1n[ΦST(κixiβ,σ2,λ,ν)]Ii[φST(yixiβ,σ2,λ,ν)]1-Ii)=i=1mlog [ΦST(κixiβ,σ2,λ,ν)]+i=m+1nlog [φST(yixiβ,σ2,λ,ν)],

where if yiκi and otherwise.

To estimate the parameters of the ST-CR model, an alternative is to maximize its log-likelihood function directly. However, this procedure can be quite cumbersome. Mattos et al. (2018) propose to compute the ML estimates by using SAEM algorithm. However, by its nature, MCEM is an expensive proposition, due to the combination of Monte Carlo simulation and other iterative procedures. Alternatively, in this work, we propose a simple EM-type algorithm (Dempster et al., 1977) to obtain the ML estimates. To implement the EM algorithm, we need a representation of the model in terms of missing data. In the case of censoring, we can consider the unobserved yi as a realization of the latent unobservable variable Yi~ST(xiβ,σ2,λ,ν), i = 1, …, m. Here, the key is to consider the augmented data {yobs, y1, …, ym, u1, …, un, t1, …, tn}, that is, we treat the problem as if yL = (y1, …, ym) were in fact observed. As a consequence, we can use the representation (2.4) to obtain the complete-data log-likelihood, given as

c(θyobs,yL,u,t)=C-n2log Γ+i=1nlog ui-12Γi=1nui(yi-xiβ-Δti)2+i=1nlog h(uiν),

where C is a constant that does not depend on the parameter of interest θ, u = (u1, …, un), t = (t1, …, tn) and h(· |ν) is the gamma density with scale and shape parameters equal to ν/2.

In what follows, the superscript (k) indicates the estimate of the related parameter at the stage k of the algorithm. In the E-step of the algorithm, we must obtain the so-called Q-function:

Q(θθ(k))=Eθ(k)[c(θYobs,YL,U,T)yobs],

where Eθ(k) means that the expectation is obtained using θ(k) instead of θ. Observe that the expression of the Q-function is completely determined by the knowledge of the expectations

rsi(θ(k))=Eθ(k)[UiTirYisyobsi],         r,s=0,1,2.

Thus, ignoring constants that do not depend on the parameter of interest, the Q-function can be written in a synthetic form as follows:

Q(θθ(k))=-n2log Γ-12Γi=1n[02i(θ(k))-201i(θ(k))xiβ+00i(θ(k))(xiβ)2+Δ220i(θ(k))-2Δ11i(θ(k))+2Δ10i(θ(k))xiβ]+i=1nEθ(k)[log h(Uiν)yobsi].

The M-step consists of maximization of Q(θ | θ(k)) with respect to θ. To do so, we resort to a faster extension of the original EM called the ECME algorithm (Liu and Rubin, 1994), replacing the M step with a sequence of conditional maximization (CM) steps and maximizing the actual log-likelihood function with respect to ν. Due to the CM step below, it is not necessary to compute the expectations Eθ(k) [log h(Ui|ν) | yobsi]. Thus, depending whether if the observation is censored or not and by using known properties of conditional expectation, the expectations involved in the Q-function will take specific, analytic and closed forms as follows:

  • Uncensored observation case. In this case, Yobsi=Yi~ST(xiβ,σ2,λ,ν) and from Proposition 1 (see also Basso et al., 2010), we have that

    rsi(θ(k))=yisEθ(k)[UiTiryi],         s,r=0,1,2,

    with

    Eθ(k)[Uiyi]=γ1(θ(k),yi),Eθ(k)[UiTiyi]=γ1(θ(k),yi)μTi(k)+MT(k)τ1(θ(k),yi)

    and

    Eθ(k)[UiTi2yi]=γ1(θ(k),yi)μTi2(k)+MT2(k)+μTi(k)MT(k)τ1(θ(k),yi),

    where MT2(k),μTi(k), γ1(θ(k), yi) and τ1(θ(k), yi) as defined in Proposition 1 with appropriate substitutions.

  • Censored observation case. Here, we have that Yobsi = κi if and only if Yiκi. Then,

    rsi(θ(k))=Eθ(k)[UiTirYisYiκi]=Eθ(k)[YisE [Ui[TirUi,Yi]Yi]Yiκi],         r,s,=0,1,2,

    where the conditional expectation in the second line of (3.4) is true because, if yi were available, then it would be a realization of a ST(xiβ,σ2,λ,ν) distribution and the expectation Eθ(k) [Ui|Yi, Yiκi] = Eθ(k) [Ui|Yi] in (3.3).

Finally, the expectation in (3.4) can be easily obtained by using the following Proposition:

Proposition 2

For a censored observation i = 1, 2, …, m, the conditional momentsrsi(θ(k)), r, s = 0, 1, 2, are given by

00i(θ(k))=Eθ(k)[UiYiκi]=ΦST(κixiβ(k),σ2*(k),λ(k),ν+2)ΦST(κixiβ(k),σ2(k),λ(k),ν),0ri(θ(k))=Eθ(k)[UiYirYiκi]=ΦST(κixiβ(k),σ2*(k),λ(k),ν+2)ΦST(κixiβ(k),σ2(k),λ(k),ν)Eθ(k)[Wir],10i(θ(k))=Eθ(k)[UiTiYiκi]=Δ(k)Δ2(k)+Γ(k)(01i(θ(k))-00i(θ(k))xiβ(k))+Γ(k)Δ2(k)+Γ(k)c(v)WΦ(θ(k)),20i(θ(k))=Eθ(k)[UiTi2Yiκi]=(Δ(k)Δ2(k)+Γ(k))2(02i(θ(k))-201i(θ(k))xiβ(k)+(xiβ)200i(θ(k)))   +Γ(k)Δ2(k)+Γ(k)(Δ(k)Δ2(k)+Γ(k))c(ν)WΦ(θ(k))(Eθ(k)(Si)-xiβ(k))+Γ(k)Δ2(k)+Γ(k),11i(θ(k))=Eθ(k)[UiTiYiYiκi]=(Δ(k)Δ2(k)+Γ(k))(02i(θ(k))-01i(θ(k))xiβ(k))+Γ(k)Δ2(k)+Γ(k)WΦ(θ(k))Eθ(k)(Si),

where

WΦ(θ(k))=ΦST(κixiβ(k),σ**2(k),0,ν+1)ΦST(κixiβ(k),σ2(k),λ(k),ν),Wi~TST-,κi(xiβ(k),σ2*(k),λ(k),ν+2),         Si~TST-,κi(xiβ(k),σ**2(k),0,ν+1),σ2*=νν+2σ2(k),         σ**2(k)=ν(ν+1)(1+λ2(k))σ2(k),         and         c(ν)=2Γ(ν+12)Γ(ν2)ν(1+λ2)π.
Proof

The proof follows from the conditional expectation property given in (3.4) along with the conditional expectations given in Proposition 1.

Thus, the EM algorithm for the ST-CR model, defined in (3.1) and (3.2), can be summarized in the following way:

  • E-step: Given θ = θ(k). For i = 1, …, n.

    • - If the observation i is uncensored then, for r, s = 0, 1, 2, compute ℰrsi(θ(k)) given in (3.3);

    • - If the observation i is censored then, for r, s = 0, 1, 2, compute ℰrsi(θ(k)) in (3.4) by using Proposition 2.

  • CM-step: Update θ(k) by maximizing Q(θ | θ(k)) over θ, which leads to the following expressions

    β(k+1)=(i=1n00i(θ(k))xixi)-1i=1nxi(01i(θ(k))-Δ10i(θ(k))),Δ(k+1)=i=1n(11i(θ(k))-10i(θ(k))xiβ(k+1))i=1n20i(θ(k)),Γ(k+1)=1ni=1n[02i(θ(k))-201i(θ(k))xiβ(k+1)+00i(θ(k))(xiβ(k+1))2+Δ2(k+1)20i(θ(k))-2Δ(k+1)11i(θ(k))+2Δ(k+1)10i(θ(k))xiβ(k+1)],

  • CML-step: Update ν(k) by maximizing the actual marginal log-likelihood function, obtaining

    ν(k+1)=argmaxν{i=1nlog[ΦST(κixiβ(k+1),σ2(k+1),λ(k+1),ν)]+i=m+1nlog φST(yixiβ(k+1),σ2(k+1),λ(k+1),ν)}.

This process is iterated until some distance involving two successive evaluations of the actual log-likelihood (θ|yobs), like |(θ(k+1)|yobs) − (θ(k)|yobs)| or |(θ(k+1)|yobs)/ℓ(θ(k)|yobs) − 1|, is small enough. The optimization step related to ν can be easily accomplished by using the optim routine in the R software after a double integration procedure. Note that σ2(k) and λ(k) can be obtained from Δ(k) and Γ(k), by considering

σ2(k)=Γ(k)+Δ2(k)         and         λ(k)=Δ(k)Γ(k).

Assuming complete data, i.e., ignoring censoring, we used ordinary least squares (OLS) estimators as initial estimates of β(0) and the moment estimator for σ2(0) by using the R function lm(). For λ(0), we considered twice the signal of the skewness coefficient of the residuals, and finally, ν(0) was fixed at 3.

3.2. Model selection

To select an appropriate model from the candidates, we adopted the Akaike information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978), the consistent AIC (CAIC) (Bozdogan, 1987) and the Hannan-Quinn information criterion (HQIC) (Burnham and Anderson, 2002).

Like the more popular AIC and BIC, which are the two most widely used model selection indices based on penalized likelihood and applicable for both nested and non-nested models, the model selection criteria considered in this work have the form −2(θ̂) + ρcn, where θ̂ is the ML estimate obtained via the EM algorithm, (θ) = (θ|yobs) is the actual log-likelihood, ρ is the number of free parameters that have to be estimated in the model and the penalty term cn is a convenient sequence of positive numbers that depends on the sample size n. Specifically we have AIC = −2(θ̂) + 2ρ, BIC = −2(θ̂) + ρ log(n), HQIC = −2(θ̂) + 2ρ log(log(n)) and CAIC = −2(θ̂) + ρ(log(n) + 1). A lower AIC, BIC, HQIC or CAIC value indicates that a closer fit of the model to the data.

3.3. Approximate standard errors

The standard errors of the ML estimates can be approximated by the inverse of the observed information matrix. Unfortunately, in our case, there is no a closed-form available for this matrix. Thus, we consider the same strategy used by Garay et al. (2017a) to get approximate standard errors of the parameter estimates based on the empirical information matrix by assuming ν known. Let Yobs be the vector of observed data. Then, considering θ = (β, σ2, λ), Ycomp = (Yobs, YL, U, T) and relations described in the Equation (3.5), the empirical information matrix is defined as

Ie(θyobs)=i=1ns(yobsiθ)s(yobsiθ)-1nS(yobsθ)S(yobsθ),

where S(yobsθ)=i=1ns(yobsiθ). It can be noted from the result of Louis (1982) that the individual score can be determined as

s(yobsiθ)=Qi(θθ(k))θ,         i=1,,n.

Thus, substituting the ML estimates of θ in (3.6), the empirical information matrix Ie(θ|yobs) is reduced to

Ie(θ^Yobs)=i=1ns^isi,

where s^i=(s^βi,s^σi2,s^λi) is an individual score vector and

s^βi=1+λ^2σ^2(xi01i(θ^)-00i(θ^)xixiβ^-σ^λ^1+λ^2xi10i(θ^)),s^σi2=-12σ^2+1+λ^22σ^4(02i(θ^)-201i(θ^)xiβ^+00i(θ^)(xiβ^)2)-λ^1+λ^22σ^3(11i(θ^)-10i(θ^)xiβ^),s^λi=λ^1+λ^2-λ^σ^2(02i(θ^)-201i(θ^)xiβ^+00i(θ^)(xiβ^)2)+1+2λ^2σ1+λ^2(11i(θ^)-10i(θ^)xiβ^)-λ^20i(θ^),

for i = 1, …, n, where the conditional expectations rsi(ω(k))=Eθ(k)[UiTirYisYiκi], r, s = 0, 1, 2, can be easily obtained directly from the proposed EM algorithm.

4. Simulation studies

In this section, we present three simulations studies. In the first one, we study the performance of EM estimates under different censoring proportions. The second simulation study investigates whether the model comparison measures, AIC, BIC, CAIC, and HQIC determine the best-fitting model to the simulated data. The third study compares the performance of ML estimates obtained through EM and SAEM algorithms. For each scenario, 100 Monte Carlo samples were generated, and the data were artificially generated from the ST-CR model defined in (3.1) and (3.2), with xiT=(1,xi1,xi2), such that xi1 ~ U(1, 5) and xi2 ~ U(−1, 1), for i = 1, …, n.

4.1. Performance of EM estimates

This first simulation study is built to analyze the impact of the censoring level on the estimates of the ST-CR model. We consider three censoring levels, say, 5%, 10% and 20%. In a first part of this simulation study, each Monte Carlo sample is composed of a n = 1000 random draws from the ST-CR model, with the following parameters: β = (β1, β2, β3)T = (1, 2, 3)T, σ2 = 1, λ = −2 and ν = 4. From the generated data, in order to define the censored cases, we follow two steps: First, we sorted the observations in ascending order, and then the cases were defined as censored if the percentage of the observations below the fixed threshold was equal to the corresponding censoring level.

We computed the average values (MC Mean) and standard deviations (MC sd) across the estimates of the 100 Monte Carlo samples. Also, the average values of the approximate standard errors of the EM estimates were computed through the method described in Subsection 3.3. Table 1 shows the results for different censoring levels. We can see from this table that the EM-type algorithm recovers the original values of the parameters for all levels of censoring, closely. We also observe from this table that the estimates of the standard errors (IM SE) and MC standard deviations (MC sd) give relatively close results, showing that the proposed asymptotic approximation for the variance of the EM estimates is reliable.

The second part of this study is devoted to analyze the finite sample properties of the EM estimates under the same parameters setting. In this case, we fix the censoring level increasing the sample size, say, n = 200, 500, and 1000 for each Monte Carlo sample. Figure 1 shows boxplots of the EM estimates under the ST-CR model with censoring rate of 10%. We can see that the increased sample size corresponds to a decreasing bias and variability of the parameter estimates revealing that the ML estimates obtained via the proposed EM algorithm have consistent asymptotic properties. This tendency remains when we change the censoring rate.

4.2. Regression models comparison

In this simulation scheme, we compare the performance of some classic model comparison criteria to select the appropriate model among four different right-censored regression models, namely, the normal (N-CR), Student’s-t (T-CR), skew-normal (SN-CR) and skew-t (ST-CR) models. In order to enhance the reliability of the model selection, we used well-known criteria, AIC, BIC, CAIC, and HQIC, explained in Subsection 3.2 to select the proper model. Table 2 shows the arithmetic average of these comparison measures. Note that all the measures favor the ST-CR model, indicating that the ST-CR model is generally more robust to deviations from the model assumptions and fits better than other candidates when neither is the true generating model. Figure 2 represents the AIC, BIC, CAIC, and HQIC values for each sample and model with a left-censoring level of 10%. From this figure, we can see that in 100% of the cases all the model selection criteria select the ST-CR model, as expected.

4.3. Comparison between SAEM and EM algorithms

In this simulation study, we compare the ML estimates obtained via the proposed EM-algorithm and SAEM (Mattos et al., 2018). The parameter setting is the same as the previous simulation with censoring level 20% and the sample size for each Monte Carlo sample is n = 500. The initial values of the EM algorithm were obtained as discussed in Subsection 3.1 with a maximum number of iterations max.iter = 400. In addition, we chose the parameters m = 20 (Monte Carlo sample size), c = 0.3 (cut-off point) and S = 400 (maximum number of iterations) for the SAEM implementation. The results are given in Table 3, where we can see that the ML estimates of the parameters obtained through the EM and the SAEM algorithms are close, as expected. However, the SAEM algorithm requires, in average, 3 times longer that the EM algorithm to reach the convergence. Consequently, we can conclude that our proposal provides the same accuracy as other competitors but with less computational time for obtaining ML estimates.

5. Application

LNK has been identified to be an important predictor of learning to read, spelling abilities, phonological awareness and intelligence (see, for example, Foulin (2005), Ritchey and Speece (2006) and references there in). However, not only LNK is considered as a predictor of spelling achievements, but also the speed of children in letter naming. This is another letter-name related skill closely associated to reading achievement (Cronin and Carver, 1998). A frequently used procedure considered by school teachers to measure LNK is based on LNF. In this case, teachers administer timed 1-minute fluency assessments to children, and then compare the results with established norms in order to determine how the students are performing in this task and if they are at risk for future academic problems. Observe that LNF is a continuous variable related to the average of letters read correctly in an interval of time and not a discrete variable.

In general, LNF can be considered a measure of risk (Marston and Magnusson, 1988) because it is highly predictive of later reading success. It is also included as an indicator for students with lower scores, who may require additional instructional support on their Early Literacy Skills (ELS). Additionally, LNF has been recognized as the best predictor of future reading and spelling abilities in children, and analyzing it, a benchmark can be obtained to determine the minimum level of satisfactory progress of spelling achievements by a student. However, LNF presents some challenges. Due to the time limit, some students may not complete the assessments while others will finish them before the set time. This situation generates that, if the teachers are interested in the average of the letters/words/sentences/paragraphs correctly read, the time limit restriction has to be taken into account. In other words, the students who finish the task in less than 1-minute could read more letters/words/sentences/paragraphs, and the reported mean corresponding to the correctly read objects could not be the real one. The situation described above corresponds to the typical case of right censored observations.

In this paper, we analyze LNF data set from the Early Grade Reading Assessment (EGRA), which is part of (RTI-FDA, 2008) instrument. Here, the response variable LNF is defined as the number of correctly letters read by the students in one minute and it presents right censored observations. Consequently, if we are interested in the mean of the LNF response for one group, this quantity could be underestimated due to the presence of censored observations. For that reason, a censored regression model able to take into account observation lying below or above a threshold could be more appropriated for estimating the true mean of the LNF response for different groups of interest.

Table 4 shows a summary of the response variable in the presence (Censored) and absence (Uncensored) of censoring. Note that 6.26% of the observations are censored, and the mean and the standard deviation of them are higher in comparison with uncensored observations. Therefore, the mean of 31.3 for the LNF response (letters correctly read per minute) showed in Table 4 seems to be underestimated. Further, we can observe some degree of right skewness and kurtosis on the response variable reveling a departure from the normal distribution and then other distribution must be considered for the data.

Additionally, three socio-demographic covariates for the students are considered in the LNF data, namely, the Zone where the respondent lives (0 = Urban, 1 = Rural), Grade (0 = 2nd grade, 1 = 3rd grade) and Gender (0 = Male, 1 = Female). Note that the proportions for rural zone, 3rd grade and female on the censored and uncensored groups are similar. Since the explanatory variables are all dummy variables, we expect to understand the effects of these covariates on the LNF response.

To investigate the behavior of the LNF in the Peruvian sample, and taking into account the censoring effect, we fit four different right-censored regression models, namely, the normal (N-CR), Student’s-t (T-CR), skew-normal (SN-CR) and skew-t (ST-CR) models. Particularly, the proposed censored model is given by:

Yi=β0+β1Zonei+β2Gradei+β3Genderi+σɛi,         i=1,,n,

where the error term εi is independent for all i = 1, …, n and follows some of the distributions previously mentioned with location and scale parameters equal to 0 and 1, respectively.

Table 5 compares the fit of the four proposed models using the model selection criteria discussed in Subsection 3.2. Considering these results, we observe that the ST-CR model outperforms all its competitors (N-CR, SN-CR and T-CR). This conclusion is also corroborated by Figure 3, where we present the plot of the Martingale-type residuals including a simulated envelope (Mattos et al., 2018) for the T-CR, SN-CR and ST-CR models (See also Figure 1). It can be seen that the skew-t distribution accommodates the observations in a better way than its competitors.

Table 6 reports the ML estimates for the parameters of the best two models according to the previous analyses, i.e., the SN-CR and ST-CR models, along with their corresponding standard errors calculated via the empirical information matrix (Subsection 3.3).

Using a tolerance of ε = 10−5 as stopping criterion, the algorithm attained convergence in 32 iterations and 2.58 seconds for the SN-CR model, and 27.61 seconds and 25 iterations for the ST-CR model, this on an Intel Core i7-6700, CPU @ 3.40GHz computer with 8 GB of RAM. The monotone convergence of the parameter estimates via the proposed EM algorithm is illustrated in Figure 4, where we can see that the log-likelihood increases in successive iterates of the EM algorithm.

Note from Table 6 that the covariate Gender can be considered non significant. The covariate Zone has a negative effect in favor of rural schools and covariate Grade has a positive effect in the students of the 3rd grade. As a conclusion, it can be stated that students from urban schools and the 3rd grade read correctly more letters than students from rural schools and the second grade.

By considering these results, we fitted an additional ST-CR model without the covariate Gender. The resulting fitted model is the following:

Y^i~ST(μ^=14.831-4.714Zonei+6.192Gradei,σ^2=377.164,λ^=2.709,v^=5.182),

i = 1, …, 511, where the model comparison criteria are: Log-Likelihood = −1995.335, AIC = 4002.671, BIC = 4028.089, CAIC = 4034.089 and HQ = 4012.635.

It is worth mentioning that, under this model, the mean of the number of correct letters read by the Peruvian students in the sample, under the censored model, is 32.29, while when the censoring mechanism is omitted, the mean is 31.2 That is, if censored is not taken into consideration for modeling the LNF, the mean of the number of correct letters can be sub estimated. Since this kind of test is used frequently, we recommend to incorporate the censoring pattern in order to estimate the statistics of Fluency conveniently. Additionally, the summary of the estimated response variable is Min=1 and Max = 122.81 where quartile 1 is 21, median is 29 and quartile 3 is 42. Values which can be used to propose different criteria to classify the LNF.

6. Conclusions

In this paper, a novel exact EM-type algorithm for skew-t censored linear regression model is developed. In contrast with previous developments (MCEM and MCMC algorithms), the proposed EM-type algorithm uses analytical expressions at the E-step, that rely on formulas of the mean and variance of a truncated skew-t distribution. These formulas have been developed by Lachos et al. (2020) and are available in the R package MomTrunc (Galarza et al., 2018). As an added benefit of the proposal, the EM likelihood sequence is monotonic and the difficulties in assessing convergence, which face Monte Carlo algorithms, are avoided. Furthermore, simulations studies and the analysis of the LNF dataset provides strong evidence about the implementation of the EM-type algorithm for fitting the ST-CR model. The method proposed in this paper is implemented in R, and the code is available for download from GitHub repository (https://github.com/hlachos/skewt-censored).

Finally, some extensions of the current work includes the multivariate ST-CR model and finite mixture of censored skew-t models (Azzalini and Capitanio, 1999; Lachos et al., 2017). An in-depth investigation of such extensions is beyond the scope of the present paper, but certainly an interesting topic for future research.

Figures
Fig. 1. Boxplot of the estimates of β, β, β, σ, λ and ν (line indicates the true value of the parameter) for the ST-CR model. Legend in (a): the first three boxplots correspond to the left-censoring level of c = 5% with n = 200, 500, 1,000, respectively. The boxplots for c = 10%, 20% are defined as similar way.
Fig. 2. AIC, BIC, CAIC and HQIC values for 100 samples and left-censoring level of 10%. Black curve: N-CR, red curve: T-CR, green curve: SN-CR and purple curve: ST-CR.
Fig. 3. LNF data: Martingale-type residuals under the (a) N-CR, (b) T-CR, (c) SN-CR and (d) ST-CR models. The envelope is obtained through simulation.
Fig. 4. LNF data: Plot of the log-likelihood vs. Iterations of the EM-type algorithm under the (a) SN-CR and (b) ST-CR models.
TABLES

Table 1

Results based on 100 simulated ST samples with n = 1000

Cens

5%10%20%
β0 (1)1.031.031.03
IM SE0.080.080.09
MC sd0.090.090.09

β1 (2)2.002.002.00
IM SE0.020.020.02
MC sd0.020.020.03

β2 (3)3.003.002.99
IM SE0.040.040.05
MC sd0.170.170.15

σ2(1)1.091.091.09
IM SE0.100.100.11
MC sd0.040.040.04

λ (−2)−2.21−2.21−2.18
IM SE0.280.280.29
MC sd0.370.360.31

ν (4)4.274.334.37
IM SE--
MC sd0.760.770.79

The reported values are the MC means, and the MC sd are the standard deviations from fitting the ST-CR with different settings of censoring proportions. IM SE is the average value of the approximate standard error obtained through the information-based method. Cens indicates the censoring rate.


Table 2

Model selection criteria for comparing N-CR, SN-CR, T-CR and ST-CR models under three different censoring proportions

CensoringCriteriaN-CRSN-CRT-CRST-CR
5%log-like−1400.080−1320.192−1320.621−1274.855
AIC2808.1602652.3852651.2422561.711
BIC2827.7912681.8312675.7812591.157
CAIC2831.7912687.8312680.7812597.157
HQIC2815.6212663.5772660.5682572.903

10%log-like−1327.306−1261.528−1257.535−1217.715
AIC2662.6122535.0562525.0702447.429
BIC2682.2432564.5022549.6092476.876
CAIC2686.2432570.5022554.6092482.876
HQIC2670.0732546.2472534.3962458.621

20%log-like−1188.030−1139.859−1132.080−1099.499
AIC2384.0602291.7192274.1612210.997
BIC2403.6912321.1652298.6992240.444
CAIC2407.6912327.1652303.6992246.444
HQIC2391.5212302.9112283.4872222.189

Table 3

MC means and MC standard deviations (in parentheses) of the ML estimates obtained through the EM and the SAEM algorithms

Algorithmβ1β2σ2λνlog-likeTime (min)
SAEM1.9866 (0.0364)2.9997 (0.0672)1.1900 (0.1727)−2.2938 (0.4411)5.7270 (1.6433)−541.8204 (18.4327)1.4450 (0.7873)
EM1.9880 (0.0367)3.0022 (0.0673)1.2102 (0.1859)−2.3056 (0.4230)6.1596 (1.7823)−542.0523 (18.6224)0.5565 (0.5450)

Table 4

LNF data. Summary statistics of LNF response and proportions for explanatory variables by uncensored and censored groups

VariableStatisticsUncensoredCensoredTotal
LNFn47932511
mean30.67439.84431.249
median283129
sd16.22820.71816.669
min1281
max989999
skewness1.2652.1241.412
kurtosis2.5673.0753.131

Zonerural0.3130.3120.313

Grade3rd0.2900.2810.290

Genderfemale0.5010.5000.501

Table 5

LNF data: Model selection criteria

Model# parameterslog-likeAICBICCAICHQIC
N-CR5−2058.9624127.9234149.1054154.1054136.227
SN-CR6−2007.7384027.4754052.8944058.8944037.44
T-CR6−2023.1044058.2084083.6264089.6264068.173
ST-CR7−1995.3364004.6724034.3274041.3274016.298

Table 6

LNF data. ML estimates under SN-CR and ST-CR models. SE is the corresponding standard error

ParameterSN-CRST-CR


EstimateSEEstimateSE
Intercept β012.7061.32314.8131.349
Zone β1−4.6751.510−4.7131.291
Grade β25.9721.3236.1921.253
Gender β3−0.1591.2060.0321.137
σ2659.59644.616377.29544.522
λ3.8000.5992.7100.509
ν--5.184-

References
  1. Akaike H (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
    CrossRef
  2. Arellano-Valle RB, Castro LM, González-Farías G, and Muñoz-Gajardo KA (2012). Student-t censored regression model: properties and inference. Statistical Methods & Applications, 21, 453-473.
    CrossRef
  3. Azzalini A (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171-178.
  4. Azzalini A and Capitanio A (1999). Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B, 61, 579-602.
    CrossRef
  5. Azzalini A and Capitanio A (2003). Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 367-389.
    CrossRef
  6. Azzalini A and Dalla Valle A (1996). The multivariate skew-normal distribution. Biometrika, 83, 715-726.
    CrossRef
  7. Basso RM, Lachos VH, Cabral CR, and Ghosh P (2010). Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics & Data Analysis, 54, 2926-2941.
    CrossRef
  8. Bozdogan H (1987). Model selection and Akaikeś Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
    CrossRef
  9. Burnham KP and Anderson DR (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed), Springer-Verlag.
  10. Cronin V and Carver P (1998). Phonological sensitivity, rapid naming and beginning reading. Applied Psycholinguistics, 19, 447-461.
    CrossRef
  11. Dempster A, Laird N, and Rubin D (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
  12. Foulin JN (2005). Why is letter-name knowledge such a good predictor of learning to read?. Reading and Writting, 38, 129-155.
    CrossRef
  13. Galarza CM, Kan R, and Lachos VH (2020) MomTrunc: Moments of Folded and Doubly Truncated Multivariate Distributions, R package version 5.69 . http://cran.r-project.org/package=MomTrunc
  14. Garay AM, Lachos VH, Bolfarine H, and Cabral CRB (2017a). Linear censored regression models with scale mixtures of normal distributions. Statistical Papers, 58, 247-278.
    CrossRef
  15. Garay AW, Massuia MB, and Lachos VH (2017b) BayesCR: Bayesian Analysis of Censored Regression Models Under Scale Mixture of Skew Normal Distributions. R package version 2.1 . http://cran.r-project.org/package=BayesCR
  16. Lachos VH, Garay A, and Cabral CR (2020). Moments of truncated skew-normal/independent distributions. Brazilian Journal of Probability and Statistics, 34, 478-494.
    CrossRef
  17. Lachos VH, Moreno EJL, Chen K, and Cabral CRB (2017). Finite mixture modeling of censored data using the multivariate Student-t distribution. Journal of Multivariate Analysis, 159, 151-167.
    CrossRef
  18. Liu C and Rubin DB (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika, 81, 633-648.
    CrossRef
  19. Louis TA (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44, 226-233.
  20. Marston D and Magnusson D (1988), Graden J, Zins J, and Curtis M (Eds). Alternative Educational Delivery Systems: Enhancing Instructional Options for All Students, (pp. 137-172), Washington, DC, Publisher = National Association of School Psychology.
  21. Massuia MB, Cabral CRB, Matos LA, and Lachos VH (2015). Influence diagnostics for Student-t censored linear regression models. Statistics, 49, 1074-1094.
    CrossRef
  22. Massuia MB, Garay AM, Lachos VH, and Cabral CRB (2017). Bayesian analysis of censored linear regression models with scale mixtures of skew-normal distributions. Statistics and its Interface, 10, 425-439.
    CrossRef
  23. Mattos TdB, Garay AM, and Lachos VH (2018). Likelihood-based inference for censored linear regression models with scale mixtures of skew-normal distributions. Journal of Applied Statistics, 45, 2039-2066.
    CrossRef
  24. Ritchey K and Speece D (2006). From letter names to word reading: The nascent role of sublexical fluency. Contemporary Educational Psychology, 31, 301-327.
    CrossRef
  25. RTI-FDA (2008). Snapshot of School Management Effectiveness: Peru Pilot Study. Technical report, USAID.
  26. Schwarz G (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
    CrossRef