Panel data sets have been developed in various areas, and many recent studies have analyzed panel, or longitudinal, data sets. Maximum likelihood (ML) may be the most common statistical method for analyzing panel data models; however, inference based on the ML estimate will have an inflated Type I error because the ML method tends to give a downwardly biased estimate of the variance components when the sample size is small. The underestimation can be severe when data are incomplete. This paper proposes the restricted maximum likelihood (REML) method for a random effects panel data model with a censored dependent variable. Note that the likelihood function of the model is complex in that it includes a multidimensional integral. Many authors have proposed integral approximation methods for the computation of the likelihood function; however, it is well known that integral approximation methods are inadequate for high dimensional integrals in practice. This paper proposes using the moments of a truncated multivariate normal random vector to calculate the multidimensional integral. In addition, a proper asymptotic standard error of the REML estimate is given.
The panel regression model with individual specific effects has the following specification:

$$y_{it}^{*} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it}, \qquad i = 1, \dots, n, \quad t = 1, \dots, T_i,$$

where $y_{it}^{*}$ is the latent dependent variable, $\mathbf{x}_{it}$ is a vector of covariates, $\alpha_i$ is the individual specific random effect, and $\varepsilon_{it}$ is the error term. The standard assumption on the error components, viz.

$$\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2), \qquad \varepsilon_{it} \overset{iid}{\sim} N(0, \sigma_\varepsilon^2),$$

with $\alpha_i$ and $\varepsilon_{it}$ mutually independent, is imposed throughout.
A generalization of the standard Type I Tobit model allows the dependent variable to be censored from the left, the right, or both directions, where the observed values of the dependent variable are defined as

$$y_{it} = \begin{cases} \ell, & y_{it}^{*} \le \ell, \\ y_{it}^{*}, & \ell < y_{it}^{*} < u, \\ u, & y_{it}^{*} \ge u, \end{cases}$$

where $\ell$ and $u$ denote the left and right censoring points, respectively.
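As a small illustration, two-sided Type I censoring maps each latent value onto the interval $[\ell, u]$; the sketch below applies it to a hypothetical latent sample (the censoring points $\ell = 0$ and $u = 3$ are arbitrary choices for the example):

```python
import numpy as np

def censor(y_star, lower=-np.inf, upper=np.inf):
    """Apply two-sided Type I censoring to latent values y*."""
    return np.clip(y_star, lower, upper)

y_star = np.array([-1.2, 0.4, 2.7, 3.8])
y_obs = censor(y_star, lower=0.0, upper=3.0)
# y_obs: [0.0, 0.4, 2.7, 3.0] -- interior values pass through unchanged
```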
Such a censored dependent variable commonly occurs in survival analysis and in biomedical and epidemiological studies. The usual estimation methods fail to provide consistent estimates for a conventional regression model, which has led to the development of estimation methods for the censored regression model (Tobin, 1958; Maddala, 1983; Amemiya, 1984). Currently, the dominant method may be maximum likelihood (ML), which is implemented in most statistical packages dealing with a censored regression model, including several R packages.
The ML estimator in a nonlinear panel data model with fixed effects is widely understood to be biased and inconsistent when the length of the panel is small and fixed, the well-known incidental parameters problem.
Note that REML estimates variance components on the basis of the residuals obtained after eliminating the fixed effects contained in a model. In effect, REML divides the mean squared deviation by the degrees of freedom instead of by the sample size, which remedies the downward bias of ML. It also has a Bayesian justification. Today, REML is widely used for estimating variance components in various mixed effects models with complete data, but surprisingly it is not well established for limited dependent variable models such as binary or censored dependent variable models. Even the definition of the REML procedure is unclear: for example, Lee and Nelder (2001) regarded REML as an adjusted profile likelihood method, whereas Drum and McCullagh (1993) considered it an unbiased estimating equation method. See Noh and Lee (2007) for further details.
The difficulty of an ML based approach for the limited dependent variable model lies mainly in computation rather than in theory. For instance, Hughes (1999) provided ML and REML estimates in a general mixed effects linear model with censored data using a Monte Carlo EM algorithm and claimed that the approach can be used with an arbitrarily complex design matrix; however, such an EM based method lacks the capability of providing standard errors of the variance components. He gave asymptotic standard errors only for the fixed effects, relying on maximum likelihood theory. Since that asymptotic standard error does not take the estimation of the variance components into account, the approximation may not be theoretically appealing. Like Hughes (1999), most works dealing with REML focus on parameter estimation itself and do not mention the standard errors of the REML estimates. The calculation of the asymptotic standard error of a REML estimate is believed to be another challenging task.
The main difficulty of likelihood based methods with a limited dependent variable is the computation of the likelihood function. It is difficult to maximize the function directly because the likelihood function contains multidimensional integrals. Many authors have proposed approximating the integrals and then maximizing the approximated likelihood function. Various integral approximation methods, such as Newton and Gauss-Hermite quadratures, Monte Carlo integration, and Markov chain Monte Carlo, have been employed for this purpose. The approximation method enables likelihood based estimation in the limited dependent variable model; however, it is well known that such integral approximation methods are inadequate for high dimensional integrals. It may be difficult to obtain a good approximation when the number of censored observations is large under the censored random effects panel model. Indeed, statistical packages using different methods are observed to give quite different results. See Zhang
The main object of this paper is to present a REML procedure for a censored dependent variable model. Many authors have considered REML in binary dependent variable models, but the literature on REML estimation with censored data is limited. The censored dependent variable model resembles the normal-probit model for binary data; however, the methodology used in the normal-probit model is not directly applicable to the model considered here, in that we do not use an approximation method. It is also demonstrated through a simulation study that when the sample size is small, REML is the proper method in the sense that inferences based on it have a Type I error rate close to the nominal level.
Let $\mathbf{y}_i^{*} = (y_{i1}^{*}, \dots, y_{iT_i}^{*})'$ denote the latent response vector of the $i$th panel, so that

$$\mathbf{y}_i^{*} = X_i\boldsymbol{\beta} + \alpha_i \mathbf{1}_{T_i} + \boldsymbol{\varepsilon}_i \sim N\!\left(X_i\boldsymbol{\beta},\, V_i\right),$$

where $X_i$ is the $T_i \times p$ matrix of covariates, $\mathbf{1}_{T_i}$ is a vector of ones, and $V_i = \sigma_\alpha^2 \mathbf{1}_{T_i}\mathbf{1}_{T_i}' + \sigma_\varepsilon^2 I_{T_i}$.
Note that when the data are complete, i.e., there are no censored observations, the REML equations for the variance components are shown to be

$$\operatorname{tr}\!\left(P \frac{\partial V}{\partial \sigma_k^2}\right) = \mathbf{y}' P \frac{\partial V}{\partial \sigma_k^2} P\, \mathbf{y}, \qquad \sigma_k^2 \in \{\sigma_\alpha^2, \sigma_\varepsilon^2\},$$

where $V = \operatorname{diag}(V_1, \dots, V_n)$, $X = (X_1', \dots, X_n')'$, and

$$P = V^{-1} - V^{-1} X \left(X' V^{-1} X\right)^{-1} X' V^{-1},$$

where tr($\cdot$) denotes the trace of a matrix.
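With complete data these equations can be evaluated directly. The following sketch (hypothetical dimensions and trial variance values, not the paper's code) forms $V$ and the matrix $P$ for a balanced one-way random effects layout, evaluates one REML score component, and checks the identity $PX = 0$ that underlies the elimination of the fixed effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 4                       # hypothetical: 5 panels, 4 periods each
N = n * T
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = np.kron(np.eye(n), np.ones((T, 1)))    # maps panel effects to observations
s2a, s2e = 1.0, 1.0               # trial values of the variance components
V = s2a * Z @ Z.T + s2e * np.eye(N)
Vinv = np.linalg.inv(V)
P = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)

y = rng.normal(size=N)
dV = Z @ Z.T                      # dV / d(sigma_alpha^2)
# REML score component; it is zero when (s2a, s2e) solve the REML equations
score = y @ P @ dV @ P @ y - np.trace(P @ dV)
```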
In general, the REML estimation includes no procedure to estimate the fixed effects. In the complete data case, the fixed effects are estimated given the estimated variance components; with censored data, however, the conditional expectations are determined by the variance components as well as by the slope parameter $\boldsymbol{\beta}$, so the two sets of parameters must be estimated jointly.
To solve the equations, one needs to compute conditional expectations related to the moments of a truncated multivariate normal random vector. The calculation of these moments is thus essential, and it is the main problem of ML based approaches. Many researchers have employed the Gibbs sampler or Gauss-Hermite numerical integration to approximate the moments, and then used the EM algorithm or its variants. However, such numerical methods are generally not recommended for high dimensional integrals. Since the dimensionality increases with the number of censored observations, they may not give a proper approximation when the number of censored observations is large. An EM based method also lacks the capability of providing the standard error of the REML estimate: computing the standard error requires moments of the truncated multivariate normal vector up to the fourth order.
There is a long history of moment calculations for a multivariate truncated normal random vector. Using the moment generating function or recurrence relationships, many moment calculation methods have been proposed under various conditions; see Arismendi (2013). Among them, Kan and Robotti (2017) gave a method that meets our needs. In what follows, we may safely assume that the necessary moments of the truncated variables are obtainable. It is worth mentioning that their algorithm requires computing moments up to the fifth order.
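Such moments are routine in the univariate case; for a standard normal truncated to $(0, \infty)$, the mean and variance have the closed forms $\sqrt{2/\pi}$ and $1 - 2/\pi$. A quick sanity check with SciPy (an illustration only, not the Kan-Robotti algorithm):

```python
import numpy as np
from scipy.stats import truncnorm

# truncnorm takes the truncation bounds in standard-deviation units
# of the untruncated distribution; here: N(0, 1) truncated to (0, inf).
a, b = 0.0, np.inf
mean, var = truncnorm.stats(a, b, moments='mv')
# Closed forms: mean = sqrt(2/pi), var = 1 - 2/pi
```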
To compute the conditional expectations in the estimation equations, we use the moments of a truncated multivariate normal random vector.
Suppose the latent vector is partitioned into its observed part and its censored part. Note that each element of the censored part is known only to lie in the truncation region determined by the censoring points, so that, conditionally on the observed data, the censored part follows a truncated multivariate normal distribution.
The conditional expectation of a quadratic form of the latent vector can then be computed from these conditional moments. Here, a symmetric matrix $A$ defines the quadratic form, and the computation requires only the conditional mean and covariance.
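The key identity here is $E(\mathbf{y}'A\mathbf{y}) = \operatorname{tr}(A\Sigma) + \boldsymbol{\mu}'A\boldsymbol{\mu}$ for a random vector with mean $\boldsymbol{\mu}$ and covariance $\Sigma$. A small numerical check, with hypothetical $\boldsymbol{\mu}$, $\Sigma$, and $A$:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -0.5, 2.0])            # hypothetical mean vector
L = rng.normal(size=(3, 3))
Sigma = L @ L.T + np.eye(3)                # a positive definite covariance
B = rng.normal(size=(3, 3))
A = (B + B.T) / 2                          # symmetric matrix of the quadratic form

# E(y'Ay) = tr(A Sigma) + mu'A mu, equivalently tr(A E[yy'])
lhs = np.trace(A @ Sigma) + mu @ A @ mu
rhs = np.trace(A @ (Sigma + np.outer(mu, mu)))
```

The two expressions agree because $E(\mathbf{y}\mathbf{y}') = \Sigma + \boldsymbol{\mu}\boldsymbol{\mu}'$; only the first two moments enter, which is why the conditional mean and covariance of the truncated part suffice.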
The Newton-Raphson method is applicable for solving the estimation equations; it requires the derivatives of the equations to form a Jacobian matrix. One may consider numerical differentiation, which reduces some computational burden and hence speeds up obtaining the Jacobian; however, we found the numerical method to be quite unstable. Besides, the Jacobian can be obtained analytically.
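The Newton-Raphson update itself is routine; a generic sketch with an analytic Jacobian, illustrated on a toy two-equation system (hypothetical, not the paper's REML equations):

```python
import numpy as np

def newton_raphson(S, J, theta0, tol=1e-10, max_iter=50):
    """Solve S(theta) = 0 by Newton-Raphson using the analytic Jacobian J."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(theta), S(theta))
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Toy system: theta1^2 + theta2 = 3 and theta1 + theta2 = 2
S = lambda t: np.array([t[0]**2 + t[1] - 3.0, t[0] + t[1] - 2.0])
J = lambda t: np.array([[2.0 * t[0], 1.0], [1.0, 1.0]])
root = newton_raphson(S, J, [2.0, 2.0])
```

The same template applies with `S` as the REML estimating equations and `J` as their analytic Jacobian; an unstable finite-difference `J` is exactly what the scheme above avoids.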
Using well-known ML theory, the partial derivatives of
and
where
The conditional expectation can be written as
and (
Likewise,
Theorem 1 is quite standard and is useful not only for REML but also for ML estimation. It is perhaps well known, but we could not find a statement of the second order derivatives when data are incomplete, so we include the proof for completeness. The second order derivatives give an advantage to our definition of REML as the solution of (
Once we note
Let S(
where
Since the
and for each
where
and the last term of above equation is equal to
the conditional covariance can be calculated if
for
for
Thus, the nonlinear simultaneous equations, S(
are computable with conditional moments
The REML estimate of
with an arbitrary initial value
When data are incomplete, the standard error of REML appears more complex than that of ML. For ML, Louis (1982) expressed the incomplete data Hessian in terms of conditional expectations of the complete data Hessian, which makes it possible to work with incomplete data; theoretically there is no difficulty in ML estimation, since the Hessian is directly related to the variance of the ML estimate. Many alternatives to Louis's method have also been proposed (Meilijson, 1989; Meng and Rubin, 1991; Duan and Fulop, 2011). The problem in REML estimation is that, unlike the ML case, the Hessian does not give a variance. Moreover, it is hard to find literature mentioning the standard error, even though many researchers discuss REML estimation with incomplete data. Nevertheless, a slight variant of maximum likelihood theory gives an asymptotic variance of the REML estimate.
The Taylor series expansion gives
In ML estimation, the Hessian equals the Jacobian, and the variance of the score is obtained as the expected value of minus the Hessian,
To use Theorem 2, Var(S(
but,
As stated before, the variance given in (
Since,
is applicable. Finally, the covariance of
The expected values in (
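In generic form, the resulting asymptotic covariance is the sandwich $J^{-1}\,\mathrm{Var}(S)\,J^{-\top}$, where $J$ is the Jacobian of the estimating equations; when the ML identity $\mathrm{Var}(S) = J$ holds, it collapses to the usual inverse-Hessian form. A minimal sketch with hypothetical matrices:

```python
import numpy as np

def sandwich_cov(J, VS):
    """Asymptotic covariance of the estimator: J^{-1} Var(S) J^{-T}."""
    Jinv = np.linalg.inv(J)
    return Jinv @ VS @ Jinv.T

# Hypothetical Jacobian of the estimating equations and variance of the score
J = np.array([[4.0, 1.0], [1.0, 3.0]])
VS = np.array([[2.0, 0.5], [0.5, 1.0]])
cov = sandwich_cov(J, VS)
se = np.sqrt(np.diag(cov))   # asymptotic standard errors
```

With `VS = J` (the ML case, `J` symmetric), `sandwich_cov(J, J)` reduces to `inv(J)`, recovering the inverse-Hessian variance.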
In this section, we present two examples that illustrate some of the main features of the REML and ML estimates. In addition, we discuss some computational issues.
The first data set, "EmpUK", presented by Arellano and Bond (1991), is an unbalanced panel of 140 firms observed from 1976 to 1984. It consists of 1031 observations on 7 variables. The dependent variable "emp" has a heavily right-skewed distribution, and some observations are quite large compared with the rest. In this example, all values of "emp" larger than 30 are treated the same; that is, we assume that the dependent variable is right-censored at 30, so that 57 out of the 1031 observations are treated as censored. Because of this assumption, the estimation itself may not be meaningful, but the data are suitable for illustrating a large sample property of the REML and ML estimates.
Table 1 shows the REML and ML estimates. All computations were done in R (R Core Team, 2017) with RcppArmadillo (Eddelbuettel and Sanderson, 2014) and Rcpp (Eddelbuettel and François, 2011).
Both
We did not examine
For the second example, we borrow an artificial panel data set of 60 observations on 15 panels from Henningsen (2017). The dependent variable
It can be observed that the estimates are nearly the same for all slope parameters. For instance, relative distances between REML and ML estimates
REML differs from ML in the estimation of the variance components and the standard errors. When the sample size is small, inference based on ML estimates may not be appropriate because of the inflated Type I error due to the underestimated standard errors. REML appears more adequate for inference in a censored regression model, since it reports larger standard errors than ML. To see this, we performed a simulation under the study design used in Example 2. That is, we generated datasets according to
where
The data generation process was repeated 1,000 times. Some estimates of the variance components may become negative; in particular, the ML method often estimates
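A sketch of such a data generating process is given below. The true values $\boldsymbol{\beta} = (-1, 2, 3)$ and $\sigma_\alpha^2 = \sigma_\varepsilon^2 = 1$ are read off the averages in Table 3; the covariate distributions and the left-censoring point at zero are illustrative assumptions, not necessarily the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_panel(n, T, beta=(-1.0, 2.0, 3.0),
                   s2_alpha=1.0, s2_eps=1.0, lower=0.0):
    """Generate one left-censored random effects panel data set (illustrative)."""
    x1 = rng.normal(size=(n, T))
    x2 = rng.normal(size=(n, T))
    alpha = rng.normal(scale=np.sqrt(s2_alpha), size=(n, 1))  # panel effect
    eps = rng.normal(scale=np.sqrt(s2_eps), size=(n, T))
    y_star = beta[0] + beta[1] * x1 + beta[2] * x2 + alpha + eps
    y = np.maximum(y_star, lower)                             # left-censoring
    return y, y_star, x1, x2

# 15 panels of length 4, matching the 60 observations of Example 2
y, y_star, x1, x2 = generate_panel(n=15, T=4)
```

Repeating this generation 1,000 times and fitting REML and ML to each replicate reproduces the design of the summary in Table 3.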
Note that the coverage probabilities of ML are smaller than the nominal level for all sample sizes. In particular, when
However, ML tends to have a smaller empirical mean squared error. Thus, ML is theoretically appealing when we wish to estimate the parameters themselves; however, inference based on ML, such as interval estimation or hypothesis testing on the slope parameters, may not be adequate when the sample size is small. This is perhaps well known in the general linear model.
We have considered the methodology under a censored random effects panel regression model, but it is applicable to a general mixed effects linear model with limited dependent variables.
Estimates of EmpUK data
|  | REML |  |  | ML |  |  |
|---|---|---|---|---|---|---|
|  | Estimate | Std. err | t-value | Estimate | Std. err | t-value |
| (Intercept) | 2.3431 | 0.7921 | 2.958 | 2.3423 | 0.7901 | 2.965 |
| wage | −0.0814 | 0.0164 | −4.970 | −0.0814 | 0.0164 | −4.977 |
| capital | 0.1245 | 0.0424 | 2.939 | 0.1248 | 0.0422 | 2.956 |
| output | 0.0424 | 0.0039 | 10.971 | 0.0424 | 0.0039 | 10.986 |
| $\sigma_\alpha^2$ | 35.1454 | 4.5396 |  | 34.8675 | 4.4131 |  |
| $\sigma_\varepsilon^2$ | 1.1414 | 0.0562 |  | 1.1382 | 0.0558 |  |
REML = restricted maximum likelihood; ML = maximum likelihood.
Estimates of an artificial panel data
|  | REML |  |  | ML |  |  |
|---|---|---|---|---|---|---|
|  | Estimate | Std. err | t-value | Estimate | Std. err | t-value |
| (Intercept) | −0.3921 | 0.4782 | −0.820 | −0.3655 | 0.4612 | −0.792 |
| x1 | 1.7020 | 0.2186 | 7.787 | 1.6838 | 0.2124 | 7.927 |
| x2 | 2.2875 | 0.6919 | 3.306 | 2.2636 | 0.6739 | 3.359 |
| $\sigma_\alpha^2$ | 0.9005 | 0.5109 |  | 0.7961 | 0.4474 |  |
| $\sigma_\varepsilon^2$ | 1.0175 | 0.2720 |  | 0.9734 | 0.2534 |  |
REML = restricted maximum likelihood; ML = maximum likelihood.
Summary of simulation
|  |  | REML |  |  |  | ML |  |  |  |
|---|---|---|---|---|---|---|---|---|---|
| $n$ |  | Aver. Est | Aver. std | MSE | Coverage | Aver. Est | Aver. std | MSE | Coverage |
| 10 | $\beta_0$ | −1.0630 | 0.5933 | 0.3953 | 0.937 | −1.0156 | 0.5604 | 0.3748 | 0.922 |
|  | $\beta_1$ | 2.0481 | 0.2918 | 0.0930 | 0.941 | 2.0225 | 0.2785 | 0.0891 | 0.936 |
|  | $\beta_2$ | 3.0525 | 0.7827 | 0.6436 | 0.937 | 3.0156 | 0.6763 | 0.6852 | 0.925 |
|  | $\sigma_\alpha^2$ | 1.0732 | 0.7353 | 0.5264 |  | 0.9133 | 0.6058 | 0.4005 |  |
|  | $\sigma_\varepsilon^2$ | 1.0027 | 0.3657 | 0.1363 |  | 0.9258 | 0.3218 | 0.1194 |  |
| 15 | $\beta_0$ | −1.0579 | 0.5354 | 0.3125 | 0.944 | −1.0195 | 0.5131 | 0.3077 | 0.924 |
|  | $\beta_1$ | 2.0184 | 0.2216 | 0.0758 | 0.949 | 2.0122 | 0.2550 | 0.0538 | 0.940 |
|  | $\beta_2$ | 3.0500 | 0.7061 | 0.5191 | 0.940 | 3.0095 | 0.7608 | 0.5048 | 0.930 |
|  | $\sigma_\alpha^2$ | 1.0483 | 0.5884 | 0.3474 |  | 0.9506 | 0.5532 | 0.3316 |  |
|  | $\sigma_\varepsilon^2$ | 0.9901 | 0.2982 | 0.0947 |  | 0.9408 | 0.2919 | 0.1008 |  |

REML = restricted maximum likelihood; ML = maximum likelihood.