TEXT SIZE

CrossRef (0)
Comparison of missing data methods in clustered survival data using Bayesian adaptive B-Spline estimation

Hanna Yooa, and Jae Won Lee1,b

aDepartment of Computer Software, Busan University of Foreign Studies, Korea, bDepartment of Statistics, Korea University, Korea
Correspondence to: Department of Statistics, Korea University, 145 Anam-Ro, SeongBuk- Gu, Seoul 02841, Korea. E-mail: jael@korea.ac.kr
Received September 13, 2017; Revised December 21, 2017; Accepted December 22, 2017.
Abstract

In many epidemiological studies, missing values in the outcome arise due to censoring. Such censoring is what makes survival analysis special and differentiated from other analytical methods. There are many methods that deal with censored data in survival analysis. However, few studies have dealt with missing covariates in survival data. Furthermore, studies dealing with missing covariates are rare when data are clustered. In this paper, we conducted a simulation study to compare results of several missing data methods when data had clustered multi-structured type with missing covariates. In this study, we modeled unknown baseline hazard and frailty with Bayesian B-Spline to obtain more smooth and accurate estimates. We also used prior information to achieve more accurate results. We assumed the missing mechanism as MAR. We compared the performance of five different missing data techniques and compared these results through simulation studies. We also presented results from a Multi-Center study of Korean IBD patients with Crohn’s disease (Lee et al., Journal of the Korean Society of Coloproctology, 28, 188–194, 2012).

Keywords : Bayesian adaptive B-spline, clustered data, MICE, missing covariates, multiple imputation, single imputation
1. Introduction

In epidemiological studies, missing covariate is a common problem in survival data and is often encountered in many statistical applications. The easiest way to deal with missing covariates is to discard missing observations and use only completely observed subjects. However, using only observed subjects may lead to biased estimates of parameters. As the proportion of missing data increases, there will be a substantial loss of efficiency. Rather than discarding missing data, statistical approaches can be applied under the presence of missing covariates. Zhou and Pepe (1995) have proposed an estimated partial likelihood method to estimate relative risk parameters. Lipsitz and Ibrahim (1996) have extended the EM by the method of weights to survival outcomes whose distributions may not fall in the class of generalized linear models. Chen and Little (1999) have used a nonparametric maximum likelihood to estimate regression parameters in a proportional hazards regression model with missing covariates. All these methods mentioned above can be used with the Cox model for univariate survival data with missing covariates. Imputing the missing data can be performed instead of using a likelihood based method to handle missing covariates. Likelihood based methods can be sophisticated. They generally require problem-specific programs. However, imputing the missing data enables the use of whole data and analysis can be performed easily using existing methods. Marshall et al. (2010) have compared several imputing methods (complete case (CC) analysis, single imputation (SI), and five multiple imputation methods) for handling missing covariates in univariate survival data through simulations. With or without imputation, most papers are based on the frequentist maximum likelihood method. Bayesian approaches have also merged and recently applied to survival data more frequently compared to the frequentist maximum likelihood method. Sharef et al. (2010) proposed a Bayesian adaptive B-spline estimation when clustered survival data are complete without missing covariates. They modeled the unknown baseline hazard and density of the random effects using a penalized mixture of B-splines. Inclusion of prior information of baseline hazard function and frailty density made the model more flexible. Using the spline component also improved the performance when the hazard or the frailty density was distinctly non-smooth in nature Sharef et al. (2010).

In this paper, we extended the Bayesian adaptive B-spline estimation method to clustered survival data with missing covariates. This study compares built-in imputation methods with CC analysis in the presence of missing covariates under clustered survival data using a Bayesian approach. We used the Bayesian adaptive B-spline estimation method proposed by Sharef et al. (2010) after imputing the missing covariates. We conducted a simulation study and compared five different missing data techniques: 1) complete case (CC) analysis, 2) single imputation (SI) using predictive mean matching (PMM), 3) MI-MICE, 4) MI-MICE-PMM, and 5) MI-AregImpute (the last three techniques are multiple imputation techniques using multivariate imputation by chained equations (MICE) available within R statistical software). The remainder of the paper is organized as follows. In Section 2, we briefly described each of the five imputation methods. We introduced how we extended the Bayesian adaptive B-spline estimation method when the data had missing covariates in Section 3. Results of the simulation study are summarized in Section 4. We then showed results applying to Crohn’s disease data in Section 5. We then concluded the study with a brief discussion in Section 6.

2. Missing data methods

Missing covariates are imputed through four different imputation methods with CC analysis. These imputation methods included one SI and three multiple imputation methods. We will briefly describe each method below.

### 2.1. Complete case analysis

The simplest way to handle missing values in the data is to use only observed values (that is, only CC is used). The CC estimator is unbiased when the missing mechanism is missing complete at random (MCAR). However, if there are many variables with missing values, then a large proportion of observations may be dropped. This can cause bias in missing at random (MAR) and lose efficiency in MCAR.

### 2.2. Single imputation using predictive mean matching

Imputing a missing value with a predicted value is called predictive mean matching. It is used to impute a value randomly from a set of observed values whose predicted values are closest to predicted values from a specified regression model (Heitjan and Little 1991; Schenker and Taylor 1996).

Let Y = (Y1, . . . , Yn) be a n × p matrix composed of n subjects with p variables. Let Yi = (Yi1, . . . , Yip) be one of incomplete variables and denote $Yobs=(Y1obs,…,Ypobs)$ and $Ymis=(Y1mis,…,Ypmis)$ as observed data and missing data in Y, respectively. For each incomplete subject, the conditional expected value μ̂ = E(Ymis|Yobs) is estimated and the missing value is imputed through the nearest neighborhood subject. When a continuous random variable is imputed, this method is straightforward. However, it might be more complicated for a categorical variable.

### 2.3. Multiple imputation

Multiple imputation is a three-step approach in estimating incomplete data regression models (Rubin, 1987). The notation of Y is the same as described above. Q is denoted as the quantity of interest. The first step is to create plausible values for missing observations from a specifically modeled distribution that can reflect uncertainty about the nonresponse model. Usually, five to ten imputed datasets are created. The imputed dataset is denoted as Y(1), . . . , Y(m), where m is the number of imputations. The next step is to analyze the imputed data using typical methods we would have used for complete data. Denote m estimates as (1), . . . , (m) for each imputed dataset. The last step is to combine results for each imputed dataset and obtain pooled estimate $Q¯=(1/m)∑i=1mQ^(i)$. Compared with SI, multiple imputation methods can minimize standard error and increase efficiency of estimates. However, they are more difficult to perform.

The MI-MICE is an imputation method using multivariate imputation by chained equations. MICE is a software for imputing incomplete multivariate data by fully conditional specification. It appeared in the R package library in 2001. Fully conditional method specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each for incomplete variable (van Buuren, 2007). Under the assumption that a multivariate distribution exists from conditional distributions, MICE constructs a Gibbs sampler from specified conditionals. Let θ = (θ1, . . . , θp) denote the vector of p unknown parameters of the multivariate distribution of Y. That is, starting from observed marginal distributions, the ith iteration of chained equations is a Gibbs sampler that successively draws.

$θ1*(t) ~ P (θ1∣Y1obs,Y2(t-1),…,Yp(t-1)), ⋮ θp*(t) ~ P (θp∣Ypobs,Y1(t-1),…,Yp-1(t-1)), Y1*(t) ~ P (Y1∣Y1obs,Y2(t-1),…,Yp(t-1),θ1*(t)), ⋮ Yp*(t) ~ P (Yp∣Ypobs,Y1(t-1),…,Yp-1(t-1),θp*(t)),$

where $Yi(t)=(Yiobs,Yi(t))$ is the ith imputed variable at iteration t. This method has different names, including regression switching (van Buuren et al., 1999), variable-by-variable imputation (Brand, 1999), and chained equations. In this paper, we used imputation methods built in the MICE software in the R package library.

MI-MICE-PMM is a method that imputes missing values under multiple imputation with regression switching (MICE) using predictive mean matching (PMM). The imputing method is the same as the SI except that the PMM imputation method is done in m imputed datasets Y(1), . . . , Y(m).

Lastly, the flexible additive imputation model with PMM under multiple imputation method (MI-AregImpute) is performed under the library ‘aregImp’ in the MICE package. This method takes all aspects of uncertainty in imputations into account by using bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution (Frank, 2010). For each multiple imputation, different bootstrap resamples are used and a flexible additive model is fitted on a sample with replacement from the original data. This model is then used to predict all original values for the target variable.

3. Statistical model

Clustered failure time data occur when study subjects from the same cluster share common characteristics such that failure times within the same cluster are correlated. One possible mechanism is the existence of a common risk. When this common risk acts as a factor on hazard function, it is called frailty. The frailty model was first named by Vaupel et al. (1979). It was studied by Clayton (1978) for multivariate failure time data. It is usually used for clustered failure time data.

Let Xi j and Ci j denote the failure and censoring time of the j (= 1, . . . ,mi)th subject in the i (= 1, . . . , M)th cluster, respectively. Furthermore, let Zi j be a p vector fixed covariates. We first assumed that Zi j had no missing values. The observed subject took the form Ti j = min(Xi j,Ci j) and δi j = I(Xi jCi j).The correlation between subjects in the same cluster was incorporated through random effects or frailties Ui(i = 1, . . . , M). Frailties represent the heterogeneity of each cluster. They are assumed to have a particular distribution f (·). We assumed that censoring time was conditionally independent of the failure time given covariates and frailties. The hazard function for subject j in cluster i is given by the following:

$λij(t∣U,Z)=Uiλ0(t)exp(Zi jtβ),$

where U = (Ui, . . . ,UM)T, Z = {Zi j, i, . . . , M; j = 1, . . . ,mi}, and β is a p-dimensional vector of regression coefficients.

In this study, we used mixtures of B-spline for the baseline hazard λ0(·) and the density of the frailty f (·) in formula (3.1) as noted in Sharef et al. (2010). We followed their notations. Markov Chain Monte Carlo (MCMC) was then used to obtain regression coefficients. For the baseline hazard function λ0(·), a non-negative linear combination of Kλ B-spline basis functions Bλk(·) was used. For t ≥ 0, the baseline hazard function was presented as

$λ0 (t∣θλ,ηλ,φλ)=λ0p(t∣ηλ)+φλ [∑k=1KλBλk(t)ωλk-λ0p(t∣ηλ)],$

where θλ is the spline parameter with weight ωλk = eθλk for each basis function, λ0p(·|ηλ) denotes the hazard function for a specified parametric family, and ηλ is the parameter of the parametric family. Including a parametric part enabled us to model the baseline hazard and frailty density curve smoother with less variable fits. Weibull or lognormal distribution is widely used. φλ ∈ [0, 1] is the weight that specifies the degree of confidence in the parametric component. For example, φλ = 0 will lead to a purely parametric Bayesian survival model.

Next, the frailty density function can be written as

$f(x∣θu,ηu,φu)=φu [∑k=1KuB˜uk(x)ωuk+(1-φu)fp(x∣ηu)],$

where θu is the spline parameter and fp(·|ηu) is the density function for a specified parametric family of probability distributions with parameters ηu. The normalized B-spline basis function is written as $B˜uk(x)=Buk(x)·(∫-∞∞Buk(s)ds)-1$ for x ≥ 0 with weight $ωuk={exp(θuk)/∑l=1Kuexp(θul)}·φu∈[0,1]$ is the weight for the parametric component.

For the regression parameter β and spline parameters θλ and θu, multivariate Gaussian distribution with variances $σβ2,σλ2,σu2$ is assumed, respectively. They are independent of each other. The prior distribution of $σβ2,σλ2,σu2$ is assumed to be multivariate Gaussian. Penalty functions pλ(θλ) and pu(θu) are incorporated to induce smoothness in B-spline coefficients and avoid over fitting. For weights φ = (φλ, φu) in formula (3.2) and (3.3), Beta priors with hyperparameters αφu and αφλ can be used.

Let θ = (θλ, θu) denote the set of spline parameters and $σ=(σβ2,σλ2,σu2)$ denote the set of variance parameters. For parametric distribution, denote the set of parameters as η = (ηλ,ηu) with prior distribution πλ(ηλ|τλ) and πu(ηu|τu), respectively. The prior set is denoted as τ = (τλ, τu). It may have hyperparameters αλ = (αηλ, αηu). The log of the posterior density is then given by

$l(U,θ,η,φ,σ,τ∣T,δ,Z)=∑i,jδij[log(Ui)+log λ0(Tij∣θλ,ηλ,φλ)]-∑i,jδijUiΛ0(Tij∣θλ,ηλ,φλ)ezijTβ+∑ilog f(Ui∣θu,ηu,φu)-p2log σβ2+βTβ2σβ2-∑d∈{λ,u}{Kd2log σd2+pd(θd)2σd2}-∑d∈{β,λ,u}{(αd1+1) log σd2+αd2σd2}$$+log {φλ(αφλ1-1)(1-φλ)(αφλ2-1)}+log {φu(αφu1-1)(1-φu)(αφu2-1)}$$+log πλ (ηλ∣τηλ)+log πu (ηu∣τηu)+log πη (ηu∣τηu),$

where (3.4) corresponds to log likelihood l(U, θ,σ|η,φ,T, δ,Z), (3.5) corresponds to Beta priors on weights, and the term (3.6) corresponds to the priors on parametric components. By maximizing the conditional log-likelihood, spline and parametric coefficients θλ, θu,ηλ,ηu are initialized. After initial values are obtained, each set of parameters is updated by Gibbs sampling and Metropolis Hastings MCMC.

When there is no missing covariate, parameters of interest can be obtained as shown above. Since there are missing covariates in the data, we used imputation methods to fill in the missing values. In addition, we also performed CC analysis and obtained coefficients of parameters. Let the complete data be noted by Y = (Yobs, Ymis), where Yobs and Ymis are observed part and missing part of Y, respectively. Also, let P(Y|θ) model be the complete data and θ denote parameters of interest. The posterior predictive distribution for Ymis is given by:

$P(Ymis∣Yobs)=∫P(Ymis∣Yobs,θ)P(θ∣Yobs)dθ,$

where

$P(θ∣Yobs)∝P(θ)∫P(Yobs,Ymis∣θ)dYmis.$

The basic scheme of the process is given as:

• Step 1: For each cluster i = 1, . . . , M, repeat the following steps for j = 1, . . . , Q plausible imputed data sets.

• Imputation step: Generate missing values $Ymisj+1$ from P(Ymis|Yobs, θj).

• Posterior step: Draw parameters θj+1 from $P(θ∣Yobs,Ymisj+1)$.

Repeating the above steps will generate the following Markov Chain $(Ymis1,θ1),(Ymis2,θ2),…,(Ymisj,θj).$

These two steps are iterated with a starting value θ0 until the distribution P(Ymis|Yobs, θ) is stabilized.

• Step 2: For Q imputed data sets, model the baseline hazard function and the frailty density function as formula (3.2) and (3.3), respectively. By applying Bayesian method, estimate coefficients estimator of interest in each Q imputed data sets and denote them as θ̂(1),θ̂(2), . . . ,θ̂(Q).

• Step 3: Combine Q estimates θ̂(1),θ̂(2), . . . ,θ̂(Q) into θ̂ which is the mean of Q estimates. For CC analysis, subjects with missing values are deleted. For SI, only step 1 is needed.

4. Simulation studies

We conducted a simulation study to investigate the performance of various missing data methods under several different data settings. Three covariates Zi j1, Zi j2, Zi j3 (i = 1, . . . , M, j = 1, . . . ,mi) are considered, where Zi j1 has no missing values and Zi j2, Zi j3 have missing values. We assumed the missing mechanism as MAR. Therefore, the missingness in covariates Zi j2, Zi j3 only depends on observed values. We set the distribution of Zi j1 and Zi j3 as $(Z1Z3)~MVN((21),(0.500.150.150.25))$ and assumed Zi j2 had a Bernoulli distribution with mean p = exp(−1 + 2 × Zi j1)/{1 + exp(−1 + 2 × Zi j1)}. Frailties Ui were sampled from LN(1, 0.252). We set coefficients as β = (β1, β2, β3) = (1, −1, 1). We fixed hyperparameters α = 0.01 and all variances of frailties as $σλ2=σu2=0.1$ to achieve suitably noninformative priors. For each cluster i, sample mi clustered failure times Xi j from exponential distribution with intensity Ui exp(β1Zi j1 + β2Zi j2 + β3Zi j3). Censoring time Ci j was also generated from exponential distribution with appropriate mean which resulted in 20% of censoring rate. Then the observed failure time was obtained as Ti j = min(Xi j,Ci j) and the censoring indicator as δi j = I(Xi jCi j). We set the number of clusters as M = 10 and 100 with cluster size as mi = 50 and 5, respectively. Therefore, the whole sample size was 500. For Weibull baseline and lognormal frailty distribution, we set a Beta(1, 2) prior on the weight.

Missing data were then generated as:

$p(rijk=1∣xij,zijk,ρ)=exp(ρ0+ρ1h(xij)+ρ2zij1×h(xij))1+exp(ρ0+ρ1h(xij)+ρ2zij1×h(xij)), k=2,3,$

where ri jk indicates whether zi jk is missing and h(xi j) is an indicator function which is 1 if xi j is in the last quartile of the survival time and 0 otherwise. ρ = (ρ0, ρ1, ρ2) = (−5.45, −2.50, 3), (−4.71, −2.14, 3), (−3.24, −1.39, 3), and (−1.46, −0.68, 3), for 5%, 10%, 25%, and 50% missing, respectively. We replicated the simulation 300 times. Table 1 shows the bias of the regression coefficient estimate β̂ and the estimated standard deviation for frailty σ̂u under various missing rates with five different missing data methods when M = 10 and mi = 50. Figure 1 shows a summary of Table 1. We denote the notation of covariate Zi j1, Zi j2, Zi j3 as Z1, Z2, and Z3 respectively. For covariate Z1, the missing rate corresponds to the whole data missing rate since Z1 has no missing value. We wanted to see how the bias of β̂1 was affected by the missing rate of the whole data. CC has the smallest bias among all other imputation methods when a missing rate is rather small. However, when the missing rate is greater than 10%, the bias is increased rapidly. This shows the explicit drawback of CC analysis. AregImpute showed the best performance for a missing rate of 10% and 15%. SI, MICE, and MICE-PMM imputation methods showed rather unbiased results when missing rate was less than 50%. Especially for AregImpute, the bias decreased even when the missing rate increased up to 25%. This method had the smallest bias among the five methods for almost every missing rate tested.

For discrete covariate Z2, at missing rate up to 5%, the SI method had the smallest bias while CC had the largest bias. This conflicted with the result of covariate Z1, where CC had the smallest bias under a 5% missing rate. SI, MICE, and MICE-PMM showed similar results. However, SI had the largest bias when the missing rate was 50%. AregImpute had the smallest overall bias among all missing data methods. For continuous variable Z3, with missing rate up to 25%, all methods showed similar results with fine estimate results. However, when the missing rate was increased, CC and AregImpute had the largest bias while MICE and MICE-PMM methods showed the best performance.

Table 2 shows the bias of regression coefficient estimates β1, β2, β3 and the estimated standard deviation for frailty σ̂u when the number of cluster is M = 100 and the cluster size is mi = 5. Figure 2 summarizes the data in Table 2 as a graph. For all three covariates, performances of the five different missing data methods shown in Figure 2 are similar to those in Figure 1. For covariate Z1, the AregImpute method showed the smallest bias among all missing data methods.

For covariate Z2, MICE, and MICE-PMM showed a good performance at missing rate below 10%. However, the AregImpute method showed the best performance when the missing rates are above 25%. For covariate Z3, the SI method had the smallest bias for all missing rates. This shows that the SI method is not inferior to MI methods. However, CC showed the worst performance among all missing data methods at all missing rates.

In Figure 3, the bias of the frailty standard deviation estimate σ̂u is shown under the same settings of Figure 1 and Figure 2. Figure 3(a) shows regression coefficient estimates of the frailty when the number of clusters and cluster size are 10 and 50, respectively. Figure 3(b) shows regression coefficient estimates of the frailty when the number of clusters and the cluster size are 100 and 5, respectively. We can see that the bias is affected by the cluster size rather than the number of clusters since Figure 3(a) shows a rather small bias, with SI and MICE-PMM showing the smallest bias. However, the bias is rather large in

5. Example

We applied the missing data methods to analyze the data from biopsy-proven Crohn’s Disease patients who underwent abdominal surgery from January 2000 to December 2009 (Lee et al., 2012). Crohn’s disease (CD) is heterogeneous in nature. The only consistent factor is its inconsistency. There have been studies revealing risk factors for post-operative recurrences. However, only a few studies have evaluated risk factors for reoperation after the primary surgery in CD patients. The primary outcome of the study was time to have re-operation from CD recurrence. Data were collected from 627 patients in 18 different hospitals. Each hospital corresponded to a cluster. Patients in each cluster corresponded to cluster subjects. The number of cluster size ranged from 2 to 150 and the median size was 15. The following covariates were considered as risk factors: age at diagnosis, tumor location, and tumor behavior. These three variables were considered as Montreal classification variables. All these variables were categorical variables. We provided the basic description of each covariate and its missing rate in Table 3. We used four different imputation methods with CC method and obtained regression coefficients using a Bayesian adaptive B-spline method. For baseline hazard, a Weibull model was incorporated in the parametric model part. For the frailty density model, lognormal distribution was assumed. For model selection criterion, we used a modification version of the Deviance Information Criterion (Spiegelhalter et al., 2002). DIC is computed as

$DIC=D¯+V2,$

where and V are the sample mean and the variance of deviance calculated at each MCMC step, respectively. We run the sampling chain for 1500 iterations, discarding the first 500 iterates as burn-in and thinning the chain to every 10th sample.

In Table 4, DIC showed that SI (DIC = 1637.391) or MICE-PMM (DIC = 1716.785) method was a good choice for missing value imputation. These results are consistent with the simulation study. The real data coincides with the simulation case where M = 10, mi = 50 at missing rates of 5% and 10%. In this case, for categorical variable Z2, CC had a larger bias than the other four methods while SI had the smallest bias.

Table 5 shows the posterior mean and 95% posterior intervals for covariate effects and the standard deviation of the frailty. In all five imputation methods, signs and magnitudes of regression coefficients showed similar results.

No significant covariate affected the time to reoperation; however, in the SI method, Behavior B2 was ‘nearly significant’ since its 95% confidence intervals of regression coefficient was (−0.5168, 0.0690) and the upper bound was near a negative value. The effect was not statistically significant at p < 0.05; however, we could assume that patients who had stricturing behavior (B2) tended to have longer time to have a re-operation from CD recurrence compared to patients who had non-stricturing or non-penetrating behavior.

6. Discussion

This paper compared five different missing data methods (CC, SI, and three different multiple imputation methods: MI-MICE, MI-MICE-PMM, and MI-AregImpute) in the presence of missing covariates in clustered data. We estimated the regression coefficient through modeling the unknown baseline hazard and frailty density using B-spline in a Bayesian paradigm with incorporating prior distribution. Modeling is flexible and prior information is available when using a Bayesian approach. In the simulation study, performances of the five missing data methods were different depending on the missing rate and the type of covariate. At a missing rate of 5%, the CC analysis showed good performance to estimate the coefficient for a variable that did not have a missing value. However, CC had the worst performance for a covariate with missing values, even when the missing rate was very small. Single imputation methods showed a better performance than MI methods at a missing rate of 5% for covariates with missing values. This result shows that MI method does not always give better results than SI methods. For categorical variables, aregImpute method produced less biased regression coefficient estimates when 15% or more cases had missing data. However, MICE and MICE-PMM showed good results for continuous variable at a missing rate of 15% (or higher). The performance in handling missing values vary depending on the underlying distribution of covariates and the missing data mechanism. Our simulation results were based on a restricted data setting. Therefore, caution is needed when generalizing the study results. Further study is needed to obtain consistent results under various settings when missing covariates have stratification and time-dependence or skewness. Further simulation is also needed when different distributions of frailty and baseline hazard are assumed

Acknowledgements

This research was supported by the research grant of the Busan University of Foreign Studies in 2017 and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No.2017R1C1B5076671). This research was also supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2017R1D1A1B03028279).

Figures
Fig. 1. Regression coefficient estimates using five different missing data methods when M = 10 and mi = 50 under various missing rates.
Fig. 2. Regression coefficient estimates using five different missing data methods when M = 100 and mi = 5 under various missing rates.
Fig. 3. Regression coefficient estimates of the frailty under five different missing data methods with various missing rates.
TABLES

### Table 1

Bias of β̂ and σ̂u under various missing rates using five different missing data methods when M = 10 and mi = 50

Missing rateBias of β̂Bias of σ̂u

β̂1β̂2β̂3
CC5%0.011−0.051−0.052−0.005
10%0.0300.022−0.104−0.012
25%−0.0700.174−0.239−0.024
50%−0.3770.299−0.348−0.034

SI5%−0.114−0.024−0.0530.006
10%−0.1300.053−0.1080.001
25%−0.1280.157−0.2170.012
50%−0.1380.352−0.266−0.019

MICE5%−0.114−0.025−0.0510.006
10%−0.1330.054−0.1070.003
25%−0.1310.156−0.2160.006
50%−0.1430.336−0.255−0.017

MICE-PMM5%−0.115−0.031−0.0510.006
10%−0.1340.045−0.1060.002
25%−0.1280.163−0.2150.007
50%−0.1430.322−0.258−0.019

AregImpute5%−0.037−0.049−0.0510.000
10%−0.0150.003−0.098−0.009
25%0.0060.096−0.230−0.015
50%−0.1460.176−0.351−0.036

### Table 2

Bias of β̂ and σ̂u under various missing rates using five different missing data methods when M = 100 and mi = 5

Missing rateBias of β̂Bias of σ̂u

β̂1β̂2β̂3
CC5%0.025−0.077−0.0430.113
10%0.071−0.002−0.0860.121
25%−0.0260.124−0.2170.151
50%−0.3380.223−0.2990.214

SI5%−0.092−0.047−0.0260.118
10%−0.0870.010−0.0640.131
25%−0.0870.108−0.1650.139
50%−0.1030.297−0.2020.122

MICE5%−0.095−0.044−0.0290.120
10%−0.0850.008−0.0700.130
25%−0.0900.098−0.1730.132
50%−0.1060.296−0.2120.120

MICE-PMM5%−0.098−0.037−0.0300.120
10%−0.0940.020−0.0680.126
25%−0.0950.098−0.1720.134
50%−0.1100.303−0.2050.118

AregImpute5%−0.021−0.078−0.0360.112
10%0.034−0.040−0.0780.125
25%0.0370.047−0.2000.140
50%−0.1140.143−0.2980.182

### Table 3

Covariates and basic description and missing rates for the CD data

NameDescriptionMissing rate
Age at diagnosisA1 less than 16 years0.3%
A2 between 17 and 40 years
A3 older than 40 years
L1 ileal

Tumor locationL2 colonic8.1%
L3 ileocolonic
L4 isolated upper disease

Tumor behaviorB1 non-stricturing, non-penetrating8.1%
B2 stricturing
B3 penetrating

### Table 4

Deviance Information Criterion estimates using five different missing data methods

CCSIMICEMICE-PMMAregImp
1841.0911637.3911757.3151716.7851809.242

### Table 5

Posterior mean and 95% credible intervals of regression coefficients and frailty variance for the CD data using five missing data methods

CovariatesCCMICE-PMM

Posterior mean2.5% quantile97.5% quantilePosterior mean2.5% quantile97.5% quantile
Age linear−0.1366−0.59510.1619−0.1210−0.50470.2270
Location L20.2474−0.16450.73260.2067−0.12600.6014
Location L30.0931−0.25010.45360.0425−0.24860.3089
Location L40.0833−0.27300.59390.1457−0.25040.6150
Behavior B2−0.0897−0.47120.2559−0.0960−0.44370.1797
Behavior B3−0.0765−0.41550.3961−0.0954−0.46960.2143
Variance of frailties0.52360.14921.23870.52270.16091.2889

CovariatesSIAregImp

Posterior mean2.5% quantile97.5% quantilePosterior mean2.5% quantile97.5% quantile

Age linear−0.1595−0.71600.2415−0.1070−0.48100.2161
Location L20.3221−0.00130.68570.1692−0.08930.5136
Location L30.0273−0.27120.2865−0.0082−0.30040.2436
Location L40.0588−0.54350.66160.0631−0.29270.5398
Behavior B2−0.1824−0.51680.0690−0.0940−0.39350.1990
Behavior B3−0.0122−0.34030.2794−00258−0.33680.2837
Variance of frailties0.53440.22250.94470.52780.14651.2583

CovariatesMICE

Posterior mean2.5% quantile97.5% quantile

Age linear−0.1043−0.51040.2433
Location L20.1618−0.13440.5570
Location L30.0093−0.26650.2621
Location L40.0648−0.30290.4223
Behavior B2−0.0512−0.39190.2307
Behavior B3−0.0314−0.32520.2020
Variance of frailties0.51770.17271.0509

References
1. Brand, JPL (1999). Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets. Rotterdam: Erasmus University
2. Chen, HY, and Little, RJA (1999). Proportional hazards regression with missing covariates. Journal of the American Statistical Associations. 94, 896-908.
3. Clayton, DG (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika. 65, 141-152.
4. Frank, H (2010). Hmisc: Miscellaneous library for R statistical software. R package 3.9-0
5. Heitjan, DF, and Little, RJA (1991). Multiple imputation for the fatal accident reporting system. Journal of the Royal Statistical Society. Series C (Applied Statistics). 40, 13-29.
6. Lee, KY, Yu, CS, Lee, KY, Cho, YB, Park, KJ, Choi, GS, Yoon, SN, and Yoo, H (2012). Risk factors for repeat abdominal surgery in Korean patients with Crohn’s disease: a multiple-center study of a Korean inflammatory bowel disease study group. Journal of the Korean Society of Coloproctology. 28, 188-194.
7. Lipsitz, SR, and Ibrahim, JG (1996). Using the EM algorithm for survival data with incomplete categorical covariates. Lifetime Data Analysis. 2, 5-14.
8. Lipsitz, SR, and Ibrahim, JG (2000). Estimation with correlated censored survival data with missing covariates. Biostatistics. 1, 315-327.
9. Marshall, A, Altman, DG, Royston, P, and Holder, RL (2010). Comparison of techniques for handling missing covariate data within prognostic modeling studies: a simulation study. BMC Medical Research Methodology. 10.
10. Rubin, DB (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons
11. Schenker, N, and Taylor, JMG (1996). Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis. 22, 425-446.
12. Sharef, E, Strawderman, RL, Ruppert, D, Cowen, M, and Halasyamani, L (2010). Bayesian adaptive B-spline estimation in proportional hazard frailty models. Electronic Journal of Statistics. 4, 606-642.
13. Spiegelhalter, DJ, Best, NG, Carlin, BP, and Van der Linde, A (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 64, 583-639.
14. van Buuren, S (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research. 16, 219-242.
15. van Buuren, S, Boshuizen, HC, and Knook, DL (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 18, 681-694.
16. Vaupel, JW, Manton, KG, and Stallard, E (1979). The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography. 16, 439-454.
17. Zhou, H, and Pepe, MS (1995). Auxiliary covariate data in failure time regression. Biometrika. 82, 139-149.