In many epidemiological studies, missing values in the outcome arise due to censoring. Such censoring is what makes survival analysis special and differentiated from other analytical methods. There are many methods that deal with censored data in survival analysis. However, few studies have dealt with missing covariates in survival data. Furthermore, studies dealing with missing covariates are rare when data are clustered. In this paper, we conducted a simulation study to compare results of several missing data methods when data had clustered multi-structured type with missing covariates. In this study, we modeled unknown baseline hazard and frailty with Bayesian B-Spline to obtain more smooth and accurate estimates. We also used prior information to achieve more accurate results. We assumed the missing mechanism as MAR. We compared the performance of five different missing data techniques and compared these results through simulation studies. We also presented results from a Multi-Center study of Korean IBD patients with Crohn’s disease (Lee
In epidemiological studies, missing covariate is a common problem in survival data and is often encountered in many statistical applications. The easiest way to deal with missing covariates is to discard missing observations and use only completely observed subjects. However, using only observed subjects may lead to biased estimates of parameters. As the proportion of missing data increases, there will be a substantial loss of efficiency. Rather than discarding missing data, statistical approaches can be applied under the presence of missing covariates. Zhou and Pepe (1995) have proposed an estimated partial likelihood method to estimate relative risk parameters. Lipsitz and Ibrahim (1996) have extended the EM by the method of weights to survival outcomes whose distributions may not fall in the class of generalized linear models. Chen and Little (1999) have used a nonparametric maximum likelihood to estimate regression parameters in a proportional hazards regression model with missing covariates. All these methods mentioned above can be used with the Cox model for univariate survival data with missing covariates. Imputing the missing data can be performed instead of using a likelihood based method to handle missing covariates. Likelihood based methods can be sophisticated. They generally require problem-specific programs. However, imputing the missing data enables the use of whole data and analysis can be performed easily using existing methods. Marshall
In this paper, we extended the Bayesian adaptive B-spline estimation method to clustered survival data with missing covariates. This study compares built-in imputation methods with CC analysis in the presence of missing covariates under clustered survival data using a Bayesian approach. We used the Bayesian adaptive B-spline estimation method proposed by Sharef
Missing covariates are imputed through four different imputation methods with CC analysis. These imputation methods included one SI and three multiple imputation methods. We will briefly describe each method below.
The simplest way to handle missing values in the data is to use only observed values (that is, only CC is used). The CC estimator is unbiased when the missing mechanism is missing complete at random (MCAR). However, if there are many variables with missing values, then a large proportion of observations may be dropped. This can cause bias in missing at random (MAR) and lose efficiency in MCAR.
Imputing a missing value with a predicted value is called predictive mean matching. It is used to impute a value randomly from a set of observed values whose predicted values are closest to predicted values from a specified regression model (Heitjan and Little 1991; Schenker and Taylor 1996).
Let
Multiple imputation is a three-step approach in estimating incomplete data regression models (Rubin, 1987). The notation of
The MI-MICE is an imputation method using multivariate imputation by chained equations. MICE is a software for imputing incomplete multivariate data by fully conditional specification. It appeared in the R package library in 2001. Fully conditional method specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each for incomplete variable (van Buuren, 2007). Under the assumption that a multivariate distribution exists from conditional distributions, MICE constructs a Gibbs sampler from specified conditionals. Let
where
MI-MICE-PMM is a method that imputes missing values under multiple imputation with regression switching (MICE) using predictive mean matching (PMM). The imputing method is the same as the SI except that the PMM imputation method is done in
Lastly, the flexible additive imputation model with PMM under multiple imputation method (MI-AregImpute) is performed under the library ‘aregImp’ in the MICE package. This method takes all aspects of uncertainty in imputations into account by using bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution (Frank, 2010). For each multiple imputation, different bootstrap resamples are used and a flexible additive model is fitted on a sample with replacement from the original data. This model is then used to predict all original values for the target variable.
Clustered failure time data occur when study subjects from the same cluster share common characteristics such that failure times within the same cluster are correlated. One possible mechanism is the existence of a common risk. When this common risk acts as a factor on hazard function, it is called frailty. The frailty model was first named by Vaupel
Let
where
In this study, we used mixtures of B-spline for the baseline hazard
where
Next, the frailty density function can be written as
where
For the regression parameter
Let
where (
When there is no missing covariate, parameters of interest can be obtained as shown above. Since there are missing covariates in the data, we used imputation methods to fill in the missing values. In addition, we also performed CC analysis and obtained coefficients of parameters. Let the complete data be noted by
where
The basic scheme of the process is given as:
Step 1: For each cluster
Imputation step: Generate missing values
Posterior step: Draw parameters
Repeating the above steps will generate the following Markov Chain
These two steps are iterated with a starting value
Step 2: For
Step 3: Combine
We conducted a simulation study to investigate the performance of various missing data methods under several different data settings. Three covariates
Missing data were then generated as:
where
For discrete covariate
Table 2 shows the bias of regression coefficient estimates
For covariate
In Figure 3, the bias of the frailty standard deviation estimate
We applied the missing data methods to analyze the data from biopsy-proven Crohn’s Disease patients who underwent abdominal surgery from January 2000 to December 2009 (Lee
where
In Table 4, DIC showed that SI (DIC = 1637.391) or MICE-PMM (DIC = 1716.785) method was a good choice for missing value imputation. These results are consistent with the simulation study. The real data coincides with the simulation case where
Table 5 shows the posterior mean and 95% posterior intervals for covariate effects and the standard deviation of the frailty. In all five imputation methods, signs and magnitudes of regression coefficients showed similar results.
No significant covariate affected the time to reoperation; however, in the SI method, Behavior B2 was ‘nearly significant’ since its 95% confidence intervals of regression coefficient was (−0.5168, 0.0690) and the upper bound was near a negative value. The effect was not statistically significant at
This paper compared five different missing data methods (CC, SI, and three different multiple imputation methods: MI-MICE, MI-MICE-PMM, and MI-AregImpute) in the presence of missing covariates in clustered data. We estimated the regression coefficient through modeling the unknown baseline hazard and frailty density using B-spline in a Bayesian paradigm with incorporating prior distribution. Modeling is flexible and prior information is available when using a Bayesian approach. In the simulation study, performances of the five missing data methods were different depending on the missing rate and the type of covariate. At a missing rate of 5%, the CC analysis showed good performance to estimate the coefficient for a variable that did not have a missing value. However, CC had the worst performance for a covariate with missing values, even when the missing rate was very small. Single imputation methods showed a better performance than MI methods at a missing rate of 5% for covariates with missing values. This result shows that MI method does not always give better results than SI methods. For categorical variables, aregImpute method produced less biased regression coefficient estimates when 15% or more cases had missing data. However, MICE and MICE-PMM showed good results for continuous variable at a missing rate of 15% (or higher). The performance in handling missing values vary depending on the underlying distribution of covariates and the missing data mechanism. Our simulation results were based on a restricted data setting. Therefore, caution is needed when generalizing the study results. Further study is needed to obtain consistent results under various settings when missing covariates have stratification and time-dependence or skewness. Further simulation is also needed when different distributions of frailty and baseline hazard are assumed
This research was supported by the research grant of the Busan University of Foreign Studies in 2017 and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No.2017R1C1B5076671). This research was also supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2017R1D1A1B03028279).
Bias of
Missing rate | Bias of | Bias of | |||
---|---|---|---|---|---|
CC | 5% | 0.011 | −0.051 | −0.052 | −0.005 |
10% | 0.030 | 0.022 | −0.104 | −0.012 | |
25% | −0.070 | 0.174 | −0.239 | −0.024 | |
50% | −0.377 | 0.299 | −0.348 | −0.034 | |
SI | 5% | −0.114 | −0.024 | −0.053 | 0.006 |
10% | −0.130 | 0.053 | −0.108 | 0.001 | |
25% | −0.128 | 0.157 | −0.217 | 0.012 | |
50% | −0.138 | 0.352 | −0.266 | −0.019 | |
MICE | 5% | −0.114 | −0.025 | −0.051 | 0.006 |
10% | −0.133 | 0.054 | −0.107 | 0.003 | |
25% | −0.131 | 0.156 | −0.216 | 0.006 | |
50% | −0.143 | 0.336 | −0.255 | −0.017 | |
MICE-PMM | 5% | −0.115 | −0.031 | −0.051 | 0.006 |
10% | −0.134 | 0.045 | −0.106 | 0.002 | |
25% | −0.128 | 0.163 | −0.215 | 0.007 | |
50% | −0.143 | 0.322 | −0.258 | −0.019 | |
AregImpute | 5% | −0.037 | −0.049 | −0.051 | 0.000 |
10% | −0.015 | 0.003 | −0.098 | −0.009 | |
25% | 0.006 | 0.096 | −0.230 | −0.015 | |
50% | −0.146 | 0.176 | −0.351 | −0.036 |
Bias of
Missing rate | Bias of | Bias of | |||
---|---|---|---|---|---|
CC | 5% | 0.025 | −0.077 | −0.043 | 0.113 |
10% | 0.071 | −0.002 | −0.086 | 0.121 | |
25% | −0.026 | 0.124 | −0.217 | 0.151 | |
50% | −0.338 | 0.223 | −0.299 | 0.214 | |
SI | 5% | −0.092 | −0.047 | −0.026 | 0.118 |
10% | −0.087 | 0.010 | −0.064 | 0.131 | |
25% | −0.087 | 0.108 | −0.165 | 0.139 | |
50% | −0.103 | 0.297 | −0.202 | 0.122 | |
MICE | 5% | −0.095 | −0.044 | −0.029 | 0.120 |
10% | −0.085 | 0.008 | −0.070 | 0.130 | |
25% | −0.090 | 0.098 | −0.173 | 0.132 | |
50% | −0.106 | 0.296 | −0.212 | 0.120 | |
MICE-PMM | 5% | −0.098 | −0.037 | −0.030 | 0.120 |
10% | −0.094 | 0.020 | −0.068 | 0.126 | |
25% | −0.095 | 0.098 | −0.172 | 0.134 | |
50% | −0.110 | 0.303 | −0.205 | 0.118 | |
AregImpute | 5% | −0.021 | −0.078 | −0.036 | 0.112 |
10% | 0.034 | −0.040 | −0.078 | 0.125 | |
25% | 0.037 | 0.047 | −0.200 | 0.140 | |
50% | −0.114 | 0.143 | −0.298 | 0.182 |
Covariates and basic description and missing rates for the CD data
Name | Description | Missing rate |
---|---|---|
Age at diagnosis | A1 less than 16 years | 0.3% |
A2 between 17 and 40 years | ||
A3 older than 40 years | ||
L1 ileal | ||
Tumor location | L2 colonic | 8.1% |
L3 ileocolonic | ||
L4 isolated upper disease | ||
Tumor behavior | B1 non-stricturing, non-penetrating | 8.1% |
B2 stricturing | ||
B3 penetrating |
Deviance Information Criterion estimates using five different missing data methods
CC | SI | MICE | MICE-PMM | AregImp |
---|---|---|---|---|
1841.091 | 1637.391 | 1757.315 | 1716.785 | 1809.242 |
Posterior mean and 95% credible intervals of regression coefficients and frailty variance for the CD data using five missing data methods
Covariates | CC | MICE-PMM | ||||
---|---|---|---|---|---|---|
Posterior mean | 2.5% quantile | 97.5% quantile | Posterior mean | 2.5% quantile | 97.5% quantile | |
Age linear | −0.1366 | −0.5951 | 0.1619 | −0.1210 | −0.5047 | 0.2270 |
Age quad | −0.0455 | −0.3359 | 0.2131 | −0.0881 | −0.3957 | 0.2028 |
Location L2 | 0.2474 | −0.1645 | 0.7326 | 0.2067 | −0.1260 | 0.6014 |
Location L3 | 0.0931 | −0.2501 | 0.4536 | 0.0425 | −0.2486 | 0.3089 |
Location L4 | 0.0833 | −0.2730 | 0.5939 | 0.1457 | −0.2504 | 0.6150 |
Behavior B2 | −0.0897 | −0.4712 | 0.2559 | −0.0960 | −0.4437 | 0.1797 |
Behavior B3 | −0.0765 | −0.4155 | 0.3961 | −0.0954 | −0.4696 | 0.2143 |
Variance of frailties | 0.5236 | 0.1492 | 1.2387 | 0.5227 | 0.1609 | 1.2889 |
Covariates | SI | AregImp | ||||
Posterior mean | 2.5% quantile | 97.5% quantile | Posterior mean | 2.5% quantile | 97.5% quantile | |
Age linear | −0.1595 | −0.7160 | 0.2415 | −0.1070 | −0.4810 | 0.2161 |
Age quad | −0.1035 | −0.5017 | 0.2009 | −0.0486 | −0.3738 | 0.2752 |
Location L2 | 0.3221 | −0.0013 | 0.6857 | 0.1692 | −0.0893 | 0.5136 |
Location L3 | 0.0273 | −0.2712 | 0.2865 | −0.0082 | −0.3004 | 0.2436 |
Location L4 | 0.0588 | −0.5435 | 0.6616 | 0.0631 | −0.2927 | 0.5398 |
Behavior B2 | −0.1824 | −0.5168 | 0.0690 | −0.0940 | −0.3935 | 0.1990 |
Behavior B3 | −0.0122 | −0.3403 | 0.2794 | −00258 | −0.3368 | 0.2837 |
Variance of frailties | 0.5344 | 0.2225 | 0.9447 | 0.5278 | 0.1465 | 1.2583 |
Covariates | MICE | |||||
Posterior mean | 2.5% quantile | 97.5% quantile | ||||
Age linear | −0.1043 | −0.5104 | 0.2433 | |||
Age quad | −0.0567 | −0.3539 | 0.2395 | |||
Location L2 | 0.1618 | −0.1344 | 0.5570 | |||
Location L3 | 0.0093 | −0.2665 | 0.2621 | |||
Location L4 | 0.0648 | −0.3029 | 0.4223 | |||
Behavior B2 | −0.0512 | −0.3919 | 0.2307 | |||
Behavior B3 | −0.0314 | −0.3252 | 0.2020 | |||
Variance of frailties | 0.5177 | 0.1727 | 1.0509 |