Identifying informative variables is important in high-dimensional data analysis; however, the task becomes challenging when covariates are contaminated by measurement error, because the error biases the resulting estimates. In this article, we present a two-step approach for variable selection in the presence of measurement error. In the first step, we select important variables directly from the contaminated covariates as if there were no measurement error. In the second step, we apply orthogonal regression to obtain unbiased estimates of the regression coefficients of the variables identified in the first step. In addition, we propose a modification of the two-step approach to further enhance the variable selection performance. Various simulation studies demonstrate the promising performance of the proposed method.
With the growth of high-dimensional data, variable selection has become a primary task in statistical learning, since the prediction accuracy of the final model relies heavily on the selected variables. Since the proposal of the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), regularization has become one of the canonical approaches for variable selection owing to its computational speed and promising performance in high-dimensional settings. In addition to the
Measurement error in variables is commonly observed in practice. Let the dataset be (
where
where
Numerous methods have been developed for measurement error models (Carroll
In this paper, we propose a two-step procedure for variable selection in linear regression with measurement errors. The proposed procedure conducts selection and estimation separately. We first identify important variables without considering measurement error by applying conventional regularized methods directly to (
Orthogonal regression has been regarded as one of the popular choices for bias correction. The orthogonal regression assumes that
where
To illustrate this, we consider a set of data (
where
The least squares estimate for (
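The attenuation that motivates orthogonal regression can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a single covariate, equal error variances in the covariate and the response (the classical orthogonal, or Deming, setting with variance ratio one), and hypothetical values for the slope, sample size, and noise level.

```python
import math
import random


def _centered_sums(w, y):
    """Return (Sxx, Syy, Sxy), the centered sums of squares and cross-products."""
    n = len(w)
    mw, my = sum(w) / n, sum(y) / n
    sxx = sum((a - mw) ** 2 for a in w)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mw) * (b - my) for a, b in zip(w, y))
    return sxx, syy, sxy


def ols_slope(w, y):
    """Naive least squares slope; biased toward zero when w is contaminated."""
    sxx, _, sxy = _centered_sums(w, y)
    return sxy / sxx


def orthogonal_slope(w, y):
    """Orthogonal regression slope, assuming equal error variances in w and y."""
    sxx, syy, sxy = _centered_sums(w, y)
    d = syy - sxx
    return (d + math.sqrt(d * d + 4.0 * sxy * sxy)) / (2.0 * sxy)


# Hypothetical setting: true slope 2, error standard deviation 0.5 in both w and y.
random.seed(1)
beta = 2.0
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
w = [xi + random.gauss(0.0, 0.5) for xi in x]          # contaminated covariate
y = [beta * xi + random.gauss(0.0, 0.5) for xi in x]   # response

print("least squares slope:", round(ols_slope(w, y), 3))
print("orthogonal slope:   ", round(orthogonal_slope(w, y), 3))
```

Under these assumptions the least squares slope concentrates near 2 / (1 + 0.25) = 1.6, while the orthogonal slope concentrates near the true value 2.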
We develop a two-step procedure for variable selection in linear regression with measurement error. The key idea of the two-step procedure is to separate selection and estimation. In the first step we consider the following regularized linear regression for (
where
Notice that
in the first step.
Given , we can estimate by applying the orthogonal regression of
where , , , and is the covariance matrix of .
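The separation of selection and estimation can be sketched in code. This is a minimal illustration under assumptions that are not taken from the paper: step one uses a plain coordinate-descent LASSO with a fixed, hypothetical penalty level `lam` (in practice it would be tuned), and step two uses classical total least squares computed by SVD, which assumes that the measurement errors and the regression error share a common variance. The orthogonal regression used in the paper involves a covariance matrix and may differ from this simplified form. All function names are hypothetical.

```python
import numpy as np


def lasso_cd(W, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - Wb||^2 + lam * ||b||_1."""
    n, p = W.shape
    b = np.zeros(p)
    r = y.astype(float).copy()              # current residual y - W b
    col_ss = (W ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            rho = W[:, j] @ r / n + col_ss[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
            r += W[:, j] * (b[j] - new)     # keep the residual in sync
            b[j] = new
    return b


def orthogonal_regression(W, y):
    """Total least squares: smallest right singular vector of the centered [W, y]."""
    Z = np.column_stack([W, y])
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1]
    return -v[:-1] / v[-1]


def two_step(W, y, lam):
    """Step 1: select with LASSO on contaminated W. Step 2: refit by orthogonal regression."""
    Wc = W - W.mean(axis=0)
    yc = y - y.mean()
    selected = np.flatnonzero(lasso_cd(Wc, yc, lam))
    beta = np.zeros(W.shape[1])
    if selected.size:
        beta[selected] = orthogonal_regression(Wc[:, selected], yc)
    return selected, beta


# Hypothetical data: 5 informative and 15 uninformative covariates, all contaminated.
rng = np.random.default_rng(0)
n, p = 1000, 20
beta_true = np.zeros(p)
beta_true[:5] = 2.0
X = rng.standard_normal((n, p))
W = X + 0.5 * rng.standard_normal((n, p))        # observed, error-prone covariates
y = X @ beta_true + 0.5 * rng.standard_normal(n)

selected, beta_hat = two_step(W, y, lam=0.2)
print("selected indices:", selected)
print("estimates on selected:", np.round(beta_hat[selected], 2))
```

Because selection ignores the measurement error, the LASSO coefficients themselves are attenuated; only the support is passed on, and the bias is handled in the refit.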
The proposed two-step method shows promising performance; however, we empirically observe that uninformative variables are often selected in the first step due to the additional variability of
However, we also remark that this random partitioning approach may not work well when the sample size is small and/or the signal of the informative variables is weak, because such a signal is difficult to detect from only half of the data. We therefore recommend employing this modification in practice only when the sample size is large enough.
In the orthogonal regression (
where
where
be the difference between the two estimates. We propose to exploit the simulation-extrapolation (SIMEX) (Cook and Stefanski, 1994) to estimate Δ(
In particular, we denote
since
and finally, the SIMEX estimator of Δ_{SIMEX}(
and the estimated variance
Finally, we can find
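For readers unfamiliar with SIMEX, the generic simulation and extrapolation steps can be sketched as follows. This is not the estimator of Δ proposed in the paper; it is the textbook SIMEX recipe applied to a single regression slope, assuming a known measurement error standard deviation `sigma_u`, a quadratic extrapolant, and a hypothetical grid of noise multipliers.

```python
import numpy as np


def naive_slope(w, y):
    """Least squares slope of y on the contaminated covariate w."""
    w_c = w - w.mean()
    return float(w_c @ (y - y.mean()) / (w_c @ w_c))


def simex_slope(w, y, sigma_u, zetas=(0.0, 0.5, 1.0, 1.5, 2.0), n_rep=50, seed=0):
    """SIMEX: add extra noise with variance zeta * sigma_u^2, then extrapolate to zeta = -1."""
    rng = np.random.default_rng(seed)
    means = []
    for zeta in zetas:
        # Simulation step: average the naive estimate over n_rep remeasured data sets.
        sims = [
            naive_slope(w + np.sqrt(zeta) * sigma_u * rng.standard_normal(w.size), y)
            for _ in range(n_rep)
        ]
        means.append(np.mean(sims))
    # Extrapolation step: fit a quadratic in zeta and evaluate it at zeta = -1.
    coef = np.polyfit(zetas, means, deg=2)
    return float(np.polyval(coef, -1.0))


# Hypothetical setting: true slope 2, measurement error standard deviation 0.5.
rng = np.random.default_rng(1)
x = rng.standard_normal(2000)
w = x + 0.5 * rng.standard_normal(2000)
y = 2.0 * x + 0.5 * rng.standard_normal(2000)

naive = naive_slope(w, y)
simex = simex_slope(w, y, sigma_u=0.5)
print("naive:", round(naive, 3), "SIMEX:", round(simex, 3))
```

The quadratic extrapolant is only an approximation, so the SIMEX estimate typically removes most, but not all, of the attenuation.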
We conduct a simulation to investigate the performance of the two-step variable selection procedure and
Setting (
where
Independent (IND):
Autoregressive (AR):
For
For performance evaluation, we report the following three measures with their standard errors:
TP: average number of true positives, i.e., the number of important variables selected in the first step.
FP: average number of false positives, i.e., the number of unimportant variables selected in the first step.
MSE: the median of squared error ||
The first two measures, TP and FP, quantify the performance of the variable selection, while MSE measures the accuracy of the coefficient estimates. In our simulation study, a perfect method has 5 TP, 0 FP, and the lowest MSE.
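The three measures can be computed per replication as in the sketch below. It assumes that the squared error is the squared Euclidean distance between the estimated and true coefficient vectors; the function name and the example coefficient values are hypothetical.

```python
import statistics


def selection_metrics(beta_hat, beta_true):
    """Return (TP, FP, squared error) for one simulated data set."""
    selected = [b != 0 for b in beta_hat]
    important = [b != 0 for b in beta_true]
    tp = sum(s and t for s, t in zip(selected, important))
    fp = sum(s and not t for s, t in zip(selected, important))
    sq_err = sum((a - b) ** 2 for a, b in zip(beta_hat, beta_true))
    return tp, fp, sq_err


# Hypothetical example: 5 important and 15 unimportant variables.
beta_true = [1.0] * 5 + [0.0] * 15
beta_hat = [0.9, 1.1, 0.0, 1.0, 0.8, 0.2] + [0.0] * 14

tp, fp, sq_err = selection_metrics(beta_hat, beta_true)
print(tp, fp, round(sq_err, 3))

# Across replications, TP and FP are averaged and the squared errors are summarized.
replicated_errors = [sq_err, 0.5, 2.0]   # placeholder values for illustration only
print(statistics.median(replicated_errors))
```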
Table 1 and Table 2 report the simulation results when the measurement errors have independent and autoregressive structures, respectively. It is observed that our two-step approach outperforms POR, which is a one-step approach. Comparing the two versions of the two-step approach, the modified version substantially reduces false positives. We also note that our two-step approach using LASSO and MCP performs better than the one-step approach under all scenarios considered.
We next conduct additional simulations to examine the performance of our method to estimate
Table 3 describes the average estimated
For the real data illustration, we use the Boston housing data (Harrison and Rubinfeld, 1978) available in R. The response is the logarithm of the median value of owner-occupied homes in the Boston area. The data originally contain thirteen predictors, none of which is contaminated. To check the performance of the proposed method in the presence of measurement error, we first exclude two discrete predictors and marginally standardize the eleven continuous predictors. We then generate (
In this paper, we develop a two-step variable selection method for measurement error models. The proposed method is based on the idea of separating selection and estimation; it first selects significant variables from the contaminated covariates and then obtains the orthogonal regression estimates of the selected variables. Furthermore, we suggest a SIMEX-based method to estimate
This work is supported by the National Research Foundation of Korea Grant (NRF-2018R1D1A1B07043034).
Simulation result for independent predictors
Model | Method | TP | FP | MSE | TP | FP | MSE |
---|---|---|---|---|---|---|---|
M1 | Oracle | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) |
M1 | OLS | 5.00 (0.00) | 15.00 (0.00) | 0.254 (0.052) | 5.00 (0.00) | 15.00 (0.00) | 1.301 (0.122) |
M1 | OR | 5.00 (0.00) | 15.00 (0.00) | 0.115 (0.038) | 5.00 (0.00) | 15.00 (0.00) | 0.507 (0.235) |
M1 | PLS | 5.00 (0.00) | 0.74 (1.32) | 0.206 (0.051) | 5.00 (0.00) | 1.21 (1.72) | 1.254 (0.120) |
M1 | POR | 5.00 (0.00) | 1.12 (2.55) | 0.046 (0.043) | 5.00 (0.00) | 1.84 (1.74) | 0.236 (0.178) |
M1 | TS1 | 5.00 (0.00) | 0.74 (1.50) | 0.037 (0.035) | 5.00 (0.00) | 1.21 (2.08) | 0.182 (0.159) |
M1 | TS2 | 5.00 (0.00) | 0.54 (1.04) | 0.035 (0.027) | 5.00 (0.00) | 0.93 (1.23) | 0.142 (0.130) |
M2 | Oracle | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) |
M2 | OLS | 5.00 (0.00) | 15.00 (0.00) | 0.270 (0.053) | 5.00 (0.00) | 15.00 (0.00) | 1.351 (0.128) |
M2 | OR | 5.00 (0.00) | 15.00 (0.00) | 0.124 (0.039) | 5.00 (0.00) | 15.00 (0.00) | 0.560 (0.201) |
M2 | PLS | 5.00 (0.00) | 0.78 (1.88) | 0.224 (0.051) | 5.00 (0.00) | 1.52 (2.00) | 1.296 (0.126) |
M2 | POR | 5.00 (0.00) | 1.58 (2.64) | 0.055 (0.048) | 5.00 (0.00) | 2.04 (2.38) | 0.281 (0.203) |
M2 | TS1 | 5.00 (0.00) | 0.78 (1.75) | 0.037 (0.034) | 5.00 (0.00) | 1.52 (2.48) | 0.192 (0.165) |
M2 | TS2 | 5.00 (0.00) | 0.68 (1.36) | 0.034 (0.031) | 5.00 (0.00) | 1.08 (2.00) | 0.136 (0.147) |
M3 | Oracle | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) |
M3 | OLS | 5.00 (0.00) | 15.00 (0.00) | 1.817 (0.301) | 5.00 (0.00) | 15.00 (0.00) | 9.922 (0.909) |
M3 | OR | 5.00 (0.00) | 15.00 (0.00) | 0.534 (0.165) | 5.00 (0.00) | 15.00 (0.00) | 2.695 (1.189) |
M3 | PLS | 5.00 (0.00) | 1.45 (1.62) | 1.641 (0.314) | 4.95 (0.20) | 2.94 (2.55) | 9.871 (0.951) |
M3 | POR | 5.00 (0.00) | 1.86 (1.81) | 0.267 (0.173) | 4.94 (0.20) | 3.41 (2.68) | 1.604 (1.086) |
M3 | TS1 | 5.00 (0.00) | 1.45 (1.66) | 0.236 (0.159) | 4.95 (0.20) | 2.94 (2.56) | 1.361 (0.900) |
M3 | TS2 | 5.00 (0.00) | 1.00 (1.44) | 0.194 (0.135) | 4.85 (0.34) | 2.33 (2.30) | 1.179 (0.988) |
Averaged TP, FP, and MSE over 100 independent repetitions are reported under (M1)–(M3) where ∑
Simulation result for autoregressive predictors
Model | Method | TP | FP | MSE | TP | FP | MSE |
---|---|---|---|---|---|---|---|
M1 | Oracle | 5.00 (0.00) | 0.00 (0.00) | 0.011 (0.013) | 5.00 (0.00) | 0.00 (0.00) | 0.011 (0.013) |
M1 | OLS | 5.00 (0.00) | 15.00 (0.00) | 0.048 (0.014) | 5.00 (0.00) | 15.00 (0.00) | 0.134 (0.034) |
M1 | OR | 5.00 (0.00) | 15.00 (0.00) | 0.047 (0.014) | 5.00 (0.00) | 15.00 (0.00) | 0.078 (0.023) |
M1 | PLS | 5.00 (0.00) | 0.85 (1.42) | 0.013 (0.010) | 5.00 (0.00) | 0.86 (2.11) | 0.085 (0.035) |
M1 | POR | 5.00 (0.00) | 0.50 (2.12) | 0.012 (0.013) | 5.00 (0.00) | 1.21 (2.16) | 0.025 (0.023) |
M1 | TS1 | 5.00 (0.00) | 0.85 (1.50) | 0.015 (0.015) | 5.00 (0.00) | 0.86 (1.72) | 0.025 (0.024) |
M1 | TS2 | 5.00 (0.00) | 0.65 (1.44) | 0.014 (0.011) | 5.00 (0.00) | 0.65 (2.04) | 0.023 (0.021) |
M2 | Oracle | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) |
M2 | OLS | 5.00 (0.00) | 15.00 (0.00) | 0.045 (0.014) | 5.00 (0.00) | 15.00 (0.00) | 0.086 (0.026) |
M2 | OR | 5.00 (0.00) | 15.00 (0.00) | 0.046 (0.014) | 5.00 (0.00) | 15.00 (0.00) | 0.065 (0.020) |
M2 | PLS | 5.00 (0.00) | 1.08 (1.57) | 0.014 (0.010) | 5.00 (0.00) | 1.73 (2.92) | 0.045 (0.026) |
M2 | POR | 5.00 (0.00) | 0.47 (1.61) | 0.012 (0.009) | 5.00 (0.00) | 1.17 (2.68) | 0.018 (0.018) |
M2 | TS1 | 5.00 (0.00) | 1.08 (1.66) | 0.017 (0.016) | 5.00 (0.00) | 1.73 (2.78) | 0.023 (0.020) |
M2 | TS2 | 5.00 (0.00) | 0.74 (1.46) | 0.015 (0.013) | 5.00 (0.00) | 1.44 (2.11) | 0.022 (0.019) |
M3 | Oracle | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) | 5.00 (0.00) | 0.00 (0.00) | 0.010 (0.013) |
M3 | OLS | 5.00 (0.00) | 15.00 (0.00) | 0.064 (0.019) | 5.00 (0.00) | 15.00 (0.00) | 0.343 (0.074) |
M3 | OR | 5.00 (0.00) | 15.00 (0.00) | 0.059 (0.017) | 5.00 (0.00) | 15.00 (0.00) | 0.158 (0.044) |
M3 | PLS | 5.00 (0.00) | 1.00 (1.74) | 0.022 (0.015) | 5.00 (0.00) | 4.27 (4.01) | 0.223 (0.089) |
M3 | POR | 5.00 (0.00) | 1.22 (2.13) | 0.016 (0.015) | 5.00 (0.00) | 3.46 (3.52) | 0.094 (0.055) |
M3 | TS1 | 5.00 (0.00) | 1.00 (1.71) | 0.021 (0.020) | 5.00 (0.00) | 4.27 (4.24) | 0.083 (0.050) |
M3 | TS2 | 5.00 (0.00) | 0.77 (1.62) | 0.019 (0.016) | 5.00 (0.00) | 3.43 (3.64) | 0.078 (0.047) |
Averaged TP, FP, and MSE over 100 independent repetitions are reported under (M1)–(M3) where ∑
Simulation result for
True value | AVER (SE) | AVER (SE) |
---|---|---|
0.50 | 0.512 (0.090) | 0.638 (0.181) |
0.70 | 0.717 (0.093) | 0.834 (0.222) |
1.00 | 1.016 (0.110) | 1.118 (0.255) |
1.30 | 1.303 (0.119) | 1.409 (0.269) |
1.50 | 1.526 (0.129) | 1.600 (0.266) |
The averaged estimated
Real-data-based comparison results
  | Penalty | PLS | TS |   |
---|---|---|---|---|
20 | LASSO | 0.742 (0.065) | 0.637 (0.124) | 0.000 |
20 | SCAD | 0.698 (0.081) | 0.658 (0.176) | 0.013 |
20 | MCP | 0.696 (0.081) | 0.686 (0.220) | 0.333 |
30 | LASSO | 0.644 (0.049) | 0.557 (0.076) | 0.000 |
30 | SCAD | 0.607 (0.078) | 0.584 (0.104) | 0.000 |
30 | MCP | 0.606 (0.077) | 0.591 (0.105) | 0.004 |
Averaged root mean square error over 100 independent repetitions along with the corresponding standard deviations. The last column contains the