Many studies exist on the influence of one or few observations on estimators in a variety of statistical models under the “large
Much work has been done in regression diagnostics since Cook’s distance (Cook, 1977) was introduced forty years ago. The concept of regression diagnostics, considered in the classical linear model, has been extended to the Box-Cox transformation model (Box and Cox, 1964), the ridge regression model (Hoerl and Kennard, 1970), and nonparametric regression models such as the spline smoothing model, local polynomial regression, the semiparametric model, and the varying coefficient model. All of these regression diagnostic results have been obtained under the assumption of “large
High-dimensional data (small
In this paper, we study diagnostic issues in the LASSO regression model for high-dimensional data. Kim
In high-dimensional data, the influence of one or few observations on some estimators could be more serious and important because the number of observations is small compared to the “large
This paper is organized as follows. In Section 2, we derive the difference between the LASSO estimates based on the full sample and on the partial sample obtained by deleting some observations, under a simple design setup. Numerical studies based on artificial data sets are presented in Section 3, and an illustrative example based on a real data set is given in Section 4. Finally, concluding remarks are given in Section 5.
Consider a simple linear regression model with no intercept, i.e.,
where ∑
where sgn(
Note that
Now, by the first derivative of
Therefore, by noting
i.e.,
Since the second term of
Therefore,
which completes the proof.
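As a rough numerical illustration of how a LASSO estimate in the no-intercept simple regression model can change when observations are deleted, the following sketch may help. It assumes the objective (1/2) Σ (y_i − β x_i)² + λ|β|, whose minimizer is the soft-thresholded value sgn(Σ x_i y_i)(|Σ x_i y_i| − λ)₊ / Σ x_i². The scaling of the penalty, the absence of re-centering or re-scaling after deletion, the function names, and the toy data are all assumptions made for illustration and may differ from the setup used in the proposition.

```python
def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_1d(x, y, lam):
    """Minimize 0.5 * sum((y_i - b * x_i)**2) + lam * |b| over b (no intercept).

    Assumed objective scaling; the closed form is soft-thresholding of sum(x*y).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam) / sxx


def lasso_1d_deleted(x, y, lam, deleted):
    """Same estimator after dropping the observations whose 0-based indices are in `deleted`.

    The remaining data are NOT re-centered or re-scaled (an assumption of this sketch).
    """
    keep = [i for i in range(len(x)) if i not in deleted]
    return lasso_1d([x[i] for i in keep], [y[i] for i in keep], lam)


# Toy data, made up for illustration; the last response is an outlier.
x = [-1.5, -0.5, 0.0, 0.5, 1.5]
y = [-1.4, -0.6, 0.1, 0.4, 6.0]
lam = 1.0

full = lasso_1d(x, y, lam)
without_last = lasso_1d_deleted(x, y, lam, {4})
print("full sample estimate:   ", full)
print("after deleting case 5:  ", without_last)
print("difference (full - del):", full - without_last)
```

Passing a set with several indices to `lasso_1d_deleted` gives the multiple-case version of the same comparison.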
As a simple consequence of Proposition 1, the LASSO estimator based on (
Consider a simple linear regression model
where ∑
Step 1: Generate
Step 2: Generate
Step 3: Let
Step 4: Do centering and scaling to meet Σ
The LSE based on the artificial data is
In this numerical study, we investigate three important aspects of the influence of the
Consider a linear regression model
where
Step 1: Generate each
Step 2: Generate
Step 3: Let
Simulation (I) - Sensitivity of LASSO to a single observation
In this simulation study, we examine how sensitive the number of variables selected by the LASSO is to the deletion of one observation. Table 1 shows the LASSO estimators based on the full samples (i.e.,
Simulation (II) - Sensitivity of LASSO to a single outlier (
Here, we examine the variable selection performance of the LASSO in the outlier model (a model with one outlying observation) when the tuning parameter is chosen so that the LASSO selects 4 variables, i.e., the true number of variables. We considered three outlier models, where each model contains an outlier with
Simulation (III) - Sensitivity of LASSO to a single outlier (
Finally, we examine the variable selection performance of the LASSO in the outlier model when the tuning parameter is estimated, for example, by cross-validation. To do this, we compute the average proportion of the number of variables selected by the LASSO over 100 replications. If the number of selected variables is less than four, we assumed that 4 variables are selected by the LASSO. Table 3 shows the average proportion of the number of correctly selected variables. We considered three outlier models, where each model contains an outlier with
Even though the formula for multiple-case deletion was given in Proposition 1, the simulation study considered single-case deletion only. Because multiple-case deletion raises additional issues (the swamping phenomenon and the masking effect), further simulation studies for multiple-case deletion would be helpful.
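A single-case deletion study of the kind reported in the tables can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the simulations: it fits the LASSO by plain cyclic coordinate descent on the objective (1/2)‖y − Xβ‖² + λ‖β‖₁ with no intercept and a fixed λ, and the toy design (n = 20, p = 30, four nonzero coefficients, Gaussian noise), the value of λ, and all function names are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Return sgn(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-8):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the partial residual that excludes b_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def selected(beta, eps=1e-10):
    """1-based indices of nonzero coefficients."""
    return [j + 1 for j, b in enumerate(beta) if abs(b) > eps]


def leave_one_out_selection(X, y, lam):
    """Selected sets for the full data (key None) and for each single-case deletion."""
    out = {None: selected(lasso_cd(X, y, lam))}
    for k in range(len(X)):
        Xk = X[:k] + X[k + 1:]
        yk = y[:k] + y[k + 1:]
        out[k + 1] = selected(lasso_cd(Xk, yk, lam))
    return out


# Hypothetical toy design: n = 20, p = 30, first 4 coefficients nonzero.
rng = random.Random(0)
n, p = 20, 30
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
true_beta = [2.0] * 4 + [0.0] * (p - 4)
y = [sum(X[i][j] * true_beta[j] for j in range(p)) + rng.gauss(0.0, 0.5)
     for i in range(n)]

result = leave_one_out_selection(X, y, lam=8.0)
print("Deleted observation | Selected variables | Number of selected variables")
for key, sel in result.items():
    label = "Original data" if key is None else str(key)
    print(f"{label} | {' '.join(map(str, sel))} | {len(sel)}")
```

Extending `leave_one_out_selection` to delete pairs or larger subsets of cases would be one way to start exploring the multiple-case setting, at the cost of many more fits.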
As an illustrative example for the influence of observations on the variable selection in LASSO, we use the brain aging data of Lu
Table 4 shows that the LASSO selects 25 variables among 403 variables. If we delete one observation, then most cases result in selecting around 25 variables except the cases of deleting the 8
One or few observations could be very influential on estimators in “small
In future work, it would be worthwhile to study the basic building blocks that affect variable selection results. Despite the difficulties, it is also necessary to develop a robust version of the LASSO that is not sensitive to outliers.
This work was supported by a 2-year Research Grant of Pusan National University.
Selected variables by the LASSO with the full data set (
Deleted observation | Selected variables | Number of selected variables |
---|---|---|
Original data | 1 2 3 4 | 4 |
1 | 1 2 3 4 52 96 | 6 |
2 | 1 2 3 4 10 22 42 48 68 99 | 10 |
3 | 1 2 3 4 48 62 | 6 |
4 | 1 2 3 4 | 4 |
5 | 1 2 3 4 | 4 |
6 | 1 2 3 4 76 | 5 |
7 | 1 2 3 4 | 4 |
8 | 1 2 3 4 10 | 5 |
9 | 1 2 3 4 | 4 |
10 | 1 2 3 4 10 42 | 6 |
11 | 1 2 3 4 48 | 5 |
12 | 1 2 3 4 48 | 5 |
13 | 1 2 3 4 36 48 | 6 |
14 | 1 2 3 4 10 46 48 66 96 | 9 |
15 | 1 2 3 4 36 | 5 |
16 | 4 | 1 |
17 | 1 2 3 4 | 4 |
18 | 1 2 3 4 36 48 62 76 | 8 |
19 | 1 2 3 4 48 | 5 |
20 | 1 2 3 4 | 4 |
Average proportion of correctly selected variables among the 4 variables selected by the LASSO, over 100 replications
Deleted observation | |||
---|---|---|---|
None | 0.294 | 0.186 | 0.120 |
1 | 0.512 | 0.492 | 0.441 |
2 | 0.296 | 0.168 | 0.112 |
3 | 0.284 | 0.155 | 0.109 |
4 | 0.272 | 0.173 | 0.108 |
5 | 0.283 | 0.170 | 0.104 |
6 | 0.304 | 0.174 | 0.097 |
7 | 0.274 | 0.171 | 0.104 |
8 | 0.283 | 0.177 | 0.109 |
9 | 0.292 | 0.157 | 0.109 |
10 | 0.295 | 0.171 | 0.102 |
11 | 0.280 | 0.148 | 0.102 |
12 | 0.299 | 0.170 | 0.095 |
13 | 0.274 | 0.172 | 0.104 |
14 | 0.283 | 0.165 | 0.104 |
15 | 0.290 | 0.181 | 0.093 |
16 | 0.296 | 0.189 | 0.100 |
17 | 0.281 | 0.160 | 0.096 |
18 | 0.287 | 0.172 | 0.104 |
19 | 0.290 | 0.171 | 0.093 |
20 | 0.291 | 0.185 | 0.110 |
Three outlier models with
Average proportion of variables correctly selected by the LASSO, over 100 replications, when the tuning parameter is estimated by cross-validation
Deleted observation | |||
---|---|---|---|
None | 0.490 | 0.350 | 0.220 |
1 | 0.550 | 0.505 | 0.530 |
2 | 0.480 | 0.280 | 0.210 |
3 | 0.475 | 0.330 | 0.200 |
4 | 0.465 | 0.330 | 0.225 |
5 | 0.450 | 0.310 | 0.230 |
6 | 0.480 | 0.295 | 0.215 |
7 | 0.480 | 0.325 | 0.220 |
8 | 0.465 | 0.320 | 0.200 |
9 | 0.455 | 0.325 | 0.190 |
10 | 0.485 | 0.300 | 0.225 |
11 | 0.470 | 0.295 | 0.215 |
12 | 0.465 | 0.300 | 0.235 |
13 | 0.450 | 0.300 | 0.210 |
14 | 0.475 | 0.310 | 0.195 |
15 | 0.450 | 0.275 | 0.240 |
16 | 0.490 | 0.315 | 0.190 |
17 | 0.480 | 0.305 | 0.235 |
18 | 0.460 | 0.310 | 0.210 |
19 | 0.445 | 0.305 | 0.235 |
20 | 0.480 | 0.305 | 0.215 |
If the number of selected variables is less than four, we assumed that 4 variables are selected by the LASSO.
Three outlier models with
Selected variables by the LASSO fit in the brain aging data based on the original sample (
Deleted observation | Selected variables | Number of selected variables |
---|---|---|
Original data | 36 60 73 103 123 127 140 160 180 182 195 217 238 243 244 262 268 297 318 336 346 361 362 363 389 | 25 |
1 | 60 71 73 83 103 123 127 134 140 148 160 180 182 207 217 243 262 268 297 308 318 322 325 336 339 361 389 | 27 |
2 | 36 60 73 103 123 127 140 160 180 182 195 216 217 243 244 262 268 297 318 322 336 346 362 363 376 389 400 | 27 |
3 | 36 60 73 103 123 127 140 160 180 182 195 227 243 244 262 268 283 297 315 318 336 346 361 362 363 389 | 26 |
4 | 36 73 82 83 103 123 124 127 140 160 180 182 195 207 239 244 262 268 297 336 361 363 369 374 389 | 25 |
5 | 57 59 71 140 141 148 239 298 305 371 | 10 |
6 | 17 36 73 103 123 124 127 140 160 182 238 243 244 262 268 297 318 336 346 362 363 364 389 | 23 |
7 | 36 60 73 103 123 127 140 160 180 182 195 243 244 262 268 297 318 336 346 361 362 363 376 389 | 24 |
8 | 57 59 140 141 148 239 298 305 371 | 9 |
9 | 57 59 63 141 239 298 301 348 355 361 | 10 |
10 | 36 60 73 103 123 127 140 160 180 182 195 238 243 244 262 268 297 318 336 346 361 362 363 389 | 24 |
11 | 36 43 73 83 103 126 130 138 139 140 160 182 216 243 245 262 268 291 297 315 318 361 362 389 | 25 |
12 | 36 60 73 75 103 123 127 140 160 180 182 195 217 238 243 244 262 268 297 318 325 336 346 361 363 376 389 | 27 |
13 | 5 73 83 103 123 127 140 148 160 180 182 207 217 244 262 268 274 297 336 346 361 362 363 389 | 24 |
14 | 36 54 60 71 73 103 123 124 127 140 148 160 182 195 217 243 244 262 268 297 315 318 336 362 363 364 389 400 | 28 |
15 | 36 54 60 73 103 123 127 140 160 174 180 182 195 217 243 244 262 268 297 318 336 346 362 363 389 | 25 |
16 | 36 73 83 103 123 127 160 166 182 238 244 262 297 336 346 362 363 389 | 18 |
17 | 36 60 73 103 123 127 140 160 180 182 195 217 243 244 262 268 297 318 336 346 361 362 363 389 | 24 |
18 | 17 36 73 75 103 123 124 127 140 148 160 174 182 217 238 244 262 268 336 347 361 362 363 389 400 | 25 |
19 | 49 59 105 123 140 148 166 225 239 301 305 364 377 | 13 |
20 | 17 36 60 73 82 103 123 127 140 160 180 182 195 212 217 239 243 244 262 268 281 283 322 336 346 362 363 369 389 | 29 |
21 | 36 60 73 103 123 127 140 160 180 182 195 217 238 243 244 262 268 297 318 336 346 361 362 363 376 389 | 26 |
22 | 36 60 73 103 123 127 140 160 180 182 195 238 243 244 262 268 297 318 336 346 362 363 389 | 23 |
23 | 5 73 123 124 127 140 148 180 182 238 243 244 262 274 297 308 318 336 346 361 362 363 389 | 23 |
24 | 36 60 73 103 123 127 140 160 180 182 195 216 217 243 244 262 268 297 318 336 346 361 362 363 389 | 25 |
25 | 5 36 54 73 75 82 103 123 127 140 160 182 216 217 243 244 262 268 297 332 334 336 339 346 355 362 363 389 | 28 |
26 | 36 73 103 123 124 127 140 160 167 182 185 217 238 243 244 262 268 270 297 308 336 346 364 374 389 396 | 26 |
27 | 36 73 103 123 127 140 151 160 166 182 238 239 244 262 268 286 297 315 336 346 361 362 363 364 389 | 25 |
28 | 17 36 73 111 123 127 148 160 213 238 244 262 268 297 318 326 334 336 362 363 389 | 21 |
29 | 36 60 73 103 123 127 140 160 180 182 207 243 244 262 268 297 315 318 336 361 362 363 369 389 | 24 |
30 | 36 73 103 123 127 140 160 180 182 217 243 244 262 268 297 318 325 336 363 389 | 20 |