Effect of outliers on the variable selection by the regularized regression

Junho Jeong and Choongrak Kim

Department of Statistics, Pusan National University, Korea
Correspondence to: Department of Statistics, Pusan National University, 2, Busandaehak-ro 63 beon-gil, Geumjeonggu, Busan 46241, Korea. E-mail: crkim@pusan.ac.kr
Received December 20, 2017; Revised January 29, 2018; Accepted March 9, 2018.
Abstract

Many studies exist on the influence of one or a few observations on estimators in a variety of statistical models under the “large n, small p” setup; however, diagnostic issues in regression models have rarely been studied in the high-dimensional setup. In high-dimensional data, the influence of observations is more serious because the sample size n is significantly smaller than the number of variables p. Here, we investigate the influence of observations on the least absolute shrinkage and selection operator (LASSO) estimates, suggested by Tibshirani (Journal of the Royal Statistical Society, Series B, 58, 267–288, 1996), and on the variables selected by the LASSO in the high-dimensional setup. We also derive an analytic expression for the influence of k observations on the LASSO estimates in simple linear regression. Numerical studies based on artificial and real data are given for illustration. The numerical results show that the influence of observations on the LASSO estimates and on the selected variables is more severe in the high-dimensional setup than in the usual “large n, small p” setup.

Keywords : high-dimension, influential observation, LASSO, outlier, regularization
1. Introduction

Much work has been done in regression diagnostics since Cook’s distance (Cook, 1977) was introduced forty years ago. The concept of regression diagnostics, originally considered in the classical linear model, has been extended to the Box-Cox transformation model (Box and Cox, 1964), the ridge regression model (Hoerl and Kennard, 1970), and nonparametric regression models such as the spline smoothing model, local polynomial regression, the semiparametric model, and the varying coefficient model. All of these diagnostic results were obtained under the “large n, small p” assumption, i.e., that the number of unknown parameters to be estimated is less than the number of samples in the data.

High-dimensional data (small n, large p) are common in areas such as information technology, bioinformatics, astronomy, and finance. Classical statistical methods such as least squares estimation in the linear model cannot be used with high-dimensional data. Recently, many methodological and computational advances have allowed high-dimensional data to be analyzed efficiently; in particular, the least absolute shrinkage and selection operator (LASSO), introduced by Tibshirani (1996), remains an important statistical tool for high-dimensional data.

In this paper, we study diagnostic issues in the LASSO regression model for high-dimensional data. Kim et al. (2015) recently derived an approximate version of Cook’s distance in the LASSO regression; however, it is based on the “large n, small p” assumption. The Cook’s distance for the LASSO model suggested by Kim et al. (2015) cannot be used directly with high-dimensional data because the covariance estimator of the LASSO estimator, which is necessary for defining a version of Cook’s distance, is not easily derived in the high-dimensional setting. Further, under the high-dimensional setup, we are more interested in the influence of observations on the variable selection than in their influence on estimators. Studies on diagnostic measures for the LASSO model in high-dimensional data are relatively few. Among them, Zhao et al. (2013) proposed an influence measure for marginal correlations between the response and all the predictors, and Jang and Anderson-Cook (2017) suggested influence plots for the LASSO. In this paper, we focus on the influence of one or a few observations on the variable selection by the LASSO using the deletion method. The influence on the variable selection in the classical model via least squares was studied by Bae et al. (2017), and the selection of a smoothing parameter in the robust LASSO was considered by Kim and Lee (2017).

In high-dimensional data, the influence of one or a few observations on estimators can be more serious and important because the number of observations is small compared to the “large n, small p” setup. We show that the variable selection result of the LASSO regression can change significantly when one or a few observations are deleted. LASSO estimates often do not have an analytic form; therefore, we assume that the design matrix is orthogonal.

This paper is organized as follows. In Section 2, the difference between the LASSO estimates based on the full sample and on the partial sample after deleting some observations is derived under a simple design setup. Numerical studies based on artificial data sets are given in Section 3, and an illustrative example based on a real data set is given in Section 4. Finally, concluding remarks are given in Section 5.

2. Case influence diagnostics in LASSO

### 2.1. LASSO estimator based on partial samples

Consider a simple linear regression model with no intercept, i.e.,

$y_i = \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2.1)$

where $\sum x_i = 0$, $\sum x_i^2 = 1$, and $\sum y_i = 0$. Then, it can be shown (see, e.g., Tibshirani, 1996) that the LASSO estimator of β under model (2.1) is

$\hat{\beta}_L(\lambda) = \operatorname{sgn}(\hat{\beta})\,(|\hat{\beta}| - \lambda)_+,$

where sgn(x) denotes the sign of x and $\hat{\beta} = \sum x_i y_i$ is the least squares estimator (LSE) of β. Now, let $K = \{i_1, i_2, \ldots, i_k\}$ be an index set of size k, and let $\hat{\beta}_{L(K)}(\lambda)$ be the LASSO estimator of β based on the $(n-k)$ observations remaining after deleting the k observations in K. Then, we have the following result.
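Under this design, the LASSO estimate is simply a soft-thresholded LSE. The sketch below checks the closed form against a brute-force grid minimization of the objective; the numerical values are illustrative (0.346 echoes Section 3.1), not part of the derivation.

```python
import numpy as np

def soft_threshold(b_hat, lam):
    # LASSO estimate under model (2.1): sgn(b_hat) * (|b_hat| - lam)_+
    return np.sign(b_hat) * max(abs(b_hat) - lam, 0.0)

# Since sum x_i^2 = 1, the LASSO objective equals (up to a constant)
# 0.5 * (b - b_hat)^2 + lam * |b|; minimize it on a fine grid as a check.
b_hat, lam = 0.346, 0.2
closed = soft_threshold(b_hat, lam)

grid = np.linspace(-2.0, 2.0, 400001)
brute = grid[np.argmin(0.5 * (grid - b_hat) ** 2 + lam * np.abs(grid))]
```

The grid minimizer agrees with the closed form up to the grid spacing, which is all that is needed to illustrate the soft-thresholding identity.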

Proposition 1

Under model (2.1),

$\hat{\beta}_{L(K)}(\lambda) = \operatorname{sgn}\Big(\hat{\beta} - \sum_{i\in K} x_i y_i\Big)\Big(1 - \sum_{i\in K} x_i^2\Big)^{-1}\Big(\Big|\hat{\beta} - \sum_{i\in K} x_i y_i\Big| - \lambda\Big)_+.$
Proof

Note that $\hat{\beta}_{L(K)}(\lambda) = \arg\min_\beta f(\beta)$, where $f(\beta) = \frac{1}{2}\sum_{j\notin K}(y_j - \beta x_j)^2 + \lambda|\beta|$. Since $\sum_{j=1}^{n} x_j^2 = 1$, we can write

$f(\beta) = \frac{1}{2}\sum_{j=1}^{n}(y_j - \beta x_j)^2 - \frac{1}{2}\sum_{i\in K}(y_i - \beta x_i)^2 + \lambda|\beta| = \frac{1}{2}\sum_{j=1}^{n}(y_j - \hat{\beta} x_j)^2 + \frac{1}{2}(\hat{\beta} - \beta)^2 - \frac{1}{2}\sum_{i\in K}(y_i - \beta x_i)^2 + \lambda|\beta|.$

Now, differentiating f(β) with respect to β, we have

$f'(\beta) = -\hat{\beta} + \beta + \sum_{i\in K} x_i(y_i - \beta x_i) + \lambda\cdot\operatorname{sgn}(\beta).$

Therefore, setting $f'(\beta) = 0$ and noting that $x = \operatorname{sgn}(x)\,|x|$, we have

$\hat{\beta} = \Big(1 - \sum_{i\in K} x_i^2\Big)\beta + \sum_{i\in K} x_i y_i + \lambda\cdot\operatorname{sgn}(\beta) = \operatorname{sgn}(\beta)\,|\beta|\Big(1 - \sum_{i\in K} x_i^2\Big) + \sum_{i\in K} x_i y_i + \lambda\cdot\operatorname{sgn}(\beta) = \operatorname{sgn}(\beta)\Big\{\Big(1 - \sum_{i\in K} x_i^2\Big)|\beta| + \lambda\Big\} + \sum_{i\in K} x_i y_i,$

i.e.,

$\hat{\beta} - \sum_{i\in K} x_i y_i = \operatorname{sgn}(\beta)\Big\{\Big(1 - \sum_{i\in K} x_i^2\Big)|\beta| + \lambda\Big\}. \qquad (2.2)$

Since the factor in braces on the right-hand side of Equation (2.2) is positive (provided $\sum_{i\in K} x_i^2 < 1$), we must have $\operatorname{sgn}(\hat{\beta} - \sum_{i\in K} x_i y_i) = \operatorname{sgn}(\beta)$. Now,

$\Big(1 - \sum_{i\in K} x_i^2\Big)\beta = \hat{\beta} - \sum_{i\in K} x_i y_i - \lambda\cdot\operatorname{sgn}(\beta) = \operatorname{sgn}\Big(\hat{\beta} - \sum_{i\in K} x_i y_i\Big)\Big|\hat{\beta} - \sum_{i\in K} x_i y_i\Big| - \lambda\cdot\operatorname{sgn}(\beta) = \operatorname{sgn}\Big(\hat{\beta} - \sum_{i\in K} x_i y_i\Big)\Big(\Big|\hat{\beta} - \sum_{i\in K} x_i y_i\Big| - \lambda\Big) = \operatorname{sgn}\Big(\hat{\beta} - \sum_{i\in K} x_i y_i\Big)\Big(\Big|\hat{\beta} - \sum_{i\in K} x_i y_i\Big| - \lambda\Big)_+.$

Therefore,

$\hat{\beta}_{L(K)}(\lambda) = \operatorname{sgn}\Big(\hat{\beta} - \sum_{i\in K} x_i y_i\Big)\Big(1 - \sum_{i\in K} x_i^2\Big)^{-1}\Big(\Big|\hat{\beta} - \sum_{i\in K} x_i y_i\Big| - \lambda\Big)_+,$

which completes the proof.

Remark 1

As a simple consequence of Proposition 1, the LASSO estimator based on (n – 1) observations after deleting the ith observation is

$\hat{\beta}_{L(i)}(\lambda) = \operatorname{sgn}(\hat{\beta} - x_i y_i)(1 - x_i^2)^{-1}(|\hat{\beta} - x_i y_i| - \lambda)_+.$
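Remark 1 can be checked numerically by comparing the closed form with a brute-force minimization of the leave-one-out LASSO objective. The data, seed, tuning parameter, and deleted index below are arbitrary illustrations satisfying the paper's design constraints:

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed for illustration
n = 20
x = rng.normal(size=n)
x -= x.mean()
x /= np.sqrt((x ** 2).sum())                   # sum x_i = 0, sum x_i^2 = 1
y = 0.3 * x + rng.normal(scale=0.1, size=n)
y -= y.mean()                                  # sum y_i = 0

b_hat = (x * y).sum()                          # least squares estimate
lam, i = 0.05, 4                               # tuning parameter and deleted case

# Closed form from Remark 1
num = b_hat - x[i] * y[i]
closed = np.sign(num) * max(abs(num) - lam, 0.0) / (1.0 - x[i] ** 2)

# Brute force: minimize (1/2) sum_{j != i} (y_j - b x_j)^2 + lam * |b| on a grid
mask = np.arange(n) != i
grid = np.linspace(-1.0, 1.0, 200001)
resid = y[mask, None] - grid[None, :] * x[mask, None]
obj = 0.5 * (resid ** 2).sum(axis=0) + lam * np.abs(grid)
brute = grid[obj.argmin()]
```

The two values agree up to the grid resolution, which is the point of the check.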
### 2.2. Case influence in LASSO

Proposition 2

Without loss of generality, we assume that $\hat{\beta} > 0$. Then $\hat{\beta}_L(\lambda) = (\hat{\beta} - \lambda)_+$. Now, we further assume that $\hat{\beta} \ge \sum_{i\in K} x_i y_i \ge 0$. Then, it is easy to show that

$\hat{\beta}_{L(K)}(\lambda) = \Big(\sum_{i\notin K} x_i^2\Big)^{-1}\Big(\hat{\beta}_L(\lambda) - \sum_{i\in K} x_i y_i\Big).$

To see the relationship between $\hat{\beta}_L(\lambda)$ and $\hat{\beta}_{L(K)}(\lambda)$ in terms of λ (see Figure 1), we consider the single-case deletion, i.e., K = {i}. First, if $\lambda \ge \hat{\beta}$, then $\hat{\beta}_L(\lambda) = \hat{\beta}_{L(i)}(\lambda) = 0$. Second, if $\lambda \le \hat{\beta} - x_i y_i$, then $\hat{\beta}_{L(i)}(\lambda) = (\hat{\beta}_L(\lambda) - x_i y_i)/\sum_{j\ne i} x_j^2$. Finally, if $\hat{\beta} - x_i y_i < \lambda < \hat{\beta}$, then $\hat{\beta}_L(\lambda) = \hat{\beta} - \lambda$ while $\hat{\beta}_{L(i)}(\lambda) = 0$; that is, deleting the ith observation results in the covariate not being selected. Therefore, if λ is chosen in this range, the ith observation is said to be influential on the variable selection.
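The three λ regimes can be illustrated directly from the closed forms. Below, $\hat{\beta} = 0.346$ and $\hat{\beta} - x_i y_i = 0.179$ echo the Section 3.1 values, while the split into $x_i y_i = 0.167$ and $x_i^2 = 0.25$ is hypothetical; the function names are ours, not the paper's:

```python
def lasso_full(b_hat, lam):
    # full-sample LASSO estimate, assuming b_hat > 0
    return max(b_hat - lam, 0.0)

def lasso_drop_i(b_hat, xy, xi2, lam):
    # delete-one LASSO estimate from Remark 1, assuming b_hat - xy >= 0
    # (xy stands for x_i * y_i, xi2 for x_i^2; both illustrative here)
    return max(b_hat - xy - lam, 0.0) / (1.0 - xi2)

b_hat, xy, xi2 = 0.346, 0.167, 0.25

# Regime 1: lam >= b_hat          -> both estimates zero
# Regime 2: lam <= b_hat - xy     -> both estimates nonzero
# Regime 3: b_hat - xy < lam < b_hat -> kept on full data, dropped after deletion
lam = 0.25
full = lasso_full(b_hat, lam)          # positive: variable selected
drop = lasso_drop_i(b_hat, xy, xi2, lam)  # zero: variable not selected
```

With λ = 0.25 inside (0.179, 0.346), the full-data fit keeps the covariate while the delete-i fit drops it, exactly the influential regime described above.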

3. Numerical studies

### 3.1. Influence on coefficient estimates

Consider a simple linear regression model

$y_i = \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n,$

where $\sum x_i = 0$, $\sum x_i^2 = 1$, and $\sum y_i = 0$. We generate n = 20 observations by the following steps.

• Step 1: Generate $x_i$ from $N(0, 1)$.

• Step 2: Generate $\varepsilon_i$ from $N(0, 0.1^2)$.

• Step 3: Let $y_i = 0.1 x_i + \varepsilon_i$.

• Step 4: Center and scale so that $\sum x_i = 0$, $\sum x_i^2 = 1$, and $\sum y_i = 0$.

The LSE based on the artificial data is $\hat{\beta} = 0.346$. When the 20th observation is deleted, $\hat{\beta} - x_{20}y_{20} = 0.179$. Therefore, if $0.179 < \lambda < 0.346$, then $\hat{\beta}_L(\lambda) = 0.346 - \lambda$ while $\hat{\beta}_{L(20)}(\lambda) = 0$.
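The four generation steps can be sketched as follows. The seed is arbitrary, so the resulting $\hat{\beta}$ will differ from the reported 0.346; only the structure of the computation is meant to match:

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed; the paper's draws are unknown
n = 20
x = rng.normal(0.0, 1.0, n)           # Step 1
eps = rng.normal(0.0, 0.1, n)         # Step 2
y = 0.1 * x + eps                     # Step 3
x -= x.mean()                         # Step 4: centering and scaling
x /= np.sqrt((x ** 2).sum())
y -= y.mean()

b_hat = (x * y).sum()                 # LSE of beta
# For any lam in (threshold, b_hat), deleting observation 20 drops the variable
threshold = b_hat - x[-1] * y[-1]
```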

### 3.2. Influence on variable selections

In this numerical study, we investigate three aspects of the influence of the ith observation on the variable selection in the “small n, large p” setup. First, we show that the number of variables selected by the LASSO can be very sensitive to the deletion of a single observation; this number has important implications for prediction and for estimating the degrees of freedom. Second, we examine the variable selection performance of the LASSO in the outlier model (a model with one outlying observation) when the tuning parameter is fixed so that the LASSO selects the true number of variables. Third, we examine how correctly the LASSO selects variables in the outlier model when the tuning parameter is estimated by cross-validation. To do this, we generate random numbers from the following model.

Consider a linear regression model

$y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, n,$

where $\beta = (5, 5, 5, 5, 0, \ldots, 0)^t$ is a 100-dimensional vector. We generate n = 20 observations by the following steps.

• Step 1: Generate each $X_j$, j = 1, . . . , 100, from $N(5, 1)$.

• Step 2: Generate $\varepsilon_i$ from $N(0, 0.1^2)$.

• Step 3: Let $y_i = x_i'\beta + \varepsilon_i$.

• Simulation (I) - Sensitivity of LASSO to a single observation

In this simulation study, we show that the number of variables selected by the LASSO is very sensitive to the deletion of a single observation. Table 1 shows that the LASSO based on the full sample (n = 20) selects variables 1, 2, 3, and 4, which are the true nonzero variables. However, the LASSO based on the 19 observations remaining after deleting the ith (i = 1, 2, . . . , 20) observation can give quite different selection results. For example, if we delete the 2nd observation, the LASSO selects 10 variables, and it selects just one variable if we delete the 16th observation.
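Simulation (I) can be sketched with a plain coordinate-descent LASSO. The solver, seed, and λ value below are ours (the paper does not report its tuning parameter or draws), so the selection counts will not reproduce Table 1 exactly; the sketch only shows the delete-one loop:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                               # full residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                # partial residual excluding j
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(0)                 # arbitrary seed
n, p = 20, 100
beta = np.zeros(p)
beta[:4] = 5.0
X = rng.normal(5.0, 1.0, (n, p))               # Step 1
y = X @ beta + rng.normal(0.0, 0.1, n)         # Steps 2-3

lam = 50.0                                     # illustrative tuning parameter
sel_full = np.count_nonzero(np.abs(lasso_cd(X, y, lam)) > 1e-8)
n_sel = [np.count_nonzero(np.abs(lasso_cd(np.delete(X, i, 0),
                                          np.delete(y, i, 0), lam)) > 1e-8)
         for i in range(n)]
```

Comparing `sel_full` with the entries of `n_sel` mimics reading down the last column of Table 1.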

• Simulation (II) - Sensitivity of LASSO to a single outlier (λ : given)

Here, we examine the variable selection performance of the LASSO in the outlier model (a model with one outlying observation) when the tuning parameter is fixed so that the LASSO selects 4 variables, i.e., the true number of variables. We considered three outlier models, in which the outlier is obtained by replacing $y_1$ with $y_1 + c$, c = 10, 30, 50. Table 2 shows the average proportion of correctly selected variables among the 4 variables selected by the LASSO over 100 replications. It also shows that the proportion of selecting the true model is about 0.5 when we delete the 1st observation (the outlier), whereas the proportion is less than 0.3 either when we use the full data or when we delete an observation other than the outlier.

• Simulation (III) - Sensitivity of LASSO to a single outlier (λ : estimated)

Finally, we examine the variable selection performance of the LASSO in the outlier model when the tuning parameter is estimated by cross-validation. To do this, we compute the average proportion of correctly selected variables by the LASSO over 100 replications; if fewer than four variables are selected, we treat 4 variables as selected by the LASSO. Table 3 shows the average proportion of correctly selected variables for the three outlier models, in which the outlier is obtained by replacing $y_1$ with $y_1 + c$, c = 10, 30, 50. Table 3 indicates that the proportion of selecting the true model is above 0.5 when we delete the 1st observation (the outlier), but the proportion is less than 0.5 either when we use the full data or when we delete an observation other than the outlier.
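A single replication of Simulation (III) can be sketched with scikit-learn's cross-validated LASSO. The seed and c = 30 are illustrative, scikit-learn's objective uses a 1/(2n) scaling rather than the paper's 1/2, and averaging over 100 replications is omitted, so the numbers will not match Table 3:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)                 # arbitrary seed
n, p = 20, 100
beta = np.zeros(p)
beta[:4] = 5.0
X = rng.normal(5.0, 1.0, (n, p))
y = X @ beta + rng.normal(0.0, 0.1, n)
y[0] += 30.0                                   # outlier model: y_1 + c with c = 30

def prop_correct(X, y):
    # proportion of the 4 true variables among those selected at a CV-chosen lam
    sel = set(np.flatnonzero(LassoCV(cv=5, max_iter=50000).fit(X, y).coef_))
    return len(sel & {0, 1, 2, 3}) / 4.0

with_outlier = prop_correct(X, y)              # full data
without_outlier = prop_correct(X[1:], y[1:])   # 1st observation (outlier) deleted
```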

Remark 2

Although Proposition 1 gives the formula for multiple-case deletion, the simulation study was done for single-case deletion only. Because multiple-case deletion raises additional issues (the swamping and masking effects), further simulation studies for multiple-case deletion would be helpful.

4. Example

As an illustrative example of the influence of observations on the variable selection in LASSO, we use the brain aging data of Lu et al. (2004). This data set contains measurements of p = 403 genes on n = 30 human brain samples, and the response is the age of each subject. We fit the LASSO based on the original sample (n = 30), and again based on the n = 29 observations remaining after deleting the ith (i = 1, . . . , 30) observation, to see the influence of the ith observation on the variable selection by the LASSO.

Table 4 shows that the LASSO selects 25 of the 403 variables. When one observation is deleted, most cases still result in selecting around 25 variables, the exceptions being the deletion of the 5th, 8th, 9th, and 19th observations. We may conclude that those observations are quite influential as far as the number of selected variables is concerned.
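Beyond the raw counts, the selected sets can be compared directly. Using the Table 4 rows for the full data and for deleting the 8th observation (the Jaccard similarity is our illustrative summary, not a measure used in the paper):

```python
# Gene indices selected by the LASSO, taken from Table 4
full = {36, 60, 73, 103, 123, 127, 140, 160, 180, 182, 195, 217, 238, 243, 244,
        262, 268, 297, 318, 336, 346, 361, 362, 363, 389}
drop8 = {57, 59, 140, 141, 148, 239, 298, 305, 371}

# Jaccard similarity of the two selected sets: only gene 140 is shared,
# so deleting observation 8 changes the selection almost entirely
jaccard = len(full & drop8) / len(full | drop8)
```

Here `jaccard` is 1/33, i.e., about 0.03, confirming that observation 8 is influential not only on the number of selected variables but on their identity.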

5. Concluding remarks

One or a few observations can be very influential on estimators in the “large n, small p” case, and this phenomenon becomes more serious in the “small n, large p” case. In this paper, we investigated the influence of observations on the LASSO estimates and on the variables selected by the LASSO. We also derived an analytic expression for the influence of the ith observation on the LASSO estimates in simple linear regression. Simulation results show that the influence of an outlier is more serious in the high-dimensional case than in the low-dimensional case.

For further study, it would be worthwhile to identify the basic building blocks that affect variable selection results. Despite the difficulties involved, it is also necessary to modify the LASSO to a robust LASSO that is not sensitive to outliers.

Acknowledgments

This work was supported by a 2-year Research Grant of Pusan National University.

Figures
Fig. 1. β̂L(λ) and β̂L(20)(λ) in terms of λ.
TABLES

### Table 1

Selected variables by the LASSO with the full data set (n = 20) and 19 observations after deleting one observation, respectively, based on the artificial data

| Deleted observation | Selected variables | Number of selected variables |
| --- | --- | --- |
| Original data | 1 2 3 4 | 4 |
| 1 | 1 2 3 4 52 96 | 6 |
| 2 | 1 2 3 4 10 22 42 48 68 99 | 10 |
| 3 | 1 2 3 4 48 62 | 6 |
| 4 | 1 2 3 4 | 4 |
| 5 | 1 2 3 4 | 4 |
| 6 | 1 2 3 4 76 | 5 |
| 7 | 1 2 3 4 | 4 |
| 8 | 1 2 3 4 10 | 5 |
| 9 | 1 2 3 4 | 4 |
| 10 | 1 2 3 4 10 42 | 6 |
| 11 | 1 2 3 4 48 | 5 |
| 12 | 1 2 3 4 48 | 5 |
| 13 | 1 2 3 4 36 48 | 6 |
| 14 | 1 2 3 4 10 46 48 66 96 | 9 |
| 15 | 1 2 3 4 36 | 5 |
| 16 | 4 | 1 |
| 17 | 1 2 3 4 | 4 |
| 18 | 1 2 3 4 36 48 62 76 | 8 |
| 19 | 1 2 3 4 48 | 5 |
| 20 | 1 2 3 4 | 4 |

### Table 2

The average proportion of the number of correctly selected variables among 4 variables selected by the LASSO out of 100 replications

| Deleted observation | $y_1 + 10$ | $y_1 + 30$ | $y_1 + 50$ |
| --- | --- | --- | --- |
| None | 0.294 | 0.186 | 0.120 |
| 1 | 0.512 | 0.492 | 0.441 |
| 2 | 0.296 | 0.168 | 0.112 |
| 3 | 0.284 | 0.155 | 0.109 |
| 4 | 0.272 | 0.173 | 0.108 |
| 5 | 0.283 | 0.170 | 0.104 |
| 6 | 0.304 | 0.174 | 0.097 |
| 7 | 0.274 | 0.171 | 0.104 |
| 8 | 0.283 | 0.177 | 0.109 |
| 9 | 0.292 | 0.157 | 0.109 |
| 10 | 0.295 | 0.171 | 0.102 |
| 11 | 0.280 | 0.148 | 0.102 |
| 12 | 0.299 | 0.170 | 0.095 |
| 13 | 0.274 | 0.172 | 0.104 |
| 14 | 0.283 | 0.165 | 0.104 |
| 15 | 0.290 | 0.181 | 0.093 |
| 16 | 0.296 | 0.189 | 0.100 |
| 17 | 0.281 | 0.160 | 0.096 |
| 18 | 0.287 | 0.172 | 0.104 |
| 19 | 0.290 | 0.171 | 0.093 |
| 20 | 0.291 | 0.185 | 0.110 |

Three outlier models with y1 + c, c = 10, 30, 50 are considered.

### Table 3

The average proportion of the number of correctly selected variables by the LASSO out of 100 replications when the tuning parameter is estimated by the cross-validation

| Deleted observation | $y_1 + 10$ | $y_1 + 30$ | $y_1 + 50$ |
| --- | --- | --- | --- |
| None | 0.490 | 0.350 | 0.220 |
| 1 | 0.550 | 0.505 | 0.530 |
| 2 | 0.480 | 0.280 | 0.210 |
| 3 | 0.475 | 0.330 | 0.200 |
| 4 | 0.465 | 0.330 | 0.225 |
| 5 | 0.450 | 0.310 | 0.230 |
| 6 | 0.480 | 0.295 | 0.215 |
| 7 | 0.480 | 0.325 | 0.220 |
| 8 | 0.465 | 0.320 | 0.200 |
| 9 | 0.455 | 0.325 | 0.190 |
| 10 | 0.485 | 0.300 | 0.225 |
| 11 | 0.470 | 0.295 | 0.215 |
| 12 | 0.465 | 0.300 | 0.235 |
| 13 | 0.450 | 0.300 | 0.210 |
| 14 | 0.475 | 0.310 | 0.195 |
| 15 | 0.450 | 0.275 | 0.240 |
| 16 | 0.490 | 0.315 | 0.190 |
| 17 | 0.480 | 0.305 | 0.235 |
| 18 | 0.460 | 0.310 | 0.210 |
| 19 | 0.445 | 0.305 | 0.235 |
| 20 | 0.480 | 0.305 | 0.215 |

If fewer than four variables are selected, we treat 4 variables as selected by the LASSO.

Three outlier models with y1 + c, c = 10, 30, 50 are considered.

### Table 4

Selected variables by the LASSO fit in the brain aging data, based on the original sample (n = 30) and on the n = 29 observations after deleting the ith observation

| Deleted observation | Selected variables | Number of selected variables |
| --- | --- | --- |
| Original data | 36 60 73 103 123 127 140 160 180 182 195 217 238 243 244 262 268 297 318 336 346 361 362 363 389 | 25 |
| 1 | 60 71 73 83 103 123 127 134 140 148 160 180 182 207 217 243 262 268 297 308 318 322 325 336 339 361 389 | 27 |
| 2 | 36 60 73 103 123 127 140 160 180 182 195 216 217 243 244 262 268 297 318 322 336 346 362 363 376 389 400 | 27 |
| 3 | 36 60 73 103 123 127 140 160 180 182 195 227 243 244 262 268 283 297 315 318 336 346 361 362 363 389 | 26 |
| 4 | 36 73 82 83 103 123 124 127 140 160 180 182 195 207 239 244 262 268 297 336 361 363 369 374 389 | 25 |
| 5 | 57 59 71 140 141 148 239 298 305 371 | 10 |
| 6 | 17 36 73 103 123 124 127 140 160 182 238 243 244 262 268 297 318 336 346 362 363 364 389 | 23 |
| 7 | 36 60 73 103 123 127 140 160 180 182 195 243 244 262 268 297 318 336 346 361 362 363 376 389 | 24 |
| 8 | 57 59 140 141 148 239 298 305 371 | 9 |
| 9 | 57 59 63 141 239 298 301 348 355 361 | 10 |
| 10 | 36 60 73 103 123 127 140 160 180 182 195 238 243 244 262 268 297 318 336 346 361 362 363 389 | 24 |
| 11 | 36 43 73 83 103 126 130 138 139 140 160 182 216 243 245 262 268 291 297 315 318 361 362 389 | 25 |
| 12 | 36 60 73 75 103 123 127 140 160 180 182 195 217 238 243 244 262 268 297 318 325 336 346 361 363 376 389 | 27 |
| 13 | 5 73 83 103 123 127 140 148 160 180 182 207 217 244 262 268 274 297 336 346 361 362 363 389 | 24 |
| 14 | 36 54 60 71 73 103 123 124 127 140 148 160 182 195 217 243 244 262 268 297 315 318 336 362 363 364 389 400 | 28 |
| 15 | 36 54 60 73 103 123 127 140 160 174 180 182 195 217 243 244 262 268 297 318 336 346 362 363 389 | 25 |
| 16 | 36 73 83 103 123 127 160 166 182 238 244 262 297 336 346 362 363 389 | 18 |
| 17 | 36 60 73 103 123 127 140 160 180 182 195 217 243 244 262 268 297 318 336 346 361 362 363 389 | 24 |
| 18 | 17 36 73 75 103 123 124 127 140 148 160 174 182 217 238 244 262 268 336 347 361 362 363 389 400 | 25 |
| 19 | 49 59 105 123 140 148 166 225 239 301 305 364 377 | 13 |
| 20 | 17 36 60 73 82 103 123 127 140 160 180 182 195 212 217 239 243 244 262 268 281 283 322 336 346 362 363 369 389 | 29 |
| 21 | 36 60 73 103 123 127 140 160 180 182 195 217 238 243 244 262 268 297 318 336 346 361 362 363 376 389 | 26 |
| 22 | 36 60 73 103 123 127 140 160 180 182 195 238 243 244 262 268 297 318 336 346 362 363 389 | 23 |
| 23 | 5 73 123 124 127 140 148 180 182 238 243 244 262 274 297 308 318 336 346 361 362 363 389 | 23 |
| 24 | 36 60 73 103 123 127 140 160 180 182 195 216 217 243 244 262 268 297 318 336 346 361 362 363 389 | 25 |
| 25 | 5 36 54 73 75 82 103 123 127 140 160 182 216 217 243 244 262 268 297 332 334 336 339 346 355 362 363 389 | 28 |
| 26 | 36 73 103 123 124 127 140 160 167 182 185 217 238 243 244 262 268 270 297 308 336 346 364 374 389 396 | 26 |
| 27 | 36 73 103 123 127 140 151 160 166 182 238 239 244 262 268 286 297 315 336 346 361 362 363 364 389 | 25 |
| 28 | 17 36 73 111 123 127 148 160 213 238 244 262 268 297 318 326 334 336 362 363 389 | 21 |
| 29 | 36 60 73 103 123 127 140 160 180 182 207 243 244 262 268 297 315 318 336 361 362 363 369 389 | 24 |
| 30 | 36 73 103 123 127 140 160 180 182 217 243 244 262 268 297 318 325 336 363 389 | 20 |

References
1. Bae, W, Noh, S, and Kim, C (2017). Case influence diagnostics for the significance of the linear regression model. Communications for Statistical Applications and Methods. 24, 155-162.
2. Box, GEP, and Cox, DR (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B. 26, 211-252.
3. Cook, RD (1977). Detection of influential observation in linear regression. Technometrics. 19, 15-18.
4. Hoerl, AE, and Kennard, RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 12, 55-67.
5. Jang, DH, and Anderson-Cook, CM (2017). Influence plots for LASSO. Quality and Reliability Engineering International. 33, 1317-1326.
6. Kim, C, Lee, J, Yang, H, and Bae, W (2015). Case influence diagnostics in the lasso regression. Journal of the Korean Statistical Society. 44, 271-279.
7. Kim, J, and Lee, S (2017). A convenient approach for penalty parameter selection in robust lasso regression. Communications for Statistical Applications and Methods. 24, 651-662.
8. Lu, T, Pan, Y, Kao, SY, Kohane, I, and Chan, J (2004). Gene regulation and DNA damage in the ageing human brain. Nature. 429, 883-891.
9. Tibshirani, R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 58, 267-288.
10. Zhao, J, Leng, C, Li, L, and Wang, H (2013). High-dimensional influence measure. The Annals of Statistics. 41, 2639-2667.