TEXT SIZE

CrossRef (0)
A convenient approach for penalty parameter selection in robust lasso regression

Jongyoung Kima, and Seokho Lee1,a

aDepartment of Statistics, Hankuk University of Foreign Studies, Korea
Correspondence to: 1Corresponding author: Department of Statistics, Hankuk University of Foreign Studies, 50 Oedae-ro 54beon-gil, Mohyeon-myeon, Cheoin-gu, Yongin 17035, Korea. E-mail: lees@hufs.ac.kr
Received July 21, 2017; Revised September 20, 2017; Accepted October 14, 2017.
Abstract

We propose an alternative procedure to select penalty parameter in L1 penalized robust regression. This procedure is based on marginalization of prior distribution over the penalty parameter. Thus, resulting objective function does not include the penalty parameter due to marginalizing it out. In addition, its estimating algorithm automatically chooses a penalty parameter using the previous estimate of regression coefficients. The proposed approach bypasses cross validation as well as saves computing time. Variable-wise penalization also performs best in prediction and variable selection perspectives. Numerical studies using simulation data demonstrate the performance of our proposals. The proposed methods are applied to Boston housing data. Through simulation study and real data application we demonstrate that our proposals are competitive to or much better than cross-validation in prediction, variable selection, and computing time perspectives.

Keywords : adaptive lasso, cross validation, lasso, robust regression, variable selection
1. Introduction

Regularized regression is popularly used to incorporate various assumptions that the model should require (Bishop, 2006; Hastie et al., 2001; Murphy, 2012). Type of regularization depends on a specific model assumption. For example, in functional regression, the coefficient parameter is assumed to be a smooth function. In sparse regression, some coefficient parameters corresponding to insignificant variables are expected to be zero. To incorporate smoothness and sparsity on parameter estimate, the regularization technique is often used. Such regularization is achieved under a penalized loss minimization framework. Consider a regression model $yi=β0+xiTβ+ɛi$ with a training data = {(xi, yi)|xi ∈ ℝp, yi ∈ ℝ1, i = 1, 2, …, n} and error variance var(εi) = σ2. The objective function to be minimized is a penalized loss:

$ℓ(β0,β)=∑i=1nρ(ri)+P(β;λ),$

where $ri=(yi-β0-xiTβ)/σ$. Squared loss function ρ(u) = u2 is used in regression. In presence of outliers in training data, robust loss functions, Huber loss or bisquare loss for example, are often used to circumvent harmful effects from outliers. Penalty function P(· ; ·) is chosen according to the purpose of regularization. Roughness penalty is used for a smooth function estimate and sparsity-inducing penalty for sparse estimate. Regularization and goodness-of-fit are balanced at an optimal penalty parameter λ. The optimal λ is chosen for the best prediction. In absence of test data, cross validation (CV) is commonly used for test mean squared errors (MSE) estimation in regression. CV is a generic model selection criterion which can be applied to many predictive models where the response variable is based on even no concrete probabilistic ground. Even with its versatility, CV suffers from heavy computational burden.

In this research, we propose an alternative approach to CV in robust regression with L1 penalization. This approach is based on marginalizing prior distribution over the penalty parameter. Thus, resulting objective function does not include the penalty parameter due to marginalizing it out. But its estimating equation automatically sets a tentative penalty parameter, and we can use it as the penalty parameter in the penalized regression estimating procedure. This idea is straightforwardly generalized to variable-wise penalization, which is often prohibitive through CV.

This paper is organized as follows. In Section 2, we introduce robust lasso regression and provide Bayesian approach to marginalizing penalty parameter and its estimating algorithm. Variable-wise penalization is introduced in the same section. The performance of our proposals are demonstrated through simulation studies and Boston housing data in Section 3. The paper ends with some remarks in Section 4.

2. Methodology

### 2.1. Robust lasso regression

Consider a linear model $yi=β0+xiTβ+ɛi$ for i = 1, …, n. In the presence of outlying observations among the training dataset, robust loss is used to reduce the outlier effect in regression (Maronna et al., 2006). When we impose some regularization on the coefficient β ∈ ℝp, we add a penalty function on β to the empirical loss and minimize the penalized empirical loss (1.1) to find a robust coefficient estimate. Well-known robust loss functions are Huber loss,

$ρH(u∣k)={u2,if ∣u∣ ≤k,2k∣u∣-k2,if ∣u∣ >k,$

and Tukey’s bisquare loss,

$ρB(u∣k)={1-{1-(uk)2}3,if ∣u∣ ≤k,1,if ∣u∣ >k.$

Both loss functions include the additional parameter k, which regulates robustness and efficiency of resulting coefficient estimate. Throughout this study, we use k = 1.345 for Huber loss and k = 4.685 for bisquare loss for 95% efficiency under normality (Maronna et al., 2006).

As in traditional linear regression, we can apply regularization techniques in robust regression (El Ghaoui and Lebret, 1997; Owen, 2006). For variable selection as well as prediction enhancement, L1 penalty $P(β;λ)=λ‖β‖1=λ∑j=1p∣βj∣$ can be imposed. It becomes lasso if we use the square loss ρ(u) = u2. Thus, robust lasso regression is done by minimizing the penalized objective function (1.1). A usual way for minimizing (1.1) is to use Newton-Raphson algorithm for M-estimates in robust statistics. To look at this more closely, we take a derivative of (1.1) with respect to β0 and β. Then, normal equations become

$∂ℓ∂β0=-∑i=1nwiriσ=0, ∂ℓ∂β=-∑i=1nwirixiσ+λ∂‖β‖1∂β=0$

with wi = ψ(ri)/ri and ψ = ρ′. The above normal equations are exactly same with those from the below weighted lasso problem:

$Q(β0,β)=∑i=1nwiri2+λ∑j=1p∣βj∣.$

Thus, robust lasso regression (1.1) with robust loss and L1 penalty can be solved by iteratively optimizing weighted lasso (2.2), where weight wi is updated at every iteration step. In robust linear regression, scale parameter σ should be estimated in robust fashion. A common way to obtain robust scale is the normalized median absolute deviates (MADN) of residuals from robust regression (Maronna et al., 2006). MADN is defined as MADN(x) = median(|x – median(x)|)/0.675. In this study, we initially fit L1 regression estimate (least absolute deviation (LAD) estimate; Koenker, 2005). Finally, we obtain σ estimate as MADN of residuals from LAD fit.

To choose an optimal regularization parameter λ, CV is a popular choice. In CV procedure, training data is split into several exclusive pieces. Each piece works as a validation set and remaining pieces are used to fit the model. The fitted model is evaluated on the validation set by MSE. This step is cycled until all pieces serve as a validation set. After finishing the whole cycle, CV score is defined as the averaged MSEs and the regularization parameter λ is chosen to minimize CV score. In the presence of outliers in the training data, MSE based CV score is not reliable because of the possible presence of outliers in the validation set. Thus, robust loss for errors is used in CV score computation, instead of squared errors in the traditional MSE. Generalized cross validation (GCV) is an convenient selection criterion which does not require an onerous splitting-and-fitting procedure as in CV. After fitting the model to whole training data, GCV is obtained using hat matrix Hλ where ŷ = Hλy. However, GCV is not available in robust regression because the regression coefficient estimate is not given as a linear combination of response variable in most robust regression. Information criterion, including Bayesian information criterion (BIC), is a frequently used for model selection. BIC as well as most other information criterion depend on a data model because data distribution specifies the likelihood as a part of the criterion. In robust regression it is difficult to assume data distribution due to the existence of outliers.

### 2.2. Bayesian approach by marginalizing regularization parameter

In this section, we introduce a convenient way to choose a penalty parameter under a Bayesian framework. The basic idea is to use marginal distribution of β by marginalizing out the penalty parameter. Buntine and Weigend (1991) introduce this idea first in penalized regression and logistic regression problems. Here, we use the same idea in a robust lasso regression. Penalty term P(β; λ) in (1.1) can be regarded as the negative logarithm of prior distribution for β. For L1 penalization, the prior over β is a joint distribution of independent Laplace random variables

$f(β∣λ)=(λ2)pexp{-λ‖β‖1}=∏j=1pλ2exp{-λ∣βj∣}.$

The prior of β depends on λ as a scale parameter. To remove the dependency on λ, Buntine and Weigend (1991) impose an improper Jeffrey prior on the hyperparameter λ and marginalize it out. Using Jeffrey prior for λ, f (λ) ∝ 1/λ, the marginal prior on β becomes

$f(β)=∫0∞f(β∣λ)f(λ)dλ=∫0∞λp-12pexp{λ‖β‖1}dλ=Γ(p)2p‖β‖1p.$

By replacing the original penalty induced from – log f (β|λ) in (1.1) by a new penalty from – log f (β), we can obtain a penalized loss ℓ̃:

$ℓ˜(β0,β)=∏i=1nρ(ri)+plog‖β‖1.$

Equation (2.3) is a new objective function without λ. For β estimation, normal equations from (2.3) become

$∂ℓ˜∂β0=-∑i=1nwiriσ=0, ∂ℓ˜∂β=-∑i=1nwirixiσ+p‖β‖1∂‖β‖1∂β=0.$

The above normal equation cannot be solved analytically. Note that, by comparing (2.4) with (2.1) from the original objective function, the penalty parameter λ in (2.1) is replaced by the term p/||β||1 in (2.4). Thus, Buntine and Weigend (1991) suggest an iterative fitting procedure for a penalized least square problems, where (β0, β̂) is obtained as a weighted penalized least squares solution with a penalty parameter $λ=p/‖βo‖1=p/∑j=1p∣βjo∣$ using the previous solution $βo=(β1o,…,βpo)T$. We can use the same idea in robust lasso problem by solving iteratively (2.2). Updating formula for (β0, β) in (2.2) are obtained as a weighted lasso problem using weights $wi=ψ(rio)/rio$ and penalty $λ=p/∑j=1p∣βjo∣$, where $rio=yi-β^0o-xiTβo$.

Through some simulation studies, we observed that this approach produces reliable performance in prediction but is not satisfactory in variable selection. To enhance variable selection performance, we consider a different penalty function for β. Instead of the common penalty parameter λ for all βj, we use separate penalty parameters for each variable, in the similar to adaptive lasso (Zou, 2006). Thus, L1 penalty function $λ∑j=1p∣βj∣$ in (1.1) is replaced by $P(β;λ)=∑j=1pλj∣βj∣$ with λ = (λ1, …, λp)T. We call the former “robust lasso” and the latter “robust adaptive lasso.” This change causes computational challenge in CV because p penalty parameters should be chosen by p-dimensional grid search. This is infeasible in practice even with a moderate size of variables. However, the same Bayesian approach can be easily implemented. With variable-wise penalty parameters, the objective function becomes

$ℓ(β0,β)=∑i=1nρ(ri)+∑j=1pλj∣βj∣.$

The coefficient parameters βj ( j = 1, 2, …, p) are assumed to follow Laplace distribution p(βj|λj) = (λj/2) exp(–λj|βj|) conditional on λj. Similar to robust lasso in above, marginal distribution of βj with Jeffrey prior p(λj) ∝ 1/λj is obtained as

$f(βj)=∫0∞f(βj∣λj)f(λj)dλj=∫0∞12e-λj∣βj∣dλj=12∣βj∣.$

Penalty function on β becomes $-log f(β)=-log{∏j=1pf(βj)}=∑j=1plog∣βj∣+constant$. Thus, after marginalizing λj out, objective function (2.5) is modified as

$ℓ˜(β0,β)=∑i=1nρ(ri)+∑j=1plog∣βj∣.$

Normal equations of (2.5) are

$∂ℓ˜∂β0=-∑i=1nwiriσ=0, ∂ℓ˜∂βj=∑i=1nwirixijσ+1∣βj∣∂∣βj∣∂βj=0,$

for j = 1, 2, …, p. This normal equation is exactly same with that from ℓ in (2.5) if λj is set to be 1/|βj|. We can still here employ an iterative scheme of estimation. We summarize the whole procedure into the below conceptual algorithm, where we use coordinate descent algorithm to estimate β.

Robust lasso and robust adaptive lasso by marginalizing penalty parameters

 (Initialization) Obtain initial parameters $β^0o$, β̂o from LAD or LAD-ridge. Set σ̂ = MADN(ri) where ri are residuals from LAD. Repeat until convergence is met. Set $ri=(yi-β^0o-xiTβ^o)/σ^$ and wi = ψ(ri)/ri for i = 1, …, n. For robust lasso, set $λ=∣I∣/∑j∈I∣β^jo∣$ where ℐ is the index set of nonzero elements in β̂o and |ℐ| is the number of nonzeros in β̂o. For robust adaptive lasso, set $λj=1/∣β^jo∣$. If $β^jo=0$, then λj = ∞. Compute $y¯w=∑i=1nwiyi/∑i=1nwi$ and $x¯jw=∑i=1nwixij/∑i=1nwi$ for all j = 1, 2, …, p. Then, syet ỹi = yi − ȳw and $x˜ij=xij-x¯jw$. For j = 1, 2, …, p, compute yi(−j) and update β̂j by $yi(-j)=y˜i-∑l≠jx˜ilβ^lo, β^j=soft(∑i=1nwix˜ijyi(-j)∑i=1nwix˜ij2,σ^2λj*∑i=1nwix˜ij2),$ where soft(x, t) = sign(x)(|x| − t)+ and u+ = max(u, 0). Here, we set $λj*=λ$ for robust lasso and $λj*=λj$ for robust adaptive lasso. Update the intercept β̂0 by $β^0=y¯w-∑j=1pw¯jwβ^j$. If convergence is not achieved, update $β^0o←β^0$ and β̂o ← β̂.

### 2.3. Connection to existing methods

Robust adaptive lasso described in the previous section allows variable-wise penalty parameters for each coefficient. This reminds adaptive lasso where penalty term is given as $λ∑j=1pηj∣βj∣$ (Zou, 2006). Zou (2006) suggests ηj = 1/|β̂j|γ, where β̂j is a consistent estimator, least squares solution suggested, and γ is another tuning parameter. If λj = ληj and γ = 1, then λj = 1/|β̂j| which seems similar to the proposed approach. However, our approach keeps changing β̂o during iterations, while adaptive lasso from Zou (2006) fixes it as the least squares solution or ridge solution if n < p. Zou (2006) suggests to use CV for the choice of (λ, γ), which is computationally burdensome due to a 2-dimensional grid search. Automatic relevance determination (ARD) (MacKay, 1995; Tipping 2001) is another approach that considers variable-wse penalization too. Even though ARD is originally developed for sparse kernel regression, it is easily modified to a regular regression problem. ARD assumes Gaussianity for β, not Laplacianity in adaptive lasso, and achieves a variable selection by reducing down prior variance for negligible variables. Its estimation can be cast into penalized loss framework. In ARD, however, updating formula for λj depends all coefficient parameter estimates in the previous iteration and, thus, cannot be separable variable-wise. This causes relatively slow convergence and cannot exploit the simple thresholding scheme.

Our approach is closely related to a traditional Bayesian approach. If ρ(u) = u2, i.e., non-robust regression, lasso can be formulated in Bayesian framework using Gaussian scale mixture as $βj∣τj2,λ~N(0,τj2)$ and $τj2∣λ~gamma(1-λ2/2)$ (Park and Casella, 2008;West, 1987). Using normality on the response yi, full Bayesian approach allows Bayesian inference for lasso because the full posterior f is available. However, typical Markov chain Monte Carlo (MCMC) approximation does not often produce an exact zero solution. Contrast to MCMC, expectation-maximization (EM) algorithm allows sparse estimation under Bayesian formulation (Figueiredo, 2003). For either MCMC or EM, hyperparameter λ should be chosen by empirical Bayes or evidence procedure, where λ is chosen by the maximizer of called marginal likelihood or evidence. In robust lasso, however, is not available because Huber or biweight loss functions are not derived from any distribution. Penalty function in robust adaptive lasso can also be formulated under a Bayesian structure by assuming $βj∣τj2,λ~N(0,τj2),τj2/λj~gamma(1,λj2/2)$, and λj|a, b ~ inverse-gamma(a, b). It is called hierarchical adaptive lasso (HAL; Lee et al., 2010). Under non-robust situation, EM algorithm is similar to the proposed approach in the sense that M-step produces the same β estimation and E-step gives $E(λj)=(a+1)/(b+∣β^jo∣)$. If a = b = 0, then the expected value of λj is exactly same with the proposed approach. However, HAL is not appropriate for robust adaptive lasso either because HAL also assumes normality on the response. Another interesting connection can be found in log penalized regression in Zou and Li (2008). Zou and Li (2008) derived logarithm penalty as a limiting version of Bridge penalty with q → 0 using local linear approximation. The difference is that our approach resorts to full iteration while Zou and Li (2008) suggests one-step iteration only.

3. Numerical Studies

### 3.1. Synthetic data

Artificial data is used to test our proposal in this section. Input variables xi j (i = 1, 2, …, n; j = 1, 2, …, p) are independently generated from N(0, 1). The intercept parameter β0 is set to zero, and the first 10 slope parameters are generated uniformly on the positive interval from 1 to 2 and remaining slopes are set to zero. i.e., βj ~ uniform(1, 2) for j = 1, 2, …, 10 and βj = 0 for j = 11, …, p. Response variables yi are constructed by $yi=xiTβ+ɛi$ for i = 1, …, n. Thus, the first 10 variables are important in prediction and the remaining p – 10 variables are not. To mimic contamination in training data, random error εi are generated from the contaminated normal distribution

$ɛi~(1-π)N(0,σ2)+πN(mσ2,σ2).$

Here, π is a contamination rate. If π = 0, there is no outlier. Outliers come from normal distribution with shifted mean by m-factor of error variance. The error variance is set to have a five times signal-to-noise ratio in a standard deviation scale. i.e., $σ=sd(xiTβ)×0.2$. We used glmnet() and cv.glmnet() functions in glmnet package for CV. For non-robust case (π = 0), the direct use of cv.glmnet() is sufficient and efficient. For a robust case, weight wi should be updated at every iteration. Thus, we devise a loop for CV and use glmnet() function for model fitting inside the loop. We conducted 5-fold CV over 100 grid points as suggested in glmnet package. BIC is also considered for comparison study (Chang et al., 2017; Lambert-Lacroix and Zwald, 2011) and the same grid set used in CV was also used in BIC.

The first simulation considers n > p situation. We increase sample size n = 100, 1,000, 10,000 with a fixed p = 20. Proportion of outliers is π = 0, 0.1, 0.2, 0.3 and m = 5 for outlier generation. Test data of size 10,000 is additionally generated without outliers and used to measure prediction error. Prediction error is calculated as root mean square of error (RMSE). Simulation is repeated 100 times and we report the average and standard error of RMSEs in Table 1. We name square, Huber, bisquare losses by “S”, “H”, “B”, respectively. “BIC” for BIC, “L” for robust lasso, and “A” for robust adaptive lasso. Thus, “B.A” means robust adaptive lasso with bisquare loss. From Table 1, in absence of outliers (π = 0), traditional lasso using square loss under CV is best in prediction. However, as π increases, square loss is outperformed by robust loss. Bisquare loss is generally better than Huber loss, which demonstrates the well-known fact that nonconvex loss is better than convex loss in robust regression. Overall, robust adaptive lasso performs best among other competitors. We observe that CV and our proposals are hardly distinguishable in standard error scale, as the sample size increases. We compute a false negative rate (FN) and false positive rate (FP), and report their average in Table 2. FN is defined as a ratio of a falsely claimed zero, while they are actually not zero; however, FP is defined as a ratio of falsely claimed nonzero, while they are actually zero. FN is almost zero for all methods, except for a small sample size. But we observe that FP seems quite high. Note that low FN and high FP indicates that almost all variables are selected from a variable selection, implying a low selection ability. Considering this, the BIC selection seems to perform very well in a large sample case. Table 3 presents average computing time for each method. Robust lasso and robust adaptive lasso are enormously faster than CV while their prediction and variable selection performance is comparable or better than CV. Our proposals are much faster than CV and BIC in penalized robust regression.

We conduct the second simulation where variable size is larger than sample size. Sample size is fixed as n = 100 and variable size are set to be p = 50, 100, 200, 400. In this simulation, outliers are generated from mean mσ2 with a factor $m=p$. Thus, outliers locate further from typical data as dimension increases. Prediction performance, variable selection performance, and computing time are summarized in Tables 3, 4, and 5, respectively. Table 4 shows that bisquare loss produces reliable results in presence of outliers. Note that Huber loss quickly deteriorates as contamination get larger, demonstrating that convex Huber loss suffers from severe outliers. In this simulation CV often produces best results, but adaptive lasso version of Bayesian approach is still comparable. Most of all, computing time in Table 6 demonstrates that the latter is more attractive than the former. In Table 5 we provides false rate of slope estimate by combining false negatives and false positives, instead of reporting separate false rates as in Table 2. Thus, false rate is the proportion of elements in β̂ that true zero is falsely claimed to zero and true nonzero falsely claimed zero. Table 5 demonstrates that robust adaptive lasso using bisquare outperforms CV in variable selection perspective, while they are comparable in prediction sense. Note that CV usually finds the best model which gives the best prediction, while robust adaptive lasso focuses on variable selection as well as a good fit to the data in the high-dimensional cases.

### 3.2. Real data example

We applied our methods to Boston housing data, available on http://lib.stat.cmu.edu/datasets/boston. The dataset contains medv (median value of owner-occupied homes in thousand dollars) as a response variable and additional 13 explanatory variables: crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq. ft.), indus (proportion of non-retail business acres per town), chas (Charles river dummy variable), nox (nitric oxides concentration, parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted distances to five Boston employment centers), rad (index of accessibility to radial highways), tax (full-valued property-tax rate per 10,000 dollars), ptratio (pupil-teacher ratio by town), black (1000 × (Bk – 0.63)2 where Bk is the proportion of blacks by town), and lstat (percent of lower status of the population). There are 506 observations in the dataset. We split the dataset into two parts: the first 300 observations were regarded as a training dataset and the remaining 206 observations were treated as a test dataset. To see the effect of outliers on the methods we considered, we randomly selected 30 observations from the training dataset and changed their response variables by values having large residuals (five times larger than residual standard deviation).

Tables 7 and 8 present the coefficient estimates and test RNSE from the fit of the training dataset without and with outliers, respectively. Table 7, in the case of absence of outliers, shows that all methods seem to produce the similar coefficient estimates and test RMSE. However, after outlier inclusion, square loss severely deteriorates, while Huber and biweight losses are robust against outlier. Interestingly, we observed that BIC is not stable for outlier in this example.

4. Conclusion and remarks

In this study, we propose an alternative way for CV in robust lasso and robust adaptive lasso using marginal prior on coefficients under Bayesian framework. Throughout simulation studies, we demonstrate that our proposals are competitive to or better than CV in prediction, variable selection, and computing time perspectives.

In this study we limit this idea only to robust regression; however, it can be further applied to various penalized predictive models. Our approach becomes especially valuable when CV is not appropriate. For example, in classification problem, training data may suffer from a mislabeling class label called label-noise (Lee et al., 2016). In such case, hold-out procedure like CV is not successful because the validation set also have a label-nose; in addition, a cross-validation score from a validation set having label-noise is not reliable for model selection. One can extend the same idea in this study to some complicated problems that include label-nose, where CV is not available. We leave this direction for future research.

TABLES

### Table 1

Simulation 1 - average of 100 test RMSEs and its standard error (in parenthesis)

nπRMSEs

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
1000.02.71672.73732.74442.74132.74572.75892.73522.77482.76962.80252.7723
(0.0250)(0.0280)(0.0230)(0.0260)(0.0280)(0.0240)(0.0230)(0.0270)(0.0280)(0.0250)(0.0250)

0.14.58795.34374.87522.86455.16002.90412.86702.76885.29522.78622.7486
(0.0540)(0.0580)(0.0720)(0.0290)(0.0560)(0.0270)(0.0270)(0.0280)(0.0450)(0.0260)(0.0260)

0.26.46166.81086.95283.48845.29033.57633.55012.78305.03192.79222.7468
(0.0820)(0.0660)(0.1150)(0.0490)(0.0600)(0.0470)(0.0500)(0.0300)(0.0680)(0.0280)(0.0290)

0.38.36578.54128.90196.80796.98487.54727.60946.15756.30426.93506.9094
(0.1160)(0.0970)(0.1570)(0.1810)(0.1890)(0.1690)(0.1700)(0.2590)(0.2500)(0.2670)(0.2650)

1,0000.02.54762.54702.53842.53492.53892.53882.53522.53442.53842.53872.5348
(0.0070)(0.0070)(0.0060)(0.0060)(0.0060)(0.0060)(0.0060)(0.0060)(0.0060)(0.0060)(0.0060)

0.13.59183.63373.61472.58582.62462.59082.58612.53692.57182.54002.5349
(0.0080)(0.0080)(0.0080)(0.0060)(0.0070)(0.0060)(0.0060)(0.0060)(0.0070)(0.0060)(0.0060)

0.25.47185.53185.49642.80592.89172.81462.80542.54012.58182.54242.5352
(0.0160)(0.0160)(0.0170)(0.0060)(0.0070)(0.0050)(0.0050)(0.0060)(0.0070)(0.0060)(0.0060)

0.37.58907.64137.61043.63604.18263.66813.65142.54292.70742.54472.5343
(0.0260)(0.0250)(0.0260)(0.0060)(0.0300)(0.0060)(0.0060)(0.0060)(0.0190)(0.0060)(0.0060)

10,0000.03.78733.67792.50222.50222.50332.50242.50212.50222.50322.50242.5020
(0.0050)(0.0050)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)

0.14.03984.01373.42802.53872.54332.53962.53902.50242.50542.50262.5021
(0.0030)(0.0030)(0.0030)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)

0.25.49095.47225.27312.72832.73962.73152.73052.50302.50702.50332.5026
(0.0040)(0.0040)(0.0050)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)

0.37.42677.42427.37773.45063.46733.45603.45422.50292.50892.50322.5021
(0.0070)(0.0070)(0.0080)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)(0.0020)

Best performer for each case is highlighted.

RMSE= root mean squared error; S = square; H= Huber; B = bisquare; CV= cross validation; BIC = Bayes information criterion; L = robust lasso; A= robust adaptive lasso.

### Table 2

Simulation 1 - average of false negative and false positive rates

nπFalse negative

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
1000.00.0000.0010.0000.0010.0010.0000.0010.0010.0010.0000.002
0.10.1740.7780.0040.0020.9060.0010.0030.0000.9590.0000.001
0.20.4570.9090.0030.0470.8240.0030.0360.0000.8070.0000.005
0.30.5440.9490.0360.1840.1970.0590.2940.1550.1840.0590.301

1,0000.00.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
0.10.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
0.20.0000.0050.0000.0000.0000.0000.0000.0000.0000.0000.000
0.30.0000.0160.0000.0000.1380.0000.0000.0000.0400.0000.000

10,0000.00.0160.0100.0000.0000.0000.0000.0000.0000.0000.0000.000
0.10.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
0.20.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
0.30.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000

nπFalse positive

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A

1000.00.6310.3640.9760.6300.3800.9660.4000.6420.4190.9540.390
0.10.4240.0460.9850.6160.0010.9520.4000.6480.0000.9350.307
0.20.2870.0130.9830.4960.0130.9350.4070.6270.0060.8830.183
0.30.2410.0110.9350.5870.5730.9000.4650.5820.5380.8520.379

1,0000.00.1970.1490.9910.6770.2400.9800.3200.6670.2530.9790.305
0.10.6290.2250.9960.6320.0200.9810.3390.6460.0040.9690.242
0.20.6490.1910.9930.6410.0170.9750.3670.6410.0000.9770.174
0.30.6260.1940.9900.6200.0380.9680.4320.6500.0020.9350.069

10,0000.00.0000.0000.9930.6060.1690.9930.2900.6040.1620.9930.275
0.10.0000.0000.9960.6170.0060.9910.3000.6090.0010.9910.219
0.20.0000.0000.9990.6040.0030.9930.3220.5980.0000.9920.151
0.30.0180.0160.9980.5730.0140.9890.3790.6180.0000.9760.045

S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust lasso; A = robust adaptive lasso.

### Table 3

Simulation 1 - average of 100 computing times (in seconds)

nπComputing time in seconds

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
1000.00.06400.14900.01754.46900.87430.03390.04766.75611.33910.04800.0559
0.10.06360.15020.01984.30140.83900.03270.04465.43191.05300.03490.0441
0.20.06500.15200.01874.61520.82930.04310.05554.26540.80060.03110.0360
0.30.06440.15320.02225.37851.11890.05600.06988.14091.45580.06170.0649

1,0000.00.08820.17480.03827.40411.66170.05700.11588.39731.88730.06620.1079
0.10.08630.17740.04227.09251.58690.06010.12057.89271.76250.06030.1130
0.20.08760.17940.03837.21561.60330.06660.12807.27071.64480.05970.1014
0.30.08450.17800.03767.33961.60760.08430.11846.75751.53030.07370.0873

10,0000.00.31100.46870.413998.242427.14240.59690.9334101.937427.93750.61200.9167
0.10.31350.48310.421493.183525.76400.60811.001095.623226.33300.67530.9977
0.20.31720.48870.411787.817124.23690.61950.926489.673524.73930.62080.8778
0.30.31670.49880.427981.463922.19430.69470.946580.768322.01250.52840.7187

S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust lasso; A = robust adaptive lasso.

### Table 4

Simulation 2 - average of 100 test RMSEs and its standard error (in parenthesis)

pπRMSEs

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
500.02.85722.90053.26082.89772.96043.26883.11083.06353.07403.50443.1815
(0.0250)(0.0280)(0.0290)(0.0280)(0.0310)(0.0300)(0.0280)(0.0330)(0.0340)(0.0370)(0.0320)

0.16.11766.175111.22813.15355.36294.56473.82343.04215.39623.44823.1631
(0.0720)(0.0540)(0.2730)(0.0340)(0.0310)(0.0950)(0.0560)(0.0350)(0.0380)(0.0400)(0.0320)

0.28.60248.468315.31064.17255.482013.46858.64593.01885.17443.32013.0847
(0.1410)(0.1230)(0.3640)(0.0670)(0.0400)(0.3700)(0.2630)(0.0290)(0.0570)(0.0340)(0.0300)

0.311.382011.280317.928010.368010.939118.796416.34896.34316.54759.37997.5900
(0.2100)(0.1990)(0.4310)(0.4600)(0.4930)(0.5100)(0.3870)(0.5120)(0.5030)(0.7030)(0.5120)

1000.05.33365.33885.33603.39723.50695.20863.62394.21704.33174.20893.4831
(0.0330)(0.0330)(0.0660)(0.0730)(0.0760)(0.0630)(0.0310)(0.1030)(0.1040)(0.0520)(0.0300)

0.17.02327.015941.78783.63065.396625.70047.79183.70665.43684.16593.5490
(0.0680)(0.0630)(0.9690)(0.0620)(0.0330)(0.5200)(0.2380)(0.0840)(0.0350)(0.0480)(0.0350)

0.210.751910.599255.52864.729810.462639.337421.23393.30095.32233.73443.3720
(0.1520)(0.1400)(1.3470)(0.0660)(1.5880)(0.7500)(0.4130)(0.0540)(0.0460)(0.0420)(0.0420)

0.315.082514.793258.554616.867682.217349.573528.19485.62705.61685.53685.4661
(0.2360)(0.2210)(1.6150)(1.0060)(5.0710)(1.0190)(0.5150)(0.7570)(0.5940)(0.6270)(0.5240)

2000.05.27925.29603.74083.31443.49333.75303.71573.93894.22463.73153.5408
(0.0340)(0.0330)(0.0330)(0.0430)(0.0510)(0.0330)(0.0350)(0.0840)(0.0970)(0.0340)(0.0330)

0.18.51218.465223.94953.921915.387322.811922.21433.51615.44373.69663.5413
(0.1320)(0.1270)(0.5490)(0.0540)(1.0690)(0.5020)(0.4800)(0.0560)(0.0320)(0.0340)(0.0430)

0.214.440514.126632.47185.424731.727631.472934.19793.50925.09113.62953.3470
(0.2890)(0.2620)(0.7400)(0.2220)(0.9290)(0.6900)(0.7320)(0.0460)(0.0570)(0.0420)(0.0500)

0.320.792920.605538.895330.372539.632238.589642.99843.75373.96343.66963.6053
(0.4330)(0.4680)(0.8680)(1.5450)(0.9350)(0.8480)(0.9140)(0.0530)(0.0620)(0.0460)(0.0650)

4000.04.03744.14943.51403.36263.57493.51713.30473.41493.63263.52103.2603
(0.0590)(0.0630)(0.0310)(0.0310)(0.0300)(0.0310)(0.0350)(0.0360)(0.0400)(0.0320)(0.0350)

0.111.619310.787324.35074.572525.061323.048028.28523.61274.98033.61443.1086
(0.2530)(0.1560)(0.4800)(0.0480)(0.4560)(0.3950)(0.5340)(0.0380)(0.0580)(0.0350)(0.0360)

0.220.478419.777934.844317.665335.493033.350238.83683.75034.24473.77973.0991
(0.4600)(0.4310)(0.6920)(1.4230)(0.7130)(0.6010)(0.6690)(0.0390)(0.0630)(0.0410)(0.0450)

0.329.330335.905443.730545.212444.363443.239949.80133.65274.06674.41793.5856
(0.5730)(1.1870)(0.8180)(0.8840)(0.8480)(0.7940)(0.8940)(0.0440)(0.0410)(0.0680)(0.0560)

Best performer for each case is highlighted.

RMSE= root mean squared error; S = square; H= Huber; B = bisquare; CV= cross validation; BIC = Bayes information criterion; L = robust lasso; A= robust adaptive lasso.

### Table 5

Simulation 2 - average of false rate

pπFalse rate

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
500.00.17990.09240.47820.19880.10200.46850.20690.23640.13540.46820.1908
0.10.39440.47860.49840.18040.49800.47220.22810.21450.49750.45160.1630
0.20.44210.49160.50090.19460.45750.49110.34790.19700.44820.40290.1081
0.30.46620.49520.49350.41350.42250.49460.45060.27150.25140.36400.2406

1000.00.48950.49500.46760.11870.09960.46790.18560.27080.27840.43180.1521
0.10.49420.49910.49960.13030.50000.49930.27380.19480.50000.38380.1322
0.20.48700.49800.49920.23380.48570.49830.44430.14210.47240.30370.0821
0.30.48970.49840.49890.45520.48130.49640.46770.16480.16930.18670.1616

2000.00.45410.47300.24270.08230.07330.24180.10060.19130.23790.23240.0831
0.10.49790.49980.48210.10530.48710.48670.38870.12330.49800.19970.0618
0.20.49180.49980.47670.30330.48940.48540.45980.10540.40160.14730.0464
0.30.49790.49990.48230.45010.47680.48620.47000.18350.12650.09630.1066

4000.00.10330.14330.10040.05870.11280.10030.02460.06700.12130.09980.0239
0.10.49040.49980.44290.18290.47730.46430.41620.08380.35840.08490.0193
0.20.49710.50080.44710.41140.47310.48500.44850.11530.15070.08200.0340
0.30.49790.50830.46970.47250.46990.49220.44990.14720.10240.21160.1229

Best performer is highlighted.

S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust lasso; A = robust adaptive lasso.

### Table 6

Simulation 2 - average of 100 computing times (in seconds)

pπComputing time in seconds

S.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
500.00.07810.16310.19687.22021.41340.30280.266811.97832.47020.31330.2881
0.10.08450.16440.219810.53441.61580.32130.29369.65722.00310.29070.2721
0.20.08690.16610.226113.60832.75330.32430.35306.68161.35540.26850.2193
0.30.08830.16910.221511.83412.40030.31690.334711.06932.42900.27400.2165

1000.00.06430.16010.999511.42492.44140.98790.998019.26964.24990.98820.9859
0.10.06910.16020.992813.03852.56560.98931.077317.60333.84570.98130.9150
0.20.07330.16380.988120.67594.47960.99971.080712.78852.70910.98150.7543
0.30.09270.18480.978119.25426.64220.97391.052812.88002.89420.88310.5017

2000.00.06060.17903.298324.86615.50373.27943.270033.09517.30423.27583.1974
0.10.06890.17943.213938.47488.31083.19833.380631.83696.95133.19782.7560
0.20.08690.19113.175347.974710.96283.19213.364927.66805.96343.15882.1737
0.30.11790.24883.158244.555910.31083.14953.349926.97925.87382.97051.3073

4000.00.08450.221313.032597.662622.346413.02799.2987103.979523.356413.02598.7507
0.10.12720.287813.1755165.531238.0858213.162813.5352113.437025.258513.10597.3319
0.20.15700.374513.0621172.957639.1926613.061213.4129115.946225.693312.76345.0736
0.30.19570.570212.9983157.659434.9155412.976913.3564120.272426.20459.51684.6150

S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust lasso; A = robust adaptive lasso.

### Table 7

Boston housing data - estimated coefficients and test RMSEs without outliers

VariablesS.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
(Intercept)−13.914−18.325−13.625−16.707−18.326−16.273−15.874−15.764−18.325−15.014−16.377
crim0.9550.0001.0730.0000.1010.3500.0000.0000.0000.1190.000
zn0.0110.0000.0150.0020.0000.0140.0120.0110.0000.0190.018
indus0.0060.0000.0190.0000.000−0.0150.000−0.0230.000−0.0370.000
chas0.6070.4130.5790.5000.5060.5250.0000.4680.4130.3620.000
nox−6.4530.000−6.9330.0000.0000.0000.0000.0000.0000.0000.000
rm9.1609.2539.1379.0459.2788.9919.0788.9089.2538.8419.060
age−0.046−0.035−0.048−0.042−0.039−0.053−0.053−0.050−0.035−0.057−0.059
dis−0.919−0.529−0.977−0.562−0.584−0.822−0.814−0.700−0.529−0.870−0.826
tax−0.013−0.010−0.015−0.011−0.011−0.014−0.014−0.013−0.010−0.015−0.015
ptratio−0.632−0.617−0.628−0.615−0.620−0.575−0.616−0.577−0.617−0.557−0.576
black0.0160.0110.0170.0100.0120.0130.0120.0110.0110.0120.012
lstat−0.110−0.114−0.113−0.107−0.114−0.101−0.084−0.070−0.114−0.060−0.041

Test RMSE15.8247.98417.4727.8498.1139.7617.8637.8087.9848.3948.042

RMSE = root mean squared error; S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust lasso; A = robust adaptive lasso.

### Table 8

Boston housing data - estimated coefficients and test RMSEs after outlier inclusion

VariablesS.CVS.BICS.LH.CVH.BICH.LH.AB.CVB.BICB.LB.A
(Intercept)13.19846.61161.568−15.95912.883−14.087−12.600−15.96023.007−14.195−14.963
crim0.0000.000−9.1690.000−0.6030.0700.0000.000−3.5290.0000.000
zn0.0000.0000.0270.0070.0000.0200.0200.0150.0000.0240.024
indus0.0000.000−0.481−0.020−0.278−0.0350.000−0.020−0.360−0.0450.000
chas0.0000.000−4.5430.3120.0000.3270.0000.357−0.3240.1470.000
nox0.0000.000−15.4810.0000.000−0.714−3.6010.0000.0000.000−1.502
rm7.8730.0008.7008.9759.3398.8778.9608.6789.8258.5028.728
age0.0000.0000.195−0.0420.000−0.051−0.050−0.0530.008−0.062−0.063
dis0.0000.0000.398−0.6100.000−0.892−0.921−0.7460.000−0.985−0.960
tax0.0000.0000.084−0.0070.036−0.010−0.010−0.0100.055−0.013−0.013
ptratio−0.6750.000−3.067−0.652−1.700−0.645−0.692−0.558−2.091−0.536−0.563
black0.0000.000−0.1010.010−0.0070.0120.0120.013−0.0380.0150.014
lstat−0.4920.000−1.028−0.129−0.722−0.117−0.102−0.085−0.699−0.062−0.044

Test RMSE23.18329.44198.6647.88538.7278.6088.4077.79842.8048.0388.249

RMSE = root mean squared error; S = square; H = Huber; B = bisquare; CV = cross validation; BIC = Bayes information criterion; L = robust lasso; A = robust adaptive lasso.

References
1. Bishop, CM (2006). Pattern Recognition and Machine Learning. New York: Springer
2. Buntine, WL, and Weigend, AS (1991). Bayesian back-propagation. Complex Systems. 5, 603-643.
3. Chang, L, Roberts, S, and Welsh, A (2017). Robust lasso regression using Tukey’s biweight criterion. Technometrics.from: https://dx.doi.org/10.1080/00401706.2017.1305299
4. El Ghaoui, L, and Lebret, H (1997). Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications. 18, 1035-1064.
5. Figueiredo, MAT (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 25, 1150-1159.
6. Hastie, T, Tibshirani, R, and Friedman, JH (2001). The Elements of Statistical Learning: Data Mining Inference, and Prediction. New York: Springer
7. Koenker, R (2005). Quantile Regression. Cambridge: Cambridge University Press
8. Lambert-Lacroix, S, and Zwald, L (2011). Robust regression through the Huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics. 5, 1015-1053.
9. Lee, A, Caron, F, Doucet, A, and Holmes, C (2010). A hierarchical Bayesian framework for constructing sparsity-inducing priors (Technical report). Oxford: University of Oxford
10. Lee, S, Shin, H, and Lee, SH (2016). Label-noise resistant logistic regression for functional data classification with an application to Alzheimer’s disease study. Biometrics. 72, 1325-1335.
11. MacKay, DJC (1995). Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems. 6, 469-505.
12. Maronna, RA, Martin, RD, and Yohai, VJ (2006). Robust Statistics: Theory and Methods. Chichester: Wiley
13. Murphy, KP (2012). Machine Learning: A Probabilistic Perspective. Cambridge: The MIT Press
14. Owen, AB (2006). A robust hybrid of lasso and ridge regression (Technical report). Stanford: Stanford University
15. Park, T, and Casella, G (2008). The Bayesian lasso. Journal of the American Statistical Association. 103, 681-686.
16. Tipping, ME (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 1, 211-244.
17. West, M (1987). On scale mixtures of normal distributions. Biometrika. 74, 646-648.
18. Zou, H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 101, 1418-1429.
19. Zou, H, and Li, R (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 36, 1509-1533.