Least absolute deviation estimator based consistent model selection in regression

K. S. Shende¹,ᵃ and D. N. Kashidᵃ

aDepartment of Statistics, Shivaji University, India
Correspondence to: 1Department of Statistics, Shivaji University, Kolhapur, India (MS). E-mail: shendeks.stat99@gmail.com
Received November 28, 2018; Revised February 20, 2019; Accepted March 20, 2019.
Abstract

We consider the problem of model selection in multiple linear regression with outliers and non-normal error distributions. In this article, a robust model selection criterion is proposed based on the least absolute deviation (LAD) estimation method, and the proposed criterion is shown to be consistent. We also suggest algorithms based on the proposed criterion that are suitable for a large number of predictors in the model. These algorithms select only the relevant predictor variables with probability one for large sample sizes. An exhaustive simulation study shows that the criterion performs well, and the proposed criterion is applied to a real data set to examine its applicability. The simulation results show the proficiency of the algorithms in the presence of outliers, non-normal error distributions, and multicollinearity.

Keywords : linear regression, model selection, consistency, robustness, sequential algorithm
1. Introduction

The primary goal of regression analysis is to develop a useful model that accurately predicts the response variable for given values of the predictors. Consider the following general multiple linear regression model

$y=Xβ+ɛ,$

where y is the n × 1 vector of observed values of the response variable, X is an n × k full rank matrix of (k − 1) predictor variables with ones in the first column, and β is the corresponding k × 1 vector of unknown regression coefficients. The ɛ is an n × 1 vector of independent errors, each having the same distribution function F.

While developing the model, it is necessary to determine the unknown regression coefficients by an appropriate method. The well-known ordinary least squares (OLS) estimator is obtained by minimizing the residual sum of squares. The OLS estimator is easy to compute and satisfies many desirable properties. Nevertheless, the OLS method is not resistant to inconvenient observations in the y space (known as outliers) or to departures from the normality assumption on the error in real data. The least absolute deviation (LAD) method furnishes a useful and plausible alternative, resistant estimator, with many applications in econometrics and other fields. The resistant LAD estimator is obtained by minimizing the sum of absolute residuals. Dielman (2005) presented a rich literature review on LAD regression. The LAD estimator has an asymptotic N(β, τ²(X′X)⁻¹) distribution, where τ = 1/{2 f (m)} and f (m) is the probability density of the error evaluated at the median m; τ²/n is the variance of the sample median of the error. It is assumed that F(0) = 1/2 and f (0) > 0. The LAD estimator is useful in the presence of outliers and under non-normal error distributions.
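The LAD fit minimizes the sum of absolute residuals, which can be posed as a linear program. A minimal sketch in Python (illustrative only, not the authors' code; the design matrix and coefficients below are made up):

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """LAD estimate: minimize sum |y - X b| via the standard LP formulation."""
    n, k = X.shape
    # variables: [b (free), u (n), v (n)]; residual r = u - v with u, v >= 0
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])        # X b + u - v = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([5.0, 2.0, 3.0])
y = X @ beta + rng.laplace(size=n)      # heavy-tailed errors
beta_hat = lad_fit(X, y)
```

The same fit can also be obtained with a median (0.5 quantile) regression routine; the LP form is shown here only to keep the sketch self-contained.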

Problems such as increased complexity, larger prediction error, and higher cost arise from the inclusion of irrelevant predictor variables in the regression model. Such problems can be handled by a decisive procedure known as model selection or variable selection. Model selection has recently attracted significant attention in statistical research, and the selection of a less complex model is essential. Model selection criteria are generally represented in the following form

$Lack of Fit+Model Complexity.$

Hence, model selection can be done by trading off the lack of fit against model complexity. Many model selection methods have been proposed in the literature to choose a parsimonious model in multiple linear regression; Rao et al. (2001) gave an extensive literature review on model selection. Most methods are based on OLS, such as Mallows's Cp (Mallows, 1973). For zero bias, the expected value of Cp is p; therefore, Mallows's Cp selects the model for which Cp is close to p. The Cp plot is a useful tool for graphically representing Mallows's Cp, and alternative graphical methods are available for selecting predictor variables; Siniksaran (2008) recently suggested an alternative plot with some advantages using a geometric approach. Gilmour (1995) modified Mallows's Cp because the expected value of Mallows's Cp of a model that includes all relevant predictor variables is not equal to p when the mean squared error (MSE) is used as an estimate of σ². Other methods, like the Akaike information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978), are also available in the literature. Yamashita et al. (2007) studied stepwise AIC as well as other stepwise methods, such as partial F, partial correlation, and semi-partial correlation, for variable selection in multiple linear regression, and showed certain advantages of stepwise AIC.

The above methods are based on OLS or the likelihood and are vulnerable to outliers. Researchers have proposed various robust variable selection methods to deal with outliers, such as robust AIC (RAIC) (Ronchetti, 1985), robust BIC (RBIC) (Machado, 1993), RCp (Ronchetti and Staudte, 1994), Cp(d) (Kim and Hwang, 2000), and Sp (Kashid and Kulkarni, 2002); Tharmaratnam and Claeskens (2013) compared AIC based on different robust estimators. The model selection criteria Cp, RCp, Cp(d), and AIC are inconsistent; that is, the probability of selecting only the relevant predictor variables is less than one even for large sample sizes. Methods like BIC and GIC-LR are consistent model selection methods that select only the relevant predictor variables with probability one for large sample sizes (Rao et al., 2001). The BIC and GIC-LR methods are based on the likelihood function and the OLS estimator respectively; however, these perform poorly in the presence of outliers or departures from the normality assumption. The BIC and GIC-LR methods existing in the literature are therefore consistent but not robust. To overcome this drawback, we propose a consistent and robust model selection criterion based on the LAD estimator.

The remainder of the article is organized as follows. In Section 2, we propose a new variable selection criterion and study its theoretical properties. In Section 3, the performance of the proposed criterion is studied through simulation and real data. The algorithms for model selection are explained in Section 4, with a simulation study and the body fat real dataset. The article ends with a discussion of the results in Section 5.

2. Proposed method

The model (1.1) can be rewritten as

$y=X1β1+X2β2+ɛ,$

where X and β are partitioned so that X1 is a matrix of (p − 1) predictor variables with ones in the first column, and β1 is a p × 1 vector of associated regression coefficients including the intercept. X2 is a matrix of the remaining (k − p) predictor variables, and β2 is the (k − p) × 1 vector of associated regression coefficients. Consider the test for the regression coefficients with the null hypothesis H0 : β2 = 0. Under the null hypothesis, the reduced model is

$y=X1β1+ɛ.$

Consider ŷf and ŷr are the predicted values of y based on full model and reduced model respectively. The predicted values are obtained using the LAD estimator of the respective models. We propose a criterion based on these fitted values of y and model complexity. It is defined as

$CR_p=\dfrac{\left|y-\hat{y}_r\right|'\mathbf{1}-\left|y-\hat{y}_f\right|'\mathbf{1}}{\frac{\tau}{2}\left(1+\frac{k-p}{n-k+p}\right)}+C_n(p).$

The first term Dp = [|y − ŷr|′1 − |y − ŷf|′1]/[(τ/2){1 + (k − p)/(n − k + p)}] represents the lack of fit and is non-negative; 1 is the n-dimensional column vector of ones, and τ is a scale parameter that can be replaced by a suitable estimator based on the full model. Dp is a likelihood ratio test statistic scaled by the quantity (1 + (k − p)/(n − k + p)); the scaled statistic is more accurate for moderate sample sizes than the unscaled likelihood ratio test statistic (Birkes and Dodge, 1993), and for n → ∞ the two are equivalent. Dp = 0 for the full model and is minimum among all possible subsets; therefore, a 'minimum Dp' rule would always select the full model and hence cannot select a parsimonious model that explains the data with few predictor variables and has better prediction ability. To obtain a consistent criterion, the model complexity Cn(p) is taken to be an increasing function of the model dimension (p) that may also depend on the sample size (n). The model dimension alone is commonly used as the complexity measure, but it does not yield a consistent criterion; we therefore use a function of both the sample size and the model dimension. A model with small complexity is preferred as long as the discrepancy measure (Dp) is also small. The CRp criterion selects the model with the smallest CRp value among all possible models. The established theoretical results for CRp are given below:
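Given the sums of absolute residuals of a candidate model and of the full model, the criterion defined above is a one-line computation. A sketch, assuming an estimate of τ and a penalty function Cn are supplied (the numeric inputs are hypothetical):

```python
import numpy as np

def cr_p(sar_reduced, sar_full, tau_hat, n, k, p, Cn):
    """CR_p = D_p + C_n(p), where D_p is the scaled LAD lack-of-fit statistic."""
    d_p = (sar_reduced - sar_full) / ((tau_hat / 2.0) * (1.0 + (k - p) / (n - k + p)))
    return d_p + Cn(p, n)

# toy illustration with hypothetical sums of absolute residuals
Cn = lambda p, n: p * np.log(n)          # log-type penalty, as in Table 1
val = cr_p(sar_reduced=120.0, sar_full=110.0, tau_hat=1.2, n=100, k=6, p=4, Cn=Cn)
```

Smaller values of `val` indicate a better trade-off between lack of fit and complexity.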

### Proposition 1

Under the null hypothesis H0, E(CRp) = (k − p) + Cn(p).

Proof

The proposed criterion is

$CR_p=\dfrac{\left|y-\hat{y}_r\right|'\mathbf{1}-\left|y-\hat{y}_f\right|'\mathbf{1}}{\frac{\tau}{2}\left(1+\frac{k-p}{n-k+p}\right)}+C_n(p).$

Under the null hypothesis, Dp approximately follows a χ² distribution with k − p degrees of freedom (Birkes and Dodge, 1993). The expected value of CRp is therefore

$E(CRp)=(k-p)+Cn(p).$

Hence, the proof.

Alternatively, for large n the proposed criterion can be written as

$CR_p^{alt}=\frac{2}{\tau}\left(\left|y-\hat{y}_r\right|'\mathbf{1}-\left|y-\hat{y}_f\right|'\mathbf{1}\right)+C_n(p).$

The performance of the two criteria expressed in (2.3) and (2.4) will be the same for large n. Let α1 be a subset of {1, 2, . . . , k − 1}, and let α0 denote the intercept. Denote the selected model by Mα, α = α1 ∪ α0, and let αo denote the set of all relevant predictor variables. The selected model belongs to one of the following classes:

• Optimal model: ℳo = {Mα : α = αo}

• Class of correct models: ℳc = {Mα : α ⊃ αo}

• Class of wrong models: ℳw = {Mα : α ⊉ αo}

Let CRpα* and CRpα** denote the values of the criterion corresponding to any correct model Mα* ∈ ℳc and any wrong model Mα** ∈ ℳw, with dimensions pα* and pα** respectively. ŷc and ŷw are the vectors of fitted values of the respective correct and wrong models. Under mild conditions, Theorem 1 establishes the consistency of the proposed criterion for fixed k.

### Condition 1

For any Mα* ∈ ℳc and Mα** ∈ ℳw, $\liminf_{n\to\infty}\left(\frac{\left|y-\hat{y}_w\right|'\mathbf{1}}{n}-\frac{\left|y-\hat{y}_c\right|'\mathbf{1}}{n}\right)>0$.

It is expected that the average absolute residual of a wrong model is greater than that of any correct model. Thus, the difference (|y − ŷw|′1/n − |y − ŷc|′1/n) is positive and large, and Condition 1 is reasonable.

### Condition 2

Cn(p) = o(n) and Cn(p) → ∞ as n → ∞.

Condition 2 is required to prove the following consistency property.

### Theorem 1. (Consistency Property)

Assume that the above conditions are satisfied. Then

$limn→∞ Pr(Mα=Mo)=1.$
Proof

From the definition of criterion,

$\begin{aligned}CR_{p_{\alpha^{**}}}-CR_{p_{\alpha^{*}}}&=\frac{\left|y-\hat{y}_w\right|'\mathbf{1}-\left|y-\hat{y}_f\right|'\mathbf{1}}{\frac{\tau}{2}\left(1+\frac{k-p_{\alpha^{**}}}{n-k+p_{\alpha^{**}}}\right)}-\frac{\left|y-\hat{y}_c\right|'\mathbf{1}-\left|y-\hat{y}_f\right|'\mathbf{1}}{\frac{\tau}{2}\left(1+\frac{k-p_{\alpha^{*}}}{n-k+p_{\alpha^{*}}}\right)}+C_n(p_{\alpha^{**}})-C_n(p_{\alpha^{*}})\\&=\frac{2}{\tau}\left(\left(1-\frac{k-p_{\alpha^{**}}}{n}\right)\left|y-\hat{y}_w\right|'\mathbf{1}-\left|y-\hat{y}_c\right|'\mathbf{1}\right)+\frac{2(k-p_{\alpha^{*}})}{\tau n}\left|y-\hat{y}_c\right|'\mathbf{1}+\frac{2(p_{\alpha^{*}}-p_{\alpha^{**}})}{\tau n}\left|y-\hat{y}_f\right|'\mathbf{1}+C_n(p_{\alpha^{**}})-C_n(p_{\alpha^{*}})\\&=\frac{2}{\tau}\left(\left(1-\frac{k-p_{\alpha^{**}}}{n}\right)\left|y-\hat{y}_w\right|'\mathbf{1}-\left|y-\hat{y}_c\right|'\mathbf{1}\right)+\xi_1+\xi_2+C_n(p_{\alpha^{**}})-C_n(p_{\alpha^{*}}),\end{aligned}$

where

$\xi_1=\frac{2(k-p_{\alpha^{*}})}{\tau n}\left|y-\hat{y}_c\right|'\mathbf{1}\quad\text{and}\quad\xi_2=\frac{2(p_{\alpha^{*}}-p_{\alpha^{**}})}{\tau n}\left|y-\hat{y}_f\right|'\mathbf{1}.$

For any selected model Mα,

$\left|y-\hat{y}_r\right|'\mathbf{1}-\left|y-\hat{y}_f\right|'\mathbf{1}\le\left|X\hat{\beta}-X\beta\right|'\mathbf{1}+\left|X_\alpha\hat{\beta}_\alpha-X_\alpha\beta_\alpha\right|'\mathbf{1}+\left|X\beta-X_\alpha\beta_\alpha\right|'\mathbf{1}.$

Whenever Mα ∈ ℳc, Xβ = Xαβα, and by the consistency and asymptotic normality of the LAD estimator (Dielman, 2005) we have |Xαβ̂α − Xαβα|′1 = Op(1), |y − ŷc|′1 = Op(1), |y − ŷf|′1 = Op(1), and consequently ξ1 = ξ2 = op(1). Hence,

$\begin{aligned}\liminf_{n\to\infty}\Pr\left(CR_{p_{\alpha^{**}}}-CR_{p_{\alpha^{*}}}>0\right)&=\liminf_{n\to\infty}\Pr\left(\frac{2}{\tau}\left(\left(1-\frac{k-p_{\alpha^{**}}}{n}\right)\left|y-\hat{y}_w\right|'\mathbf{1}-\left|y-\hat{y}_c\right|'\mathbf{1}\right)+o_p(1)+C_n(p_{\alpha^{**}})-C_n(p_{\alpha^{*}})>0\right)\\&\ge\Pr\left(\liminf_{n\to\infty}\frac{2}{\tau}\left(\left(1-\frac{k-p_{\alpha^{**}}}{n}\right)\left|y-\hat{y}_w\right|'\mathbf{1}-\left|y-\hat{y}_c\right|'\mathbf{1}\right)+o_p(1)+o_p(n)>0\right)=1.\end{aligned}$

Now, to complete the proof, it is sufficient to show that CRp selects the optimal model with probability one within the class of correct models. Let Dpαo and Dpα* be the values of Dp corresponding to the optimal model and a correct model respectively. Under Condition 2, we have

$\lim_{n\to\infty}\Pr(M_\alpha=M_o)=\lim_{n\to\infty}\Pr\left(CR_{p_{\alpha_o}}\le CR_{p_{\alpha^{*}}}\right)=\lim_{n\to\infty}\Pr\left(D_{p_{\alpha_o}}-D_{p_{\alpha^{*}}}<C_n(p_{\alpha^{*}})-C_n(p_{\alpha_o})\right)=\lim_{n\to\infty}\Pr\left(\chi^2_{p_{\alpha^{*}}-p_{\alpha_o}}<\infty\right)=1.$

Hence, CRp selects only the relevant predictor variables with probability one for large n.

### 2.1. Choice of τ

The CRp requires estimation of the unknown scale parameter τ. Birkes and Dodge (1993) gave the estimator $τ^1$ of τ and recommended using only the non-zero residuals to improve its performance. Dielman (2006) examined the performance of the likelihood ratio (LR), Wald, and Lagrange multiplier (LM) tests for hypotheses about the regression coefficients in LAD regression. He considered four different estimators $τ^2$, $τ^3$, $τ^4$, and $τ^5$ of τ in a comparative study of these significance tests and showed that estimators of this type perform well. In this study, we consider the following five existing estimators of τ to calculate CRp:

$\begin{aligned}\hat{\tau}_1&=\frac{\sqrt{m}\left(r_{(k_2)}-r_{(k_1)}\right)}{4},&k_1&=\left[\frac{m+1}{2}-\sqrt{m}\right],\ k_2=\left[\frac{m+1}{2}+\sqrt{m}\right],&m&=\sum_{i=1}^{n}I(r_i\neq0),\\\hat{\tau}_2&=\frac{\sqrt{m}\left(r_{(m-k_1+1)}-r_{(k_1)}\right)}{2z_{\alpha/2}},&k_1&=\left[\frac{m+1}{2}-z_{\alpha/2}\sqrt{\frac{m}{4}}\right],&m&=\sum_{i=1}^{n}I(r_i\neq0),\ \alpha=0.05,\\\hat{\tau}_3&=\frac{\sqrt{m}\left(r_{(m-k_1+1)}-r_{(k_1)}\right)}{2z_{\alpha/2}},&k_1&=\left[\frac{m+1}{2}-z_{\alpha/2}\sqrt{\frac{m}{4}}\right],&m&=n,\ \alpha=0.05,\\\hat{\tau}_4&=\frac{\sqrt{m}\left(r_{(m-k_1+1)}-r_{(k_1)}\right)}{2t_{\alpha/2}},&k_1&=\left[\frac{m+1}{2}-t_{\alpha/2}\sqrt{\frac{m}{4}}\right],&m&=\sum_{i=1}^{n}I(r_i\neq0),\ \alpha=0.05,\\\hat{\tau}_5&=\frac{\sqrt{m}\left(r_{(m-k_1+1)}-r_{(k_1)}\right)}{2t_{\alpha/2}},&k_1&=\left[\frac{m+1}{2}-t_{\alpha/2}\sqrt{\frac{m}{4}}\right],&m&=n,\ \alpha=0.05.\end{aligned}$

Here, r(·) denotes the ordered residuals of the full model, and [ · ] denotes the nearest positive integer. Only the non-zero residuals are used to estimate $τ^1$, $τ^2$, and $τ^4$, whereas all n residuals are used for $τ^3$ and $τ^5$. An exhaustive simulation in the next section compares the performance of these estimators.
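For illustration, the first estimator can be computed from the ordered non-zero residuals as follows (a sketch of our reading of the formula, with ordinary rounding standing in for the bracket operator):

```python
import numpy as np

def tau_hat_1(residuals, tol=1e-10):
    """tau_1 = sqrt(m) * (r_(k2) - r_(k1)) / 4 using only the m non-zero residuals."""
    r = np.sort(residuals[np.abs(residuals) > tol])   # ordered non-zero residuals
    m = r.size
    k1 = max(int(round((m + 1) / 2 - np.sqrt(m))), 1)
    k2 = min(int(round((m + 1) / 2 + np.sqrt(m))), m)
    return np.sqrt(m) * (r[k2 - 1] - r[k1 - 1]) / 4.0  # 1-based order statistics

rng = np.random.default_rng(1)
tau = tau_hat_1(rng.laplace(size=500))
```

For Laplace(0, 1) errors the target value is τ = 1/{2 f (0)} = 1, so the estimate should be near one for a sample of this size.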

3. Performance of CRp

In this section, an extensive simulation study examines the performance of the proposed criterion, and a real-data analysis demonstrates its applicability.

### 3.1. Simulation study

In this simulation study, we considered seven different penalties (Table 1). The four penalties P4–P7 satisfy Condition 2; the remaining penalties are functions of p only and do not satisfy Condition 2.
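This distinction can be checked numerically. The sketch below encodes the p-only penalties P1–P3 and the log-type penalties P4–P5 (the remaining penalties, which involve n more strongly, are omitted here) and verifies that P4 grows without bound in n while P4/n shrinks, as Condition 2 requires:

```python
import math

# penalty functions Cn(p); P1-P3 depend on p only, P4-P5 also grow with n
penalties = {
    "P1": lambda p, n: 2 * p,
    "P2": lambda p, n: 3 * p,
    "P3": lambda p, n: 2 * p * math.log(p),
    "P4": lambda p, n: p * math.log(n),
    "P5": lambda p, n: p * (math.log(n) + 1),
}

p = 4
values = [penalties["P4"](p, n) for n in (10**2, 10**4, 10**6)]   # -> infinity
ratios = [penalties["P4"](p, n) / n for n in (10**2, 10**4, 10**6)]  # -> 0
```

P1–P3 fail the second requirement of Condition 2 because they are constant in n.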

The independent predictor variables Xj, j = 1, 2, . . . , (k − 1), and the random errors are generated from the N(0, 1) distribution. Outliers are introduced artificially by multiplying by 20 the response values y corresponding to the largest absolute residuals. The simulation is carried out for sample sizes n = 30, 50, 70, 100, 200 and the two models described below:

• Model-I: β = (5, 2, 3, 4, 0, 0)

• Model-II: β = (5, 2, 3, 4, 2, 0, 0)

In both models, the response variable y is generated using (1.1). The performance of the proposed method is studied in terms of the percentage of optimal model selection. The percentages of optimal model selection in 1,000 runs are recorded in Tables 2 and 3. They show that CRp performs well for clean data as well as data with outliers, whereas outliers drastically affect AIC and BIC. RBIC performs uniformly better than RAIC. The performance of the CRp criterion with P3–P7 over RBIC is remarkable, and the penalties P3–P7 select the optimal model with a larger percentage than the other penalties. It is observed that $τ^1$, $τ^2$, and $τ^4$ perform better than $τ^3$ and $τ^5$; hence, using only the non-zero residuals to estimate τ yields a good percentage for small as well as large sample sizes. $τ^4$ performs better than the others for small sample sizes, while $τ^2$ and $τ^4$ perform equally well for large sample sizes; thus $τ^4$ performs well in both cases, and we use $τ^4$ as the estimator of τ for the remainder of the study. The CRp criterion with all penalties improves as the sample size increases. The simulation study confirms the consistency property of the CRp criterion with the P4–P7 penalties.

### 3.2. Real data (Hald cement data)

The performance of the proposed criterion is examined with real-life data. This section analyzes the Hald cement dataset (Ronchetti and Staudte, 1994). The Hald cement data consist of 13 observations on the heat evolved in calories per gram of cement (y) and four ingredients in the mixture: tricalcium aluminate (X1), tricalcium silicate (X2), tetracalcium aluminoferrite (X3), and dicalcium silicate (X4). Many researchers have considered this data for the model selection problem and suggested the predictor variables X1, X2 for the Hald data. The 6th observation has the maximum absolute residual, so an outlier is introduced by replacing the 6th observation with 200 (Ronchetti and Staudte, 1994; Kashid and Kulkarni, 2002). The values of CRp, AIC, BIC, RAIC, and RBIC for all possible subsets are recorded for the original and outlier data in Tables 4 and 5. It is observed that the presence of the outlier does not affect the value of CRp: the CRp criterion with all penalties selects the variables X1, X2 for clean data as well as outlier data. The AIC criterion selects X1, X2, X4 for clean data and X1, X4 in the case of an outlier. BIC selects X1, X2 for clean data, but only X4 in the case of an outlier. RAIC and RBIC select only X3 in clean data, and only X1 in the presence of an outlier.

The selection of a model from all possible subsets becomes more complicated and time consuming as the number of predictor variables increases. For example, if k − 1 = 30, then more than a billion subsets must be checked for model selection. In this situation, it is reasonable to use a kick-off (Rao and Wu, 1989) or stepwise approach.

4. Algorithms for model selection

The kick-off method is based on the OLS estimator, which is not robust to outliers in the data. To overcome this problem, we have modified the kick-off approach for variable selection based on the LAD estimator. The CRp based kick-off method is explained below.

• Kick-off method

• 1) Calculate D−i = CRk−i − Cn(k), where CRk−i is the value of the criterion for the model excluding the ith predictor variable and Cn(k) is the penalty function of the full model.

• 2) If D−i ≤ 0 then βi = 0, else βi ≠ 0, i = 1, 2, . . . , k − 1. Hence, select the predictor variables for which D−i > 0.
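The two steps above can be sketched as follows (our illustrative implementation, not the authors' code, using a linear-programming LAD fit; the data, τ estimate, and penalty are made up):

```python
import numpy as np
from scipy.optimize import linprog

def lad_sar(X, y):
    """Fit LAD by linear programming; return the minimized sum of absolute residuals."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])   # minimize sum(u + v)
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])        # X b + u - v = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    return linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs").fun

def kick_off(X, y, tau_hat, Cn):
    """Keep predictor i iff D_{-i} = CR_{k-i} - C_n(k) > 0 (column 0 = intercept)."""
    n, k = X.shape
    sar_full = lad_sar(X, y)
    keep = []
    for i in range(1, k):
        cols = [j for j in range(k) if j != i]
        sar_i = lad_sar(X[:, cols], y)                  # model without X_i
        p = k - 1
        cr = (sar_i - sar_full) / ((tau_hat / 2) * (1 + (k - p) / (n - k + p))) + Cn(p, n)
        if cr - Cn(k, n) > 0:
            keep.append(i)
    return keep

rng = np.random.default_rng(2)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X @ np.array([5.0, 2.0, 3.0, 0.0, 0.0]) + rng.laplace(size=n)
selected = kick_off(X, y, tau_hat=1.0, Cn=lambda p, n: p * np.log(n))
```

With the two relevant predictors carrying large coefficients, they are retained with high probability, while the irrelevant ones usually produce D−i ≤ 0.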

Alternative sequential and stepwise algorithms are described below. Let Տ be the index set of the selected predictor variables. The sum of absolute residuals for Տ = ∅ is |y − Median(y)|′1.

• Sequential method

• 1) Obtain the LAD estimator β̂ of the full model and use the statistical test of Birkes and Dodge (1993, pp. 76–77) to test the null hypothesis H0 : β{j:|β̂j|≤Median(|β̂j|)} = 0. If the null hypothesis is rejected at the α level of significance, then repeat Steps 3.1–3.3 until the final model is obtained; if the null hypothesis is not rejected, then repeat Steps 2.1–2.3.

• 2) Forward direction:

• 2.1) Initially, consider the null set, Տ = ∅.

• 2.2) Add a new jth predictor variable to the previous set Տ if CRp(Տ) − CRp(Տ ∪ {j}) > 0 and maximum, i.e., the difference is positive and largest over all unselected predictor variables (ℱ).

• 2.3) Repeat Step 2.2 until no other variable is selected.

• 3) Backward direction:

• 3.1) Initially, consider the full set, Տ = {1, 2, . . . , k − 1}.

• 3.2) Delete the predictor variable Xl if CRp(Տ) − CRp(Տ − {l}) ≥ 0 and maximum, i.e., the difference is non-negative and largest over all selected predictor variables (Տ).

• 3.3) Repeat Step 3.2 until no other variable is deleted.

• Stepwise method

• 1) Initially, consider null set.

• 2) Add a new jth ∈ ℱ predictor variable to the previous set Տ if CRp(Տ) − CRp(Տ ∪ {j}) > 0 and maximum.

• a) If no new predictor variable is added to the null set or a singleton set, then stop.

• b) If |Տ| ≤ 2, then repeat Step 2; else go to the next step. |Տ| is the cardinality of the set Տ.

• 3) Delete the predictor variable Xl if CRp(Տ) − CRp(Տ − {l}) ≥ 0 and maximum, and go to Step 2.

• 4) Continue Steps 2 and 3 until the set Տ does not change.

### 4.1. Addition and deletion criteria

• Addition: The new jth ∈ ℱ predictor variable is added to the previous set Տ if CRp(Տ) − CRp(Տ ∪ {j}) > 0 and maximum,

i.e., CRp(Տ) − CRp(Տ ∪ {j}) maximum

$\Leftrightarrow\frac{2}{\tau n}\left((n-k+|Տ|)\left|y-\hat{y}_{Տ}\right|'\mathbf{1}-(n-k+|Տ|+1)\left|y-\hat{y}_{Տ\cup j}\right|'\mathbf{1}\right)+\frac{2}{\tau n}\left|y-\hat{y}_f\right|'\mathbf{1}+C_n(|Տ|)-C_n(|Տ|+1)\ \text{maximum}$

$\Leftrightarrow\frac{2}{\tau n}(n-k+|Տ|)\left(\left|y-\hat{y}_{Տ}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ\cup j}\right|'\mathbf{1}-\frac{1}{n-k+|Տ|}\left|y-\hat{y}_{Տ\cup j}\right|'\mathbf{1}\right)\ \text{maximum}$

$\Leftrightarrow\psi_1=\frac{2}{\tau}\left(\left|y-\hat{y}_{Տ}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ\cup j}\right|'\mathbf{1}\right)\ \text{maximum}$

Here, ŷՏ and ŷՏ∪j are the vectors of fitted values obtained from the sets of predictor variables Տ and Տ ∪ {j} respectively. The ψ1 follows the χ² distribution with one degree of freedom (Birkes and Dodge, 1993); therefore, select Xj if ψ1 is maximum and CRp(Տ) − CRp(Տ ∪ {j}) > 0.

• Deletion: We delete the predictor variable Xl from the existing set Տ if CRp(Տ) − CRp(Տ − {l}) ≥ 0 and maximum over all selected predictor variables,

i.e., CRp(Տ) − CRp(Տ − {l}) maximum

⇔ CRp(Տ − {l}) − CRp(Տ) minimum

$\Leftrightarrow\frac{2}{\tau n}\left((n-k+|Տ|-1)\left|y-\hat{y}_{Տ-l}\right|'\mathbf{1}-(n-k+|Տ|)\left|y-\hat{y}_{Տ}\right|'\mathbf{1}\right)+\frac{2}{\tau n}\left|y-\hat{y}_f\right|'\mathbf{1}+C_n(|Տ|-1)-C_n(|Տ|)\ \text{minimum}$

$\Leftrightarrow\frac{2}{\tau n}(n-k+|Տ|-1)\left(\left|y-\hat{y}_{Տ-l}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ}\right|'\mathbf{1}-\frac{1}{n-k+|Տ|-1}\left|y-\hat{y}_{Տ}\right|'\mathbf{1}\right)\ \text{minimum}$

$\Leftrightarrow\psi_2=\frac{2}{\tau}\left(\left|y-\hat{y}_{Տ-l}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ}\right|'\mathbf{1}\right)\ \text{minimum}$

The ψ2 follows the χ² distribution with one degree of freedom (Birkes and Dodge, 1993); delete Xl if ψ2 is minimum and CRp(Տ) − CRp(Տ − {l}) ≥ 0.

Alternatively, for large n we can select Xj if $ψ1>χα,12$ and delete Xl if $ψ2<χα,12$. Thus, the minimization-based and distribution-based addition and deletion rules are equivalent.
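A single forward-addition step under these rules can be sketched as follows (illustrative code, not the authors' implementation; it uses the large-n χ² threshold and an assumed value of τ):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import chi2

def lad_sar(X, y):
    """Minimized sum of absolute residuals of the LAD fit, via linear programming."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])        # X b + u - v = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    return linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs").fun

def forward_step(X, y, S, candidates, tau_hat, alpha=0.05):
    """One addition step: choose j maximizing psi1; add it if psi1 > chi2_{alpha,1}."""
    sar_S = lad_sar(X[:, [0] + S], y)                   # column 0 is the intercept
    best_j, best_psi = None, -np.inf
    for j in candidates:
        psi1 = (2 / tau_hat) * (sar_S - lad_sar(X[:, [0] + S + [j]], y))
        if psi1 > best_psi:
            best_j, best_psi = j, psi1
    return S + [best_j] if best_psi > chi2.ppf(1 - alpha, df=1) else S

rng = np.random.default_rng(3)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([5.0, 2.0, 0.0, 0.0]) + rng.laplace(size=n)
S1 = forward_step(X, y, S=[], candidates=[1, 2, 3], tau_hat=1.0)
```

Repeating this step, together with the analogous ψ2-based deletion step, yields the stepwise procedure of Section 4.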

Corollary 1

The kick-off, sequential and stepwise algorithms select the optimal model with probability one for a large sample size.

Proof

The proof is given separately for kick-off algorithm and other two algorithms.

• Kick-off method: If a relevant predictor variable is deleted, then the reduced model belongs to ℳw. For the full model, Dp = 0, and the full model belongs to ℳc. By (2.7),

$\liminf_{n\to\infty}\Pr\left(D_{-i}>0\right)=1.$

Similarly, if an irrelevant predictor variable is deleted, then the reduced model belongs to ℳc. By Condition 2 and |y − ŷc|′1 = |y − ŷf|′1 = Op(1),

$\begin{aligned}\liminf_{n\to\infty}\Pr\left(D_{-i}<0\right)&=\liminf_{n\to\infty}\Pr\left(CR_{k-i}-C_n(k)<0\right)\\&=\liminf_{n\to\infty}\Pr\left(O_p(1)+C_n(k-1)-C_n(k)<0\right)\\&=\Pr\left(\liminf_{n\to\infty}O_p(1)+C_n(k-1)-C_n(k)<0\right)=1.\end{aligned}$

Hence, the kick-off method selects only relevant predictor variables with probability one for large n.

• Stepwise and sequential method:

• ○ Addition: Let r1 ∈ ℱ and r2 ∈ ℱ be indices corresponding to a relevant and an irrelevant predictor variable respectively. After adding r1 to the present set Տ, the value of |y − ŷ|′1 is smaller than after adding r2 to Տ; equivalently, |y − ŷՏ∪{r1}|′1 < |y − ŷՏ∪{r2}|′1 holds ∀r1, r2 ∈ ℱ.

Since pՏ∪{r1} = pՏ∪{r2} = s1 (say),

$\begin{aligned}CR_p(Տ\cup\{r_2\})-CR_p(Տ\cup\{r_1\})&=\frac{2}{\tau}\left(\left(1-\frac{k-s_1}{n}\right)\left|y-\hat{y}_{Տ\cup\{r_2\}}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ\cup\{r_1\}}\right|'\mathbf{1}\right)+\frac{2(k-s_1)}{\tau n}\left|y-\hat{y}_{Տ\cup\{r_1\}}\right|'\mathbf{1}\\&=\frac{2}{\tau}\left(1-\frac{k-s_1}{n}\right)\left(\left|y-\hat{y}_{Տ\cup\{r_2\}}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ\cup\{r_1\}}\right|'\mathbf{1}\right)\end{aligned}$

and

$\begin{aligned}\liminf_{n\to\infty}\Pr\left(CR_p(Տ\cup\{r_2\})>CR_p(Տ\cup\{r_1\})\right)&\ge\Pr\left(\liminf_{n\to\infty}\frac{2}{\tau}\left(1-\frac{k-s_1}{n}\right)\left(\left|y-\hat{y}_{Տ\cup\{r_2\}}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ\cup\{r_1\}}\right|'\mathbf{1}\right)>0\right)=1\\\Rightarrow\liminf_{n\to\infty}\Pr\left(CR_p(Տ)-CR_p(Տ\cup\{r_1\})>CR_p(Տ)-CR_p(Տ\cup\{r_2\})\right)&=1.\end{aligned}$

• ○ Deletion: Suppose r3 ∈ Տ and r4 ∈ Տ are indices corresponding to a relevant and an irrelevant predictor variable respectively. If we delete r3 or r4 from the present set Տ, then |y − ŷՏ−r3|′1 > |y − ŷՏ−r4|′1 holds ∀r3, r4 ∈ Տ.

Since pՏ−r3 = pՏ−r4 = s2 (say),

$\begin{aligned}CR_p(Տ-r_3)-CR_p(Տ-r_4)&=\frac{2}{\tau}\left(\left(1-\frac{k-s_2}{n}\right)\left|y-\hat{y}_{Տ-r_3}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ-r_4}\right|'\mathbf{1}\right)+\frac{2(k-s_2)}{\tau n}\left|y-\hat{y}_{Տ-r_4}\right|'\mathbf{1}\\&=\frac{2}{\tau}\left(1-\frac{k-s_2}{n}\right)\left(\left|y-\hat{y}_{Տ-r_3}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ-r_4}\right|'\mathbf{1}\right)\end{aligned}$

and

$\begin{aligned}\liminf_{n\to\infty}\Pr\left(CR_p(Տ-r_3)>CR_p(Տ-r_4)\right)&\ge\Pr\left(\liminf_{n\to\infty}\frac{2}{\tau}\left(1-\frac{k-s_2}{n}\right)\left(\left|y-\hat{y}_{Տ-r_3}\right|'\mathbf{1}-\left|y-\hat{y}_{Տ-r_4}\right|'\mathbf{1}\right)>0\right)=1\\\Rightarrow\liminf_{n\to\infty}\Pr\left(CR_p(Տ)-CR_p(Տ-r_3)<CR_p(Տ)-CR_p(Տ-r_4)\right)&=1.\end{aligned}$

• Stopping: By Theorem 1, if the present set Տ is the index set corresponding to the optimal model, then

$\lim_{n\to\infty}\Pr\left(CR_p(Տ)-CR_p(Տ\cup\{r_2\})<0\right)=1,\ \forall r_2\in\mathcal{F},\qquad\lim_{n\to\infty}\Pr\left(CR_p(Տ)-CR_p(Տ-r_3)<0\right)=1,\ \forall r_3\in Տ.$

By (4.4), (4.6), and (4.7), the addition of relevant predictor variables (r1 ∈ ℱ) and the deletion of irrelevant predictor variables (r4 ∈ Տ) continue until the optimal model is obtained, and the algorithms select the optimal model with probability one for large n.

The CRp, AIC, BIC, RAIC, and RBIC criteria require computing 2^(k−1) − 1 criterion values to select the optimal model, whereas the kick-off method needs to check only k − 1 criterion values. In the sequential method, Step 1 fixes the forward or backward direction to save time; the sequential method therefore requires $1+\sum_{i=\max(p_{\alpha_o},\,k-p_{\alpha_o}-1)}^{k-1}i$ criterion values, where pαo is the actual number of relevant predictor variables. The stepwise method may require more steps than the sequential algorithm, but never more than 2^(k−1) − 1. Hence, the stepwise method is more time consuming than the other two algorithms.
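For concreteness, the three evaluation counts can be compared directly (the number of relevant predictors here is illustrative):

```python
# number of criterion evaluations for k - 1 = 30 candidate predictors
k = 31
all_subsets = 2 ** (k - 1) - 1          # exhaustive search over all subsets
kick_off_evals = k - 1                  # one leave-one-out model per predictor
p_opt = 11                              # e.g. 10 relevant predictors + intercept (assumed)
sequential_evals = 1 + sum(range(max(p_opt, k - p_opt - 1), k))  # sum over i = max(.) .. k-1
```

The gap between the exhaustive count and the algorithmic counts grows exponentially in k, which is why the algorithms are the practical choice for large k.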

### 4.2. Performance and scalability of algorithms

### 4.2.1. Simulation study

The performance of the LAD estimator based algorithms is studied through simulation. The predictor variables Xj, j = 1, 2, . . . , k − 1, are generated from N(0, Σ), where Σ is a symmetric positive definite matrix with Σii = 1, i = 1, 2, . . . , k − 1 and Σij = 0.25, i ≠ j. The errors are generated from the N(0, 1) distribution, and the response variable is generated using the regression coefficients β = (5, 2, . . . , 2, 0, . . . , 0), where 5 is the intercept and only 10 of the 100 regression coefficients are non-zero; therefore, only 10% of the predictor variables are significant in the simulated data. Outliers are introduced by multiplying by 20 the response values y corresponding to the largest absolute residuals. The simulation is carried out for sample sizes n = 200, 300, 400, 500, and the results are recorded in Tables 6–8. In each table, we record the percentage of optimal model selection in 1,000 runs, considering only the P4–P7 penalties. The criterion is expected to select only the first 10 predictor variables.

The simulation is carried out for 2%, 4%, 6%, 8%, and 10% contamination by outliers, and the results are given in Table 6. The kick-off method performs poorly compared to the sequential and stepwise methods. The stepwise method performs better than the sequential method for large k/n ratios, while both perform equally well for small k/n. The kick-off method also performs well for k/n = 1/5, selecting the optimal model with at least 75% accuracy.

The performance of the algorithms for non-normal error distributions has been studied using the same model described above. Simulation results (Table 7) are presented for the N(0, 3), 0.95N(0, 1) + 0.05N(0, 3), 0.9N(0, 1) + 0.1N(0, 3), t2, slash, Cauchy(0, 1), and Laplace(0, 1) error distributions. The sequential and stepwise methods again perform well compared to the kick-off method. All algorithms have low percentages of optimal model selection for large k/n under the slash and Cauchy distributions; however, the percentage increases as the sample size increases. It is also observed that the performance of the criterion varies with the penalty function.

We also checked the proficiency of the algorithms in the presence of multicollinearity (Table 8). The capability of the methods is assessed by varying cov(Xi, Xj) = Σij; a high value of Σij indicates severe multicollinearity. The percentage of optimal model selection decays quickly for larger values of Σij and k/n. The algorithms select the optimal model with high precision up to moderate multicollinearity, and with high precision for small k/n.

Thus, algorithms select an optimal model with a high percentage, and the performance mostly depends on the ratio k/n. The sequential method is faster than stepwise and performs well compared to the kick-off.

### 4.2.2. Body fat dataset

The body fat dataset is freely available in R software and contains physical measurements of 252 males. Measuring body fat directly is difficult compared to measuring height and weight, and the physical measurements are informative for obtaining the percentage of body fat. The percentage of body fat is calculated from density using Brozek's equation and is considered as the response variable. The other 15 variables listed in Table 9 are considered as predictor variables. The data contain outliers, and the residuals do not follow a normal distribution (Figure 1); consequently, non-resistant model selection methods are not appropriate for this dataset. The predictor variables selected by the proposed criterion with different penalties (CRp), AIC, BIC, RAIC, RBIC, and the algorithms (kick-off (KOp), sequential (Sp), and stepwise (STp)) are indicated in Table 9. The CRp, KOp, and STp select only two predictor variables, weight and fat-free weight; the sequential method selects one more predictor variable, abdomen circumference. The AIC, BIC, RAIC, and RBIC select different predictor variables. For a detailed study, we compared the prediction error of these methods.

70% (176) of the observations are randomly chosen to select the significant predictor variables, and the remaining 30% (76) observations are used to calculate the prediction error using the selected predictor variables and their LAD estimator. The performance of the proposed criterion with different penalties (CRp), AIC, BIC, RAIC, RBIC, and the algorithms (kick-off (KOp), sequential (Sp), and stepwise (STp)) is examined by the root mean square prediction error, $RMSPE=\sqrt{\sum_{i=1}^{76}(y_i-\hat{y}_i)^2/76}$. This procedure is repeated 1,000 times. In Figure 2, boxplots of the RMSPE for the different methods are plotted. The CRp criterion and the sequential and stepwise methods have a small RMSPE with low variation, and the kick-off method with all penalties except P6 and P7 also has a small RMSPE with small variation. The RMSPE of the models selected by AIC and BIC is smaller than that of RAIC and RBIC; the RMSPE of RAIC and RBIC indicates that both criteria do not select a good model for this dataset. Thus, the real-life example reveals the scalability and stability of the algorithms.
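The RMSPE used above is straightforward to compute; a short sketch with made-up hold-out values (76 observations, as in the split described above):

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square prediction error over the hold-out set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# hypothetical hold-out illustration: 76 of 252 observations
rng = np.random.default_rng(4)
y_true = rng.normal(loc=19.0, scale=8.0, size=76)
err = rmspe(y_true, y_true + rng.normal(scale=1.0, size=76))
```

In the comparison above, this quantity is computed once per random split and the 1,000 resulting values are summarized in boxplots.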

5. Discussion

We have studied a robust model selection method with a class of different penalties. It is observed that the criterion with a penalty satisfying Condition 2 performs well, and the model selection criterion is shown to be consistent. The CRp criterion becomes time consuming as the number of predictor variables (k) increases; the LAD estimator based algorithms are a good option to overcome this problem. These algorithms work well for data with outliers as well as non-normal error terms. The time required to select an optimal model with these algorithms is less than that of searching all possible subsets, and the sequential method is preferable. The criterion based algorithms are therefore shown to be robust, consistent, and fast.

Acknowledgement

We sincerely thank the Editor, Associate Editor and anonymous reviewers for their careful review and constructive comments which led to the significant improvement of this article.

Figures
Fig. 1. Normal probability plot.
Fig. 2. Box plot of root mean square prediction error.
TABLES

### Table 1

Penalty functions

| Sr. No. | Penalty function Cn(p) |
|---------|------------------------|
| 1 | $P_1 = 2p$ |
| 2 | $P_2 = 3p$ |
| 3 | $P_3 = 2p\log(p)$ |
| 4 | $P_4 = p\log(n)$ |
| 5 | $P_5 = p(\log(n)+1)$ |
| 6 | $P_6 = \sqrt{pn}$ |
| 7 | $P_7 = \sqrt{p(n+2)}$ |

### Table 2

Percentage of optimal model selection (Model-I)

Entries are percentages. CRp is reported for each variance estimator τ̂1–τ̂5 under penalties P1–P7; AIC, BIC, RAIC, and RBIC do not depend on τ̂ and are reported once per setting.

| n | No. of outliers | τ̂ | P1 | P2 | P3 | P4 | P5 | P6 | P7 | AIC | BIC | RAIC | RBIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|300|0|τ̂1|74.0|86.1|95.7|88.8|94.5|96.7|98.6|64.6|84.0|54.7|92.0|
| | |τ̂2|87.1|95.0|98.8|96.4|98.3|99.2|99.5| | | | |
| | |τ̂3|66.9|78.6|90.4|81.8|88.2|91.8|95.7| | | | |
| | |τ̂4|91.8|97.6|99.5|98.0|99.4|99.4|98.8| | | | |
| | |τ̂5|65.3|77.7|89.5|81.1|87.1|91.0|95.3| | | | |
| |1|τ̂1|65.7|79.4|93.0|83.0|90.5|94.9|98.2|13.9|12.5|43.5|84.8|
| | |τ̂2|80.4|91.1|98.6|93.9|97.9|98.8|99.2| | | | |
| | |τ̂3|57.1|70.5|85.1|73.8|81.3|86.8|92.6| | | | |
| | |τ̂4|87.2|96.1|99.7|97.2|99.2|99.6|98.3| | | | |
| | |τ̂5|55.6|69.2|84.3|72.9|80.0|86.0|91.5| | | | |
| |2|τ̂1|64.2|77.6|90.9|80.5|88.8|92.6|96.5|3.3|2.0|43.3|80.8|
| | |τ̂2|78.5|89.2|97.1|92.0|95.8|97.6|97.9| | | | |
| | |τ̂3|54.6|69.4|84.7|73.6|80.6|86.4|92.3| | | | |
| | |τ̂4|85.5|93.3|98.4|94.6|97.8|98.4|97.5| | | | |
| | |τ̂5|53.7|68.3|83.5|72.0|79.4|86.0|91.8| | | | |
| |3|τ̂1|62.0|77.2|90.4|80.1|86.8|91.9|96.0|2.4|0.7|40.3|78.7|
| | |τ̂2|77.7|87.9|95.8|90.4|94.5|96.1|98.2| | | | |
| | |τ̂3|54.3|66.8|80.0|70.3|77.0|82.6|88.2| | | | |
| | |τ̂4|83.4|92.8|98.1|94.8|97.2|98.2|97.1| | | | |
| | |τ̂5|53.0|65.4|79.1|68.9|76.6|81.4|87.8| | | | |
|500|0|τ̂1|73.6|85.7|95.7|91.9|95.0|98.4|99.3|64.5|87.8|57.7|94.1|
| | |τ̂2|89.0|95.5|99.0|98.1|99.0|99.9|99.9| | | | |
| | |τ̂3|66.4|79.9|91.5|88.0|91.2|95.8|98.1| | | | |
| | |τ̂4|88.3|95.3|98.8|98.0|98.7|99.9|99.9| | | | |
| | |τ̂5|77.7|88.3|96.2|93.6|96.2|98.8|99.2| | | | |
| |1|τ̂1|68.9|83.7|93.7|89.5|93.6|98.2|99.4|23.6|22.8|57.1|90.8|
| | |τ̂2|85.9|94.4|99.2|97.8|99.0|99.7|99.9| | | | |
| | |τ̂3|62.9|76.9|90.0|85.7|89.8|94.7|97.5| | | | |
| | |τ̂4|85.3|94.1|99.1|97.8|99.0|99.6|99.9| | | | |
| | |τ̂5|74.5|85.3|95.1|91.1|94.9|98.6|99.3| | | | |
| |2|τ̂1|68.6|80.9|93.1|88.1|92.9|97.4|99.0|7.8|4.5|52.9|88.3|
| | |τ̂2|84.4|93.1|98.7|97.1|98.6|99.5|99.8| | | | |
| | |τ̂3|64.0|76.3|89.6|83.1|88.9|94.7|96.8| | | | |
| | |τ̂4|83.6|92.7|98.6|96.5|98.6|99.5|99.7| | | | |
| | |τ̂5|72.7|84.7|94.6|90.1|94.1|98.0|98.9| | | | |
| |3|τ̂1|67.7|80.9|92.3|86.8|92.2|97.1|98.8|6.1|1.5|50.3|86.4|
| | |τ̂2|83.4|92.8|98.2|96.4|98.1|99.4|99.9| | | | |
| | |τ̂3|60.8|74.6|86.8|81.1|86.0|93.8|97.0| | | | |
| | |τ̂4|82.9|92.7|98.1|96.1|97.9|99.4|99.9| | | | |
| | |τ̂5|71.6|83.6|93.6|88.8|92.9|97.4|99.0| | | | |
|700|0|τ̂1|74.1|85.9|96.1|94.8|96.5|98.7|99.5|65.7|90.7|62.3|95.3|
| | |τ̂2|89.0|96.0|98.7|98.1|98.9|100.0|100.0| | | | |
| | |τ̂3|78.8|90.8|96.5|95.4|96.9|99.2|99.8| | | | |
| | |τ̂4|88.4|95.7|98.6|98.0|98.8|100.0|100.0| | | | |
| | |τ̂5|77.9|90.3|96.4|95.4|96.7|99.2|99.8| | | | |
| |1|τ̂1|72.3|85.4|94.2|92.1|94.7|98.6|99.4|36.6|33.4|61.1|93.9|
| | |τ̂2|88.2|94.0|98.5|97.4|98.7|99.9|99.9| | | | |
| | |τ̂3|77.7|88.7|95.8|94.0|96.1|99.4|99.7| | | | |
| | |τ̂4|87.9|93.6|98.2|97.4|98.6|99.9|99.9| | | | |
| | |τ̂5|77.0|88.1|95.4|94.0|95.8|99.4|99.7| | | | |
| |2|τ̂1|69.8|83.1|95.0|91.8|95.5|99.0|99.7|16.4|9.8|56.9|92.6|
| | |τ̂2|84.4|94.8|98.7|97.9|99.3|100.0|100.0| | | | |
| | |τ̂3|74.6|87.3|96.4|94.2|96.8|99.5|99.9| | | | |
| | |τ̂4|83.9|94.4|98.7|97.7|99.0|100.0|100.0| | | | |
| | |τ̂5|74.3|86.7|96.0|94.0|96.7|99.5|99.8| | | | |
| |3|τ̂1|69.9|81.8|93.4|90.1|94.0|98.5|99.1|10.4|4.1|54.7|90.8|
| | |τ̂2|83.7|92.5|98.2|97.1|98.3|99.6|99.8| | | | |
| | |τ̂3|73.8|85.9|94.7|93.0|95.7|99.1|99.4| | | | |
| | |τ̂4|83.1|92.4|98.2|97.0|98.3|99.6|99.8| | | | |
| | |τ̂5|73.3|85.5|94.6|92.6|95.4|99.1|99.3| | | | |
|1000|0|τ̂1|72.1|85.3|95.1|94.1|96.4|99.5|99.8|67.8|92.6|66.5|95.7|
| | |τ̂2|88.8|96.3|99.1|98.8|99.4|100.0|100.0| | | | |
| | |τ̂3|78.8|89.2|96.9|96.4|97.4|99.7|100.0| | | | |
| | |τ̂4|88.6|96.2|99.1|98.8|99.4|100.0|100.0| | | | |
| | |τ̂5|78.4|89.1|96.9|96.4|97.3|99.7|100.0| | | | |
| |1|τ̂1|69.5|83.4|93.4|92.2|95.9|99.7|99.9|47.1|43.9|62.6|94.2|
| | |τ̂2|88.1|95.0|99.7|99.4|99.8|100.0|100.0| | | | |
| | |τ̂3|75.7|87.7|96.1|95.1|97.8|99.7|99.9| | | | |
| | |τ̂4|87.9|94.6|99.7|99.4|99.7|100.0|100.0| | | | |
| | |τ̂5|75.4|87.5|95.9|94.9|97.5|99.7|99.9| | | | |
| |2|τ̂1|67.1|81.0|93.6|91.2|95.4|99.2|99.5|30.6|17.8|58.2|93.3|
| | |τ̂2|85.9|95.0|98.5|98.1|99.0|99.9|100.0| | | | |
| | |τ̂3|74.4|86.4|96.1|95.1|96.9|99.3|99.7| | | | |
| | |τ̂4|85.7|95.0|98.4|98.1|99.0|99.9|100.0| | | | |
| | |τ̂5|73.9|86.1|95.7|95.0|96.9|99.3|99.7| | | | |
| |3|τ̂1|69.0|82.1|94.1|93.4|96.0|99.1|99.8|22.5|8.5|58.2|94.2|
| | |τ̂2|88.1|95.0|98.8|98.2|99.1|99.9|100.0| | | | |
| | |τ̂3|74.6|87.2|95.7|94.9|97.1|99.5|99.8| | | | |
| | |τ̂4|87.7|95.0|98.8|98.2|99.1|99.9|100.0| | | | |
| | |τ̂5|74.5|87.0|95.6|94.7|96.9|99.5|99.8| | | | |
|2000|0|τ̂1|69.7|83.3|94.0|94.6|97.1|100.0|100.0|67.0|94.6|66.0|97.0|
| | |τ̂2|88.5|96.0|99.7|99.8|99.8|100.0|100.0| | | | |
| | |τ̂3|82.3|91.7|98.0|98.7|99.3|100.0|100.0| | | | |
| | |τ̂4|88.4|95.9|99.7|99.7|99.8|100.0|100.0| | | | |
| | |τ̂5|81.9|91.7|98.0|98.6|99.3|100.0|100.0| | | | |
| |1|τ̂1|70.9|83.4|95.3|95.9|97.5|100.0|100.0|65.4|72.9|66.5|97.5|
| | |τ̂2|90.3|96.6|99.3|99.7|99.8|100.0|100.0| | | | |
| | |τ̂3|82.2|92.7|98.2|98.6|99.4|100.0|100.0| | | | |
| | |τ̂4|90.1|96.6|99.3|99.7|99.8|100.0|100.0| | | | |
| | |τ̂5|82.1|92.6|98.2|98.6|99.3|100.0|100.0| | | | |
| |2|τ̂1|67.7|81.1|93.7|94.4|97.0|99.9|100.0|62.9|54.4|67.5|96.4|
| | |τ̂2|87.3|95.6|99.3|99.3|99.7|100.0|100.0| | | | |
| | |τ̂3|80.0|90.3|98.3|98.5|98.9|100.0|100.0| | | | |
| | |τ̂4|87.0|95.4|99.2|99.3|99.7|100.0|100.0| | | | |
| | |τ̂5|79.6|90.1|98.3|98.5|98.9|100.0|100.0| | | | |
| |3|τ̂1|69.5|82.3|93.4|94.2|96.5|100.0|100.0|52.1|37.6|67.3|96.2|
| | |τ̂2|86.8|95.3|99.5|99.6|99.9|100.0|100.0| | | | |
| | |τ̂3|81.2|91.3|97.9|98.3|99.3|100.0|100.0| | | | |
| | |τ̂4|86.8|95.1|99.5|99.6|99.9|100.0|100.0| | | | |
| | |τ̂5|81.0|91.2|97.9|98.3|99.3|100.0|100.0| | | | |

### Table 3

Percentage of optimal model selection (Model-II)

Entries are percentages. CRp is reported for each variance estimator τ̂1–τ̂5 under penalties P1–P7; AIC, BIC, RAIC, and RBIC do not depend on τ̂ and are reported once per setting.

| n | No. of outliers | τ̂ | P1 | P2 | P3 | P4 | P5 | P6 | P7 | AIC | BIC | RAIC | RBIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|300|0|τ̂1|78.0|88.2|97.4|90.7|95.2|97.5|99.2|61.1|77.7|48.9|89.2|
| | |τ̂2|89.2|95.7|99.2|97.2|99.0|99.0|96.0| | | | |
| | |τ̂3|54.3|66.9|85.4|70.8|78.6|85.4|91.0| | | | |
| | |τ̂4|87.9|95.3|99.3|96.7|98.9|99.0|97.0| | | | |
| | |τ̂5|52.6|65.3|84.4|68.9|77.7|84.5|90.7| | | | |
| |1|τ̂1|71.5|81.8|95.2|86.2|92.1|95.5|97.0|9.7|8.8|44.9|84.4|
| | |τ̂2|83.8|92.9|97.6|94.7|96.9|97.2|94.9| | | | |
| | |τ̂3|50.2|62.0|77.7|64.8|71.7|78.0|85.7| | | | |
| | |τ̂4|82.6|92.3|97.7|94.1|96.8|97.3|95.8| | | | |
| | |τ̂5|48.3|61.1|76.6|63.9|70.4|76.8|84.9| | | | |
| |2|τ̂1|70.2|82.6|94.6|85.1|91.6|94.8|97.6|1.9|1.2|39.9|82.0|
| | |τ̂2|83.7|92.6|97.8|94.2|96.5|97.3|95.2| | | | |
| | |τ̂3|46.4|60.5|77.5|63.9|71.6|77.9|84.7| | | | |
| | |τ̂4|82.5|91.7|97.8|93.7|95.9|97.3|96.4| | | | |
| | |τ̂5|45.3|59.2|76.2|63.0|70.5|76.4|84.0| | | | |
| |3|τ̂1|65.0|79.9|93.6|83.9|90.3|93.5|97.5|1.0|0.2|39.8|76.6|
| | |τ̂2|80.3|91.2|97.4|93.7|96.3|97.2|94.4| | | | |
| | |τ̂3|43.5|56.9|74.1|60.2|68.9|74.3|83.1| | | | |
| | |τ̂4|78.6|90.0|97.3|92.7|96.1|96.9|94.8| | | | |
| | |τ̂5|42.1|55.8|73.3|58.9|67.5|73.3|82.4| | | | |
|500|0|τ̂1|74.8|88.2|97.0|93.1|96.3|98.8|99.7|66.0|87.7|57.9|93.7|
| | |τ̂2|86.5|94.0|99.3|97.0|98.7|99.8|99.9| | | | |
| | |τ̂3|62.0|73.9|88.0|80.8|84.9|93.0|96.1| | | | |
| | |τ̂4|90.2|96.4|99.5|98.7|99.4|100.0|100.0| | | | |
| | |τ̂5|73.2|83.7|94.6|89.8|94.0|97.2|98.3| | | | |
| |1|τ̂1|72.7|84.6|95.6|90.6|94.4|97.9|99.1|18.7|16.0|53.6|88.5|
| | |τ̂2|82.6|92.1|98.2|95.7|97.9|99.2|99.8| | | | |
| | |τ̂3|57.7|69.9|87.1|79.9|85.2|92.4|95.1| | | | |
| | |τ̂4|87.6|94.8|99.0|97.5|98.7|99.7|99.9| | | | |
| | |τ̂5|69.7|81.2|94.3|88.6|92.5|96.4|97.9| | | | |
| |2|τ̂1|70.2|84.3|95.6|91.2|94.3|96.9|98.8|5.0|3.2|51.6|88.8|
| | |τ̂2|82.2|91.6|97.6|95.4|97.4|98.8|99.6| | | | |
| | |τ̂3|55.6|68.2|86.9|77.6|84.6|92.1|94.4| | | | |
| | |τ̂4|87.2|94.1|98.9|96.9|98.3|99.7|100.0| | | | |
| | |τ̂5|67.4|81.4|92.6|88.1|91.9|95.6|97.8| | | | |
| |3|τ̂1|68.6|80.4|95.0|88.4|93.3|97.6|98.9|2.1|0.7|47.3|86.5|
| | |τ̂2|78.9|90.8|97.9|95.3|97.4|99.1|99.8| | | | |
| | |τ̂3|53.9|66.0|84.2|74.2|81.9|89.2|94.1| | | | |
| | |τ̂4|83.5|93.8|98.9|97.1|98.4|99.8|100.0| | | | |
| | |τ̂5|65.6|76.8|92.1|84.0|90.0|96.5|98.1| | | | |
|700|0|τ̂1|70.4|83.1|96.7|92.6|96.5|99.6|99.8|67.0|89.1|60.0|95.6|
| | |τ̂2|87.8|96.0|99.7|99.0|99.7|100.0|100.0| | | | |
| | |τ̂3|73.4|85.4|95.8|92.2|95.4|99.0|99.8| | | | |
| | |τ̂4|87.2|95.8|99.7|98.8|99.7|99.9|100.0| | | | |
| | |τ̂5|72.8|85.2|95.4|91.9|95.2|99.0|99.8| | | | |
| |1|τ̂1|70.1|82.5|95.0|91.1|94.4|99.3|99.6|26.0|26.2|57.8|90.9|
| | |τ̂2|86.8|95.0|99.5|98.4|99.2|100.0|100.0| | | | |
| | |τ̂3|72.8|83.6|95.8|91.9|95.2|99.2|99.9| | | | |
| | |τ̂4|86.5|94.6|99.2|98.2|99.2|100.0|100.0| | | | |
| | |τ̂5|72.1|83.2|95.4|91.7|95.0|99.1|99.9| | | | |
| |2|τ̂1|68.1|81.8|94.9|91.0|94.7|99.0|99.6|9.4|5.2|53.1|92.1|
| | |τ̂2|86.8|95.0|99.5|98.6|99.4|100.0|100.0| | | | |
| | |τ̂3|68.4|83.7|95.6|91.8|95.4|98.7|99.3| | | | |
| | |τ̂4|86.1|94.5|99.5|98.4|99.4|100.0|100.0| | | | |
| | |τ̂5|67.9|83.0|95.5|91.6|95.1|98.7|99.2| | | | |
| |3|τ̂1|66.7|80.7|93.5|89.5|93.1|98.1|99.1|4.6|0.8|55.1|89.9|
| | |τ̂2|84.5|93.7|98.4|97.4|98.4|99.9|100.0| | | | |
| | |τ̂3|70.1|82.4|93.7|89.7|93.1|98.4|98.9| | | | |
| | |τ̂4|84.0|92.9|98.4|97.4|98.3|99.9|100.0| | | | |
| | |τ̂5|69.7|82.0|93.5|89.5|92.8|98.3|98.9| | | | |
|1000|0|τ̂1|71.9|83.6|96.4|94.1|96.6|99.8|99.8|68.2|93.1|63.7|95.3|
| | |τ̂2|85.7|94.7|99.4|98.9|99.5|100.0|100.0| | | | |
| | |τ̂3|73.3|84.7|95.8|94.1|96.2|99.6|99.8| | | | |
| | |τ̂4|88.9|96.1|99.7|99.4|99.8|100.0|100.0| | | | |
| | |τ̂5|73.2|84.4|95.6|93.8|96.0|99.6|99.8| | | | |
| |1|τ̂1|72.8|85.3|97.0|95.8|97.5|99.8|99.8|38.4|37.7|62.7|95.7|
| | |τ̂2|88.9|96.3|99.7|99.2|99.7|99.9|100.0| | | | |
| | |τ̂3|75.5|87.3|97.5|95.5|97.7|99.5|99.9| | | | |
| | |τ̂4|90.3|97.1|99.7|99.5|99.8|100.0|100.0| | | | |
| | |τ̂5|75.2|87.0|97.5|95.5|97.6|99.5|99.9| | | | |
| |2|τ̂1|71.3|83.1|95.4|93.5|95.9|99.8|99.9|18.8|11.1|63.6|94.5|
| | |τ̂2|85.1|93.8|99.3|98.9|99.6|100.0|100.0| | | | |
| | |τ̂3|72.4|84.8|96.0|93.9|96.5|99.5|99.7| | | | |
| | |τ̂4|87.5|95.5|99.7|99.3|99.7|100.0|100.0| | | | |
| | |τ̂5|72.0|84.2|96.0|93.8|96.5|99.5|99.7| | | | |
| |3|τ̂1|68.9|82.1|94.9|91.9|95.1|99.2|99.7|11.3|3.5|59.6|92.2|
| | |τ̂2|84.1|93.6|98.5|97.9|98.8|99.9|99.9| | | | |
| | |τ̂3|70.3|82.8|94.9|92.1|95.4|99.3|99.6| | | | |
| | |τ̂4|87.1|95.0|99.0|98.6|99.1|99.9|100.0| | | | |
| | |τ̂5|69.9|82.5|94.6|91.9|95.2|99.3|99.6| | | | |
|2000|0|τ̂1|69.6|83.1|95.9|95.6|97.2|99.6|99.9|69.7|94.2|66.1|96.8|
| | |τ̂2|88.5|96.2|99.3|99.2|99.3|100.0|100.0| | | | |
| | |τ̂3|79.7|89.4|97.9|97.7|98.7|100.0|100.0| | | | |
| | |τ̂4|88.4|96.1|99.3|99.1|99.3|100.0|100.0| | | | |
| | |τ̂5|79.5|89.2|97.9|97.6|98.7|100.0|100.0| | | | |
| |1|τ̂1|72.1|85.9|96.7|96.6|97.9|99.9|99.9|60.4|59.4|67.8|97.3|
| | |τ̂2|91.6|96.9|99.6|99.6|99.8|100.0|100.0| | | | |
| | |τ̂3|81.3|91.6|98.4|98.3|98.7|100.0|100.0| | | | |
| | |τ̂4|91.5|96.9|99.6|99.6|99.8|100.0|100.0| | | | |
| | |τ̂5|81.0|91.6|98.4|98.3|98.7|100.0|100.0| | | | |
| |2|τ̂1|71.7|84.0|95.4|95.1|97.5|99.9|99.9|48.8|39.2|66.9|95.5|
| | |τ̂2|89.3|96.1|99.4|99.3|99.9|100.0|100.0| | | | |
| | |τ̂3|82.4|91.4|98.0|98.0|98.8|99.9|100.0| | | | |
| | |τ̂4|89.2|96.0|99.3|99.3|99.9|100.0|100.0| | | | |
| | |τ̂5|82.3|91.3|98.0|97.9|98.7|99.9|100.0| | | | |
| |3|τ̂1|72.3|84.4|96.6|96.2|97.9|100.0|100.0|41.9|19.5|64.1|96.7|
| | |τ̂2|90.2|96.9|99.9|99.8|100.0|100.0|100.0| | | | |
| | |τ̂3|80.2|90.7|98.2|98.1|99.4|100.0|100.0| | | | |
| | |τ̂4|89.7|96.9|99.9|99.7|100.0|100.0|100.0| | | | |
| | |τ̂5|80.1|90.7|98.2|98.1|99.4|100.0|100.0| | | | |

### Table 4

Hald Cement data (original)

CRp is reported under penalties P1–P7, followed by AIC, BIC, RAIC, and RBIC.

| Sr. No. | Submodel | P1 | P2 | P3 | P4 | P5 | P6 | P7 | AIC | BIC | RAIC | RBIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|1|X1|22.3272|24.3272|21.0997|23.4571|25.4571|25.5383|29.5383|102.4119|104.1067|12.2956|7.1850|
|2|X2|17.3189|19.3189|16.0915|18.4488|20.4488|20.5300|24.5300|98.0704|99.7652|16.0992|9.1288|
|3|X3|27.1334|29.1334|25.9060|28.2633|30.2633|30.3445|34.3445|107.9598|109.6547|10.8062|6.7160|
|4|X4|17.4910|19.4910|16.2636|18.6209|20.6209|20.7021|24.7021|97.7440|99.4389|12.4251|7.4164|
|5|X1, X2|6.9781|9.9781|7.5698|8.6730|11.6730|11.7948|17.7948|64.3124|66.5722|15.5191|9.1283|
|6|X1, X3|26.1143|29.1143|26.7060|27.8091|30.8091|30.9309|36.9309|104.0091|106.2689|11.2779|7.7057|
|7|X1, X4|7.0266|10.0266|7.6183|8.7215|11.7215|11.8433|17.8433|67.6341|69.8939|12.1404|8.1164|
|8|X2, X3|14.2547|17.2547|14.8464|15.9496|18.9496|19.0714|25.0714|89.9295|92.1893|18.9980|10.7771|
|9|X2, X4|20.6127|23.6127|21.2043|22.3075|25.3075|25.4293|31.4293|99.5217|101.7815|16.9889|10.1520|
|10|X3, X4|9.7554|12.7554|10.3470|11.4502|14.4502|14.5720|20.5720|78.7450|81.0048|29.2977|14.4117|
|11|X1, X2, X3|8.0552|12.0552|11.1455|10.3150|14.3150|14.4774|22.4774|63.9036|66.7283|15.3061|9.9429|
|12|X1, X2, X4|8.2061|12.2061|11.2965|10.4659|14.4659|14.6283|22.6283|63.8663|66.6910|14.5849|9.6446|
|13|X1, X3, X4|8.3405|12.3405|11.4308|10.6003|14.6003|14.7627|22.7627|64.6200|67.4447|11.8174|8.7352|
|14|X2, X3, X4|8.9205|12.9205|12.0109|11.1803|15.1803|15.3427|23.3427|69.4683|72.2930|18.9425|10.9302|
|15|X1, X2, X3, X4|10.0000|15.0000|16.0944|12.8247|17.8247|18.0278|28.0278|65.8367|69.2264|18.6140|11.9214|

### Table 5

Hald Cement data (with outlier, y6 = 200)

CRp is reported under penalties P1–P7, followed by AIC, BIC, RAIC, and RBIC.

| Sr. No. | Submodel | P1 | P2 | P3 | P4 | P5 | P6 | P7 | AIC | BIC | RAIC | RBIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|1|X1|22.3272|24.3272|21.0997|23.4571|25.4571|25.5383|29.5383|129.1893|130.8842|25.6844|14.0531|
|2|X2|17.3189|19.3189|16.0915|18.4488|20.4488|20.5300|24.5300|129.1579|130.8527|48.6441|25.1826|
|3|X3|27.1334|29.1334|25.9060|28.2633|30.2633|30.3445|34.3445|130.8619|132.5567|26.0959|14.0554|
|4|X4|17.4910|19.4910|16.2636|18.6209|20.6209|20.7021|24.7021|128.9758|130.6706|35.3433|18.6954|
|5|X1, X2|6.9781|9.9781|7.5698|8.6730|11.6730|11.7948|17.7948|128.5246|130.7844|120.7875|61.7610|
|6|X1, X3|26.1143|29.1143|26.7060|27.8091|30.8091|30.9309|36.9309|131.0793|133.3391|28.2995|15.8754|
|7|X1, X4|7.0266|10.0266|7.6183|8.7215|11.7215|11.8433|17.8433|128.4488|130.7086|95.9106|49.7983|
|8|X2, X3|14.2547|17.2547|14.8464|15.9496|18.9496|19.0714|25.0714|129.7412|132.0010|66.9138|34.7348|
|9|X2, X4|20.6127|23.6127|21.2043|22.3075|25.3075|25.4293|31.4293|130.9744|133.2342|48.8001|25.7217|
|10|X3, X4|9.7554|12.7554|10.3470|11.4502|14.4502|14.5720|20.5720|128.9457|131.2055|122.8992|61.2152|
|11|X1, X2, X3|8.0552|12.0552|11.1455|10.3150|14.3150|14.4774|22.4774|130.4785|133.3033|126.1412|65.3541|
|12|X1, X2, X4|8.2061|12.2061|11.2965|10.4659|14.4659|14.6283|22.6283|130.4350|133.2597|121.2171|62.9619|
|13|X1, X3, X4|8.3405|12.3405|11.4308|10.6003|14.6003|14.7627|22.7627|130.4121|133.2369|103.8295|54.7412|
|14|X2, X3, X4|8.9205|12.9205|12.0108|11.1803|15.1803|15.3427|23.3427|130.3519|133.1767|117.7132|60.3156|
|15|X1, X2, X3, X4|10.0000|15.0000|16.0944|12.8247|17.8247|18.0278|28.0278|132.3519|135.7416|137.6374|71.4468|

### Table 6

Performance of algorithms in presence of outliers

Each algorithm (Kick-off, Sequential, Stepwise) is evaluated under penalties P4–P7. Entries are percentages.

| n (k/n) | % of outliers | Kick-off P4 | P5 | P6 | P7 | Sequential P4 | P5 | P6 | P7 | Stepwise P4 | P5 | P6 | P7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|200 (1/2)|0%|82.5|91.4|100.0|100.0|96.3|96.3|96.4|96.4|99.9|99.9|100.0|100.0|
| |2%|68.8|80.8|99.7|100.0|94.4|94.6|94.6|94.6|99.8|100.0|100.0|100.0|
| |4%|60.0|75.2|99.4|99.8|96.3|96.5|96.7|96.7|99.6|99.8|100.0|100.0|
| |6%|56.8|71.2|98.8|99.3|94.7|95.7|96.0|95.9|98.7|99.7|100.0|99.9|
| |8%|57.7|72.2|99.4|99.9|93.8|93.9|94.0|93.7|99.8|99.9|100.0|99.7|
| |10%|58.4|73.1|98.4|97.9|94.0|94.4|94.2|92.9|99.6|100.0|99.8|98.3|
|300 (1/3)|0%|85.4|93.7|100.0|100.0|98.8|99.3|99.6|99.6|99.2|99.7|100.0|100.0|
| |2%|79.9|87.9|100.0|100.0|98.2|98.9|98.9|98.9|99.3|100.0|100.0|100.0|
| |4%|75.0|86.5|100.0|100.0|97.2|98.2|98.8|98.8|98.4|99.4|100.0|100.0|
| |6%|67.6|80.5|100.0|100.0|97.4|98.4|99.4|99.4|98.0|99.0|100.0|100.0|
| |8%|65.9|79.3|99.9|100.0|97.6|99.1|99.6|99.6|98.0|99.5|100.0|100.0|
| |10%|61.2|75.6|100.0|100.0|97.3|98.6|99.5|99.5|97.8|99.1|100.0|100.0|
|400 (1/4)|0%|90.2|95.6|100.0|100.0|98.4|99.2|99.8|99.8|98.6|99.4|100.0|100.0|
| |2%|83.9|92.1|100.0|100.0|97.8|99.3|100.0|100.0|97.8|99.3|100.0|100.0|
| |4%|79.1|89.7|100.0|100.0|97.0|99.2|99.9|99.9|97.1|99.3|100.0|100.0|
| |6%|75.6|85.9|100.0|100.0|96.3|98.2|99.7|99.7|96.5|98.5|100.0|100.0|
| |8%|70.1|83.2|100.0|100.0|97.0|98.2|99.7|99.7|97.2|98.5|100.0|100.0|
| |10%|66.0|79.2|100.0|100.0|95.3|98.2|99.7|99.7|95.6|98.5|100.0|100.0|
|500 (1/5)|0%|93.7|97.6|100.0|100.0|98.5|99.4|100.0|100.0|98.5|99.4|100.0|100.0|
| |2%|88.3|95.3|100.0|100.0|97.7|99.0|100.0|100.0|97.7|99.0|100.0|100.0|
| |4%|88.3|94.3|100.0|100.0|98.0|99.4|100.0|100.0|98.0|99.4|100.0|100.0|
| |6%|83.1|89.5|100.0|100.0|96.5|98.7|100.0|100.0|96.5|98.7|100.0|100.0|
| |8%|77.6|88.3|100.0|100.0|96.0|99.2|100.0|100.0|96.0|99.2|100.0|100.0|
| |10%|75.1|85.8|100.0|100.0|94.0|97.5|100.0|100.0|94.0|97.5|100.0|100.0|

### Table 7

Performance of algorithms in presence of non-normal errors

Each algorithm (Kick-off, Sequential, Stepwise) is evaluated under penalties P4–P7. Entries are percentages.

| n (k/n) | Distribution of error | Kick-off P4 | P5 | P6 | P7 | Sequential P4 | P5 | P6 | P7 | Stepwise P4 | P5 | P6 | P7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|200 (1/2)|N(0, 3)|81.8|88.7|89.1|80.6|95.1|95.1|79.7|62.9|99.8|99.8|82.1|64.0|
| |0.95N(0, 1) + 0.05N(0, 3)|81.0|91.0|100.0|100.0|95.8|95.8|95.8|95.8|99.9|99.9|99.9|99.9|
| |0.9N(0, 1) + 0.1N(0, 3)|82.4|90.2|99.9|99.9|96.8|97.0|97.0|96.9|99.8|100.0|100.0|99.9|
| |t2|83.4|91.3|94.1|89.3|95.9|95.9|88.4|76.8|100.0|100.0|90.1|77.2|
| |Slash|20.0|15.2|0.4|0.2|64.3|47.6|0.5|0.4|70.8|50.6|0.4|0.4|
| |Cauchy (0, 1)|59.9|54.5|7.9|4.8|87.1|82.2|10.3|4.6|97.3|91.6|10.5|4.1|
| |Laplace (0, 1)|82.4|91.0|99.8|99.5|96.8|96.8|96.7|96.4|100.0|100.0|99.9|99.1|
|300 (1/3)|N(0, 3)|88.4|94.3|100.0|100.0|98.5|99.0|99.4|99.4|99.1|99.6|100.0|100.0|
| |0.95N(0, 1) + 0.05N(0, 3)|86.4|92.4|100.0|100.0|98.8|99.4|99.9|99.9|98.9|99.5|100.0|100.0|
| |0.9N(0, 1) + 0.1N(0, 3)|85.0|91.4|100.0|100.0|98.9|99.4|99.6|99.6|99.3|99.8|100.0|100.0|
| |t2|88.9|95.4|100.0|100.0|99.2|99.3|99.4|99.4|99.8|99.9|100.0|100.0|
| |Slash|88.4|93.8|33.8|23.3|98.0|98.0|58.5|39.7|100.0|100.0|56.8|37.0|
| |Cauchy (0, 1)|87.6|94.1|94.2|90.4|97.1|97.1|95.5|92.5|100.0|100.0|98.3|95.0|
| |Laplace (0, 1)|89.7|95.5|100.0|100.0|99.4|99.6|99.8|99.8|99.6|99.8|100.0|100.0|
|400 (1/4)|N(0, 3)|91.5|96.5|100.0|100.0|98.7|99.2|99.9|99.9|98.8|99.3|100.0|100.0|
| |0.95N(0, 1) + 0.05N(0, 3)|90.3|95.6|100.0|100.0|98.9|99.8|100.0|100.0|98.9|99.8|100.0|100.0|
| |0.9N(0, 1) + 0.1N(0, 3)|90.4|95.2|100.0|100.0|98.8|99.7|99.9|99.9|98.9|99.8|100.0|100.0|
| |t2|91.2|95.9|100.0|100.0|99.8|100.0|100.0|100.0|99.8|100.0|100.0|100.0|
| |Slash|92.4|97.0|93.3|89.2|99.7|99.7|98.3|95.9|100.0|100.0|98.3|95.7|
| |Cauchy (0, 1)|91.8|95.5|100.0|100.0|99.9|99.9|99.9|99.9|100.0|100.0|100.0|100.0|
| |Laplace (0, 1)|92.5|96.4|100.0|100.0|99.6|99.8|100.0|100.0|99.6|99.8|100.0|100.0|
|500 (1/5)|N(0, 3)|92.9|97.2|100.0|100.0|98.5|99.4|100.0|100.0|98.5|99.4|100.0|100.0|
| |0.95N(0, 1) + 0.05N(0, 3)|92.8|97.1|100.0|100.0|98.3|99.4|100.0|100.0|98.3|99.4|100.0|100.0|
| |0.9N(0, 1) + 0.1N(0, 3)|92.6|97.0|100.0|100.0|98.7|99.8|100.0|100.0|98.7|99.8|100.0|100.0|
| |t2|93.4|97.4|100.0|100.0|99.6|99.9|100.0|100.0|99.6|99.9|100.0|100.0|
| |Slash|95.3|98.2|99.7|99.4|99.7|99.8|99.9|99.9|99.8|99.9|100.0|100.0|
| |Cauchy (0, 1)|93.9|97.4|100.0|100.0|100.0|100.0|100.0|100.0|100.0|100.0|100.0|100.0|
| |Laplace (0, 1)|93.3|97.2|100.0|100.0|99.6|99.8|100.0|100.0|99.6|99.8|100.0|100.0|

### Table 8

Performance of algorithms in presence of multicollinearity

Each algorithm (Kick-off, Sequential, Stepwise) is evaluated under penalties P4–P7; ρij denotes the pairwise correlation among predictors. Entries are percentages.

| n (k/n) | ρij | Kick-off P4 | P5 | P6 | P7 | Sequential P4 | P5 | P6 | P7 | Stepwise P4 | P5 | P6 | P7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|200 (1/2)|0.00|81.1|90.2|99.8|100.0|99.7|99.9|95.1|87.6|99.8|100.0|94.9|87.6|
| |0.50|80.5|89.0|99.9|99.9|87.8|87.9|87.9|87.9|99.9|100.0|100.0|99.8|
| |0.55|80.5|90.5|100.0|99.6|87.0|87.3|87.3|86.3|99.6|99.9|99.8|98.6|
| |0.60|81.5|90.2|99.7|99.3|85.0|85.1|84.8|84.0|99.9|100.0|99.6|98.5|
| |0.65|80.8|90.2|99.1|97.9|84.2|84.2|84.1|82.5|99.9|99.9|99.6|97.4|
| |0.70|79.8|88.7|97.7|93.7|81.7|81.8|80.5|76.6|99.8|99.9|98.5|92.6|
| |0.75|81.7|90.4|89.8|80.2|81.7|81.9|75.6|67.2|99.7|100.0|91.5|78.7|
| |0.80|80.5|89.6|71.4|57.5|76.3|76.4|61.7|47.4|99.8|99.9|75.9|55.7|
| |0.90|61.0|54.5|6.0|2.7|71.9|66.3|4.0|1.6|93.3|83.2|2.6|0.8|
|300 (1/3)|0.00|85.8|94.8|100.0|100.0|99.1|99.7|100.0|100.0|99.1|99.7|100.0|100.0|
| |0.50|86.2|92.8|99.9|100.0|97.3|98.1|98.2|98.2|99.1|99.9|100.0|100.0|
| |0.55|86.0|91.7|100.0|100.0|96.4|96.9|97.0|97.0|99.4|99.9|100.0|100.0|
| |0.60|86.6|93.8|100.0|100.0|96.2|96.8|96.9|96.9|99.2|99.8|100.0|100.0|
| |0.65|87.8|93.9|100.0|100.0|95.7|96.0|96.2|96.2|99.5|99.8|100.0|100.0|
| |0.70|87.3|93.8|100.0|100.0|95.7|95.8|96.2|96.2|99.4|99.6|100.0|100.0|
| |0.75|88.7|94.6|99.9|99.9|95.6|96.2|96.4|96.4|99.2|99.8|100.0|100.0|
| |0.80|85.5|92.9|99.6|99.1|92.6|93.9|94.1|94.0|98.4|99.8|100.0|99.9|
| |0.90|85.1|92.0|54.2|39.7|92.7|93.5|70.6|54.7|98.8|99.6|64.6|46.6|
|400 (1/4)|0.00|89.8|93.7|100.0|100.0|98.5|99.4|100.0|100.0|98.5|99.4|100.0|100.0|
| |0.50|91.5|96.6|100.0|100.0|98.8|99.5|99.8|99.8|99.0|99.7|100.0|100.0|
| |0.55|88.6|94.4|100.0|100.0|98.2|98.9|99.2|99.2|99.0|99.7|100.0|100.0|
| |0.60|92.0|95.9|100.0|100.0|97.6|98.9|99.1|99.1|98.5|99.8|100.0|100.0|
| |0.65|90.5|94.1|100.0|100.0|97.1|97.9|98.3|98.3|98.8|99.6|100.0|100.0|
| |0.70|91.1|96.4|100.0|100.0|98.1|98.6|99.1|99.1|99.0|99.5|100.0|100.0|
| |0.75|89.8|96.0|100.0|100.0|98.2|98.8|99.2|99.2|99.0|99.6|100.0|100.0|
| |0.80|88.5|94.1|100.0|100.0|97.5|98.6|98.8|98.8|98.7|99.8|100.0|100.0|
| |0.90|89.5|94.8|90.2|83.3|96.8|97.7|95.7|93.3|99.0|99.9|96.9|93.1|
|500 (1/5)|0.00|93.2|96.2|100.0|100.0|99.1|99.6|100.0|100.0|99.1|99.6|100.0|100.0|
| |0.50|93.7|97.4|100.0|100.0|98.6|99.7|100.0|100.0|98.6|99.7|100.0|100.0|
| |0.55|94.3|97.7|100.0|100.0|98.9|99.4|99.9|99.9|99.0|99.5|100.0|100.0|
| |0.60|92.7|96.2|100.0|100.0|98.0|99.2|99.8|99.8|98.2|99.4|100.0|100.0|
| |0.65|93.1|96.7|100.0|100.0|98.4|99.4|99.6|99.6|98.8|99.8|100.0|100.0|
| |0.70|94.6|97.7|100.0|100.0|99.0|99.7|100.0|100.0|99.0|99.7|100.0|100.0|
| |0.75|92.6|96.6|100.0|100.0|98.1|99.2|99.5|99.5|98.6|99.7|100.0|100.0|
| |0.80|92.9|96.2|100.0|100.0|98.3|99.1|99.4|99.4|98.9|99.7|100.0|100.0|
| |0.90|93.0|97.2|98.7|97.8|97.9|99.1|99.3|99.0|98.5|99.7|99.9|99.5|

### Table 9

Selected predictor variables in the body fat data

Candidate predictors: age (yrs), weight (lbs), height (inches), adiposity index (kg/m2), fat free weight (lbs), and circumference measures (cm): neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, wrist.

Criteria compared: CRp (CRp4–CRp7), AIC, BIC, RAIC, RBIC; Kick-Off algorithm (KOp4–KOp7); sequential algorithm (Sp4–Sp7); stepwise algorithm (STp4–STp7).

[Marks indicating which predictors each criterion selected are not recoverable from the source.]

References
1. Akaike H (1973). Information theory and an extension of maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory, Akademiai Kiado, Budapest, 267–281.
2. Birkes D and Dodge Y (1993). Alternative Methods of Regression, Wiley, New York.
3. Dielman TE (2005). Least absolute value regression: recent contributions. Journal of Statistical Computation and Simulation, 75, 263-286.
4. Dielman TE (2006). Variance estimates and hypothesis tests in least absolute value regression. Journal of Statistical Computation and Simulation, 76, 103-114.
5. Gilmour SG (1995). The interpretation of Mallows’s Cp-statistic. Journal of the Royal Statistical Society, Series D (The Statistician), 45, 49-56.
6. Kashid DN and Kulkarni SR (2002). A more general criterion for subset selection in multiple linear regression. Communications in Statistics - Theory and Methods, 31, 795-811.
7. Kim C and Hwang S (2000). Influence subsets on the variable selection. Communications in Statistics - Theory and Methods, 29, 335-347.
8. Machado JAF (1993). Robust model selection and M-estimation. Econometric Theory, 9, 478-493.
9. Mallows C (1973). Some comments on Cp. Technometrics, 15, 661-675.
10. Rao CR and Wu Y (1989). A strong consistent procedure for model selection in a regression model. Biometrika, 76, 369-374.
11. Rao CR, Wu Y, Konishi S, et al. (2001). On model selection. Lecture Notes-Monograph Series, 38, 1-64.
12. Ronchetti E (1985). Robust model selection in regression. Statistics and Probability Letters, 3, 21-23.
13. Ronchetti E and Staudte RG (1994). A robust version of Mallows’s Cp. Journal of the American Statistical Association, 89, 550-559.
14. Schwarz G (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
15. Siniksaran E (2008). A geometric interpretation of Mallows’ Cp statistic and an alternative plot in variable selection. Computational Statistics and Data Analysis, 52, 3459-3467.
16. Tharmaratnam K and Claeskens G (2013). A comparison of robust versions of the AIC based on M, S and MM-estimators. Statistics: A Journal of Theoretical and Applied Statistics, 47, 216-235.
17. Yamashita T, Yamashita K, and Kamimura R (2007). A stepwise AIC method for variable selection in linear regression. Communications in Statistics - Theory and Methods, 36, 2395-2403.