Selecting variables and detecting outliers are important issues in regression diagnostics. Many methods have been suggested that address these issues separately. However, it is well known that the outcome of model selection depends on the order in which variable selection and outlier identification are performed (Adams, 1991; Blettner and Sauerbrei, 1993; Hoeting et al.).
Model comparisons are complicated in the presence of outliers. If outlier detection methods are applied and some observations are excluded from the data before estimating models, the two models to be compared usually become non-nested. The likelihood ratio test is not available for non-nested models because the exact distribution of the test statistic depends on unknown parameters. Many approaches have been suggested for testing non-nested models (Efron, 1984; Vuong, 1989; Royston and Thompson, 1995; Godfrey, 1998); most use the asymptotic or empirical distribution of the test statistic under general conditions. As an alternative to formal tests, measures quantifying the relative quality of models for a given data set can be compared. The Akaike information criterion (AIC) (Akaike, 1973) uses the log likelihood, adds a penalty for the number of parameters, and permits comparison between non-nested models. The Bayesian information criterion (BIC) (Schwarz, 1978) was suggested to overcome the inconsistency of AIC. Neither AIC nor BIC, however, is appropriate for comparing models fitted to different numbers of observations: both depend on the sample size and tend to increase with it.
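The sample-size dependence is easy to see from the Gaussian-likelihood forms of the criteria. The helper below, `gaussian_ic`, is an illustrative sketch (not from the paper) that computes AIC and BIC from the residual sum of squares of a linear model:

```python
import numpy as np

def gaussian_ic(rss, n, k):
    """AIC and BIC for a linear model with Gaussian errors.

    rss: residual sum of squares, n: sample size,
    k: number of estimated parameters.
    Uses the profiled log likelihood -n/2 * (log(2*pi*rss/n) + 1).
    """
    loglik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    aic = 2.0 * k - 2.0 * loglik
    bic = k * np.log(n) - 2.0 * loglik
    return aic, bic
```

Holding the per-observation fit fixed while increasing $n$ (e.g. `gaussian_ic(10.0, 20, 3)` versus `gaussian_ic(100.0, 200, 3)`) makes both criteria grow, which is why they cannot be compared across models fitted to data sets of different sizes.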
Many methods have been proposed for simultaneous variable selection and outlier identification. Some of them require a set of observations, identified in advance, to be treated as potential outliers (Hoeting et al.).
We suggest a method that does not need potential outliers to be set in advance, because it detects outliers within each model of all possible subset regressions. The sequential method of Hadi and Simonoff (1993) is used to identify outliers; it is not computationally intensive and is resistant to masking and swamping effects. It starts with an initial clean subset of observations and increases the size of the clean subset until the remaining observations are declared outliers. Outlyingness relative to the subset with the minimum residual sum of squares is determined by a t-test using the order statistics of internally studentized residuals. A unified procedure for variable selection and outlier detection is outlined as follows.
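The forward search of Hadi and Simonoff can be sketched as follows. This is a simplified reading for illustration only: a fixed cutoff replaces the Bonferroni-adjusted t-quantile, and a crude initial clean subset (smallest full-data residuals) replaces the authors' more careful starting subset.

```python
import numpy as np

def hadi_simonoff(X, y, cutoff=2.58):
    """Simplified sketch of the Hadi-Simonoff forward outlier search.

    X: (n, p) predictor matrix (no intercept column); y: (n,) response.
    Returns the indices of observations declared outliers.
    """
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    # crude initial clean subset: p + 2 smallest absolute residuals
    # from a full-data LS fit
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    clean = np.argsort(np.abs(y - Xc @ beta))[: p + 2]
    while len(clean) < n:
        beta, *_ = np.linalg.lstsq(Xc[clean], y[clean], rcond=None)
        res = y - Xc @ beta
        s = np.sqrt(np.sum(res[clean] ** 2) / max(len(clean) - p - 1, 1))
        s = max(s, 1e-12)                      # guard against a perfect fit
        outside = np.setdiff1d(np.arange(n), clean)
        d = np.abs(res) / s                    # scaled residuals
        best = outside[np.argmin(d[outside])]  # least outlying outside point
        if d[best] > cutoff:
            break                              # remaining points are outliers
        clean = np.append(clean, best)         # grow the clean subset
    return np.setdiff1d(np.arange(n), clean)
```

Growing the clean subset one observation at a time, always by the least outlying candidate, is what gives the procedure its resistance to masking and swamping.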
1. Set up a one-predictor model and perform Hadi and Simonoff's procedure to identify outliers.
2. Fit the model again after deleting the identified outliers and calculate the adjusted-$R^2$.
3. Repeat steps 1 and 2 for all one-predictor models.
4. Determine the best one-predictor model based on adjusted-$R^2$.
5. Find the best $k$-predictor model for each $k = 2, \ldots, p$ in the same way.
6. Finally, determine the best model among the best models of each size based on adjusted-$R^2$.
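The steps above amount to scoring every subset of predictors by adjusted-$R^2$ computed after outlier deletion. A minimal numpy sketch follows; `adj_r2`, `best_subset`, and the `find_outliers` callable (standing in for a Hadi-Simonoff routine) are illustrative names, not the paper's implementation:

```python
import numpy as np
from itertools import combinations

def adj_r2(X, y):
    """Adjusted R^2 of an LS fit with intercept."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    rss = np.sum((y - Xc @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))

def best_subset(X, y, find_outliers):
    """All-possible-subsets search: for each subset of predictors,
    detect and drop outliers, refit, and keep the subset with the
    largest adjusted R^2."""
    n, p = X.shape
    best = (-np.inf, None, None)
    for k in range(1, p + 1):
        for sub in combinations(range(p), k):
            Xs = X[:, sub]
            out = find_outliers(Xs, y)             # indices to delete
            keep = np.setdiff1d(np.arange(n), out)
            score = adj_r2(Xs[keep], y[keep])
            if score > best[0]:
                best = (score, sub, tuple(out))
    return best  # (adjusted R^2, variable subset, outliers)
```

Because outliers are re-identified inside every candidate model, no observation has to be flagged as a potential outlier before selection begins.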
We provide several examples with the same data sets as those used by Kim et al.
The Stackloss data (Brownlee, 1965) have 21 observations from a plant for the oxidation of ammonia as a stage in the production of nitric acid. They consist of three independent variables, air flow, cooling water inlet temperature, and acid concentration, with stack loss as the response.
The Scottish Hill racing data (Atkinson, 1986) include two independent variables, the distance and the climb of each race, with the record time as the response.
The Modified Wood Gravity data are a five-predictor data set contaminated by replacing cases 4, 6, 8, and 19 with outliers (Rousseeuw, 1984). Observations (4, 6, 7, 8, 11, 19) are identified as potential outliers by many methods, so we use this set of observations as the initial set of outliers in the Hadi-Simonoff procedure. Table 3 shows the results of all possible models; the best model of each size and the best overall model by adjusted-$R^2$ are marked.
Performing all possible regressions requires heavy computation. We suggest a sequential procedure for simultaneous variable selection and outlier detection to reduce the computational work. The sequential procedure finds the best model of the next stage based on the current best model: each variable not involved in the current best model is added to it in turn, and the adjusted-$R^2$ is calculated after identifying and deleting outliers.
Let $X_k$ denote the best set of $k$ predictors found so far. Given $X_k$, the best set of $k+1$ predictors is determined in two steps.

Step 1. Each variable not involved in $X_k$ is added to $X_k$ in turn, and the adjusted-$R^2$ of each enlarged model is calculated after identifying and deleting outliers. The provisional best set of $k+1$ predictors is the candidate set with the largest adjusted-$R^2$.

Step 2. Exchange a variable in the provisional best model with a variable not in the provisional best model, and calculate the adjusted-$R^2$ of each exchanged model after identifying and deleting outliers. The final best set of $k+1$ predictors is the set with the largest adjusted-$R^2$ among the provisional best set and all exchanged sets.
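Steps 1 and 2 can be sketched as follows, with a caller-supplied `score(subset)` callable standing in for "adjusted-$R^2$ after outlier deletion"; `sequential_select` is an illustrative name, not the paper's code:

```python
def sequential_select(X, y, score):
    """Sketch of the sequential add-and-exchange search.

    score(subset) -> adjusted R^2 of the model on `subset` after
    outlier deletion (supplied by the caller).
    Returns [(best subset of size k, its score)] for k = 1..p.
    """
    p = X.shape[1]
    # best one-predictor model
    current = max(([j] for j in range(p)), key=score)
    path = [(tuple(current), score(current))]
    while len(current) < p:
        rest = [j for j in range(p) if j not in current]
        # Step 1: add the best remaining variable (provisional best)
        provisional = max((current + [j] for j in rest), key=score)
        # Step 2: try exchanging each variable in the provisional
        # model with each variable outside it
        candidates = [provisional]
        for i in provisional:
            for j in range(p):
                if j not in provisional:
                    candidates.append(
                        [k for k in provisional if k != i] + [j])
        current = max(candidates, key=score)
        path.append((tuple(sorted(current)), score(current)))
    return path
```

The exchange step lets the search recover from a greedy early addition that a plain forward selection would be stuck with.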
The sequential approach requires far fewer model fits than all possible subset regression: instead of the $2^p - 1$ possible models, only the added and exchanged candidates at each stage are evaluated, so the number of fitted models grows polynomially rather than exponentially in the number of predictors $p$.
Several examples are provided using the same data sets as those used by Kim et al.
The sequential procedure is applied to the Mortality data, which are available in the Data and Story Library of StatLib (StatLib, 1996). The response variable is age-adjusted mortality, and the potential predictors are 14 variables measuring demographic characteristics of the cities, climate characteristics, or the pollution potential of air pollutants. The Mortality data were also used by McCann and Welch (2007), who suggested appending a dummy-variable identity matrix to the design matrix for robustness and using LARS for variable selection. They proposed three algorithms. The first, called "LARSD-T", fits an LS regression on the variables provided by LARS and selects the final model using t-statistics for each variable. Their second algorithm starts by taking a sample from the data and selecting variables with LARS. The nested models of the selected variables are constructed in the LARS ordering, and the MAD of the residuals from LARS is calculated for each model. These steps are repeated a large number of times, and the best 1% of models based on MAD are retained; a variable is finally selected if it appears in 50% or more of these top models (LARS-CV). The same algorithm applied to the data appended with dummy variables is "LARSD-CV", which was designed to obtain samples free of leverage outliers and to detect additive outliers.
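Two ingredients of these algorithms are easy to state in code: the dummy-variable augmentation of the design matrix and the MAD of residuals used to rank candidate models. The helpers below are illustrative sketches under those descriptions, not McCann and Welch's implementation (which runs LARS on the augmented matrix):

```python
import numpy as np

def augment_with_dummies(X):
    """Append an identity block so every observation gets its own
    dummy column; a sparse selector such as LARS can then absorb an
    additive outlier into its dummy instead of distorting the fit."""
    n = X.shape[0]
    return np.hstack([X, np.eye(n)])

def mad_residuals(y, yhat):
    """Median absolute deviation of the residuals, a model-quality
    measure that is robust to a minority of outliers."""
    r = y - yhat
    return np.median(np.abs(r - np.median(r)))
```

Ranking models by `mad_residuals` rather than by residual sum of squares is what keeps a few gross outliers from dominating the comparison.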
Table 5 contains the results of McCann and Welch (2007)'s algorithms and the sequential procedure applied to the Mortality data. The variables selected by the three robust algorithms, LARS-CV, LARSD-CV, and LARSD-T, and by the sequential procedure are listed there; the sequential procedure attains the smallest MAD (18.85).
Another meaningful comparison is to check how the best models would score on clean data. The clean version of the Mortality data is obtained by removing 9 points that had a standardized residual with a magnitude of 2.5 or larger after a least trimmed squares (LTS) fit. McCann and Welch (2007) reported the performance of the robust algorithms on this cleaned set. For comparison with McCann and Welch's methods, the significance level in the sequential procedure is adjusted so that the best model also identifies 9 outliers. Table 6 contains the performance of the methods on the cleaned Mortality data.
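The 2.5 cutoff refers to internally studentized residuals. A minimal sketch of their computation follows; for brevity the LTS pre-fit used in the cleaning step is replaced here by an ordinary LS fit, so this illustrates the residual scaling only:

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized residuals of an LS fit with intercept:
    r_i / (s * sqrt(1 - h_ii)), where h_ii are hat-matrix diagonals."""
    n = X.shape[0]
    Xc = np.column_stack([np.ones(n), X])
    H = Xc @ np.linalg.pinv(Xc)          # hat (projection) matrix
    resid = y - H @ y
    p1 = Xc.shape[1]
    s = np.sqrt(resid @ resid / (n - p1))
    return resid / (s * np.sqrt(1.0 - np.diag(H)))
```

An observation is flagged for removal when the absolute value of its studentized residual reaches the 2.5 threshold.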
A unified procedure for variable selection and outlier detection has been provided using all possible subset regressions. The suggested procedure does not depend on the order in which variable selection and outlier identification are performed. It does not require presetting potential outliers prior to selecting variables. Examples with several “benchmark” data sets showed that the proposed method is effective in selecting variables when outliers exist.
A sequential procedure is also suggested that does not investigate all possible models, so the computational burden remains manageable when the number of candidate predictors is large.
This paper was supported by Konkuk University in 2018.
All possible subset regressions for the Stackloss data
Variables | Outliers | Adjusted-$R^2$
---|---|---
 | 4, 21 | 0.9556*
 | | 0.7542
 | 1, 2, 3, 4 | 0.2640
 | 1, 3, 4, 21 | 0.9688*
 | 4, 21 | 0.9553
 | | 0.7449
 | 1, 3, 4, 21 | 0.9692**
^{*}the best model among the models of the same size.
^{**}the best model overall.
All possible subset regressions for the Scottish Hill racing data
Variables | Outliers | Adjusted-$R^2$
---|---|---
 | 7, 18, 33 | 0.9593*
 | 7, 11, 17, 18, 33, 35 | 0.6825
 | 7, 18, 33 | 0.9855**
^{*}the best model among the models of the same size.
^{**}the best model overall.
All possible subset regressions for the Wood Gravity data
Variables | Outliers | Adjusted-$R^2$ | Variables | Outliers | Adjusted-$R^2$
---|---|---|---|---|---
 | | 0.3621 | | | 0.4027
 | | 0.3781 | | | 0.2210
 | 4, 6, 8, 19 | 0.5386* | | |
 | | 0.6699* | | | 0.4715
 | 4, 6, 8, 19 | 0.6577 | | | 0.3883
 | | 0.5137 | | 4, 6, 8, 19 | 0.5044
 | | 0.3562 | | 4, 6, 8, 19 | 0.5711
 | 4, 6, 8, 19 | 0.6213 | | | 0.5159
 | | 0.7499 | | | 0.5275
 | | 0.6586 | | 4, 6, 8, 19 | 0.5906
 | | 0.6836 | | 4, 6, 8, 19 | 0.6716
 | | 0.5489 | | | 0.5101
 | 4, 6, 8, 19 | 0.7517* | | 4, 6, 8, 19 | 0.6699
 | | 0.7451 | | 4, 6, 8, 19 | 0.9421**
 | | 0.7570 | | 4, 6, 8, 19 | 0.7918
 | | 0.6630 | | |
 | 4, 6, 8, 19 | 0.9375* | | |
Observations (4, 6, 7, 8, 11, 19) were used as the initial set of outliers in the Hadi-Simonoff procedure.
^{*}the best model among the models of the same size.
^{**}the best model overall.
Results of the sequential procedure applied to the Wood Gravity data
Variables | Outliers | Adjusted-$R^2$
---|---|---
 | 4, 6, 8, 19 | 0.5386*
 | | 0.6699*
 | 4, 6, 8, 19 | 0.7517*
 | 4, 6, 8, 19 | 0.9421**
 | 4, 6, 8, 19 | 0.9375*
^{*}the best model among the models of the same size.
^{**}the best model overall.
Models selected by various algorithms applied to the Mortality data
Algorithm | Variables selected | MAD
---|---|---
LARS-CV | | 24.40
LARSD-CV | | 19.15
LARSD-T | | 26.88
Sequential Proc | | 18.85
Part of results adopted from McCann and Welch (2007).
Best-subset scores on the cleaned Mortality data

Algorithm | Parameters | MAD | AIC | BIC
---|---|---|---|---
LARS-CV | 6 | 7.82 | 313.24 | 316.32 |
LARSD-CV | 8 | 8.74 | 313.77 | 318.43 |
LARSD-T | 4 | 22.15 | 326.07 | 326.21 |
Sequential Proc | 6 | 5.84 | 298.10 | 311.50 |
Part of results adopted from McCann and Welch (2007).
AIC = Akaike information criterion; BIC = Bayesian information criterion.