In this article, we propose the following approach to simultaneous variable selection and outlier detection. First, we determine possible outlier candidates using properties of an intercept estimator in a difference-based regression model, and the information on these outliers is incorporated into the multiple regression model by adding mean-shift parameters. Second, we select the best model, from the model that includes the outlier candidates as predictors, using stochastic search variable selection. Finally, we evaluate our method through simulations and a real data analysis, which yield promising results. Future work includes developing the method to produce robust estimates and extending it to the nonparametric regression model for simultaneous outlier detection and variable selection.
In multiple linear regression models, a data point that is separated from the others on the vertical axis is called an outlier (Weisberg, 2004). Such outliers have serious effects on inference and model selection (Kahng
Many approaches to the detection of outliers or the selection of variables have been proposed. Most authors have treated these problems separately; however, some have proposed methods that perform outlier detection and variable selection simultaneously. For example, Hoeting
After that, Kim
We use the mean-shift outlier model for outlier candidates. To determine these outlier candidates, we use the properties of an intercept estimator in the difference-based regression model (DBRM) (Choi
The remainder of this paper is organized as follows. In Section 2, after introducing the notation, we describe the mean-shift outlier model and the difference-based regression model (Park and Kim, 2018b) to determine the outlier candidates. In Section 3, we introduce the Bayesian variable selection, SSVS proposed by George and McCulloch (1993). Then, we propose a method that can simultaneously perform outlier detection and variable selection by using SSVS in regression model with outlier candidates. In Sections 4 and 5, we present simulations and a real data example, respectively. Finally, we provide conclusions and recommendations in Section 6.
In this section, we explain how to determine outlier candidates and set up a regression model that includes outlier candidates. We then introduce the difference-based regression model (Choi
Let
We consider the mean-shift outlier model (Belsley
where
In this paper, we use the mean-shift outlier model described above in order to include potential
where
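The mean-shift formulation above can be sketched numerically: appending one indicator column per outlier candidate to the design matrix lets ordinary least squares estimate each shift directly. The data, candidate indices, and shift size below are hypothetical and serve only to illustrate the augmented model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 30 observations, p = 2 predictors (plus intercept).
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.5])
y = X @ beta + rng.normal(scale=0.5, size=n)
y[[3, 17]] += 6.0                 # inject two mean shifts (outliers)

# Suppose indices {3, 17} were flagged as outlier candidates.
candidates = [3, 17]
Z = np.zeros((n, len(candidates)))
for j, i in enumerate(candidates):
    Z[i, j] = 1.0                 # one indicator column per candidate

# Augmented design [X Z]: the delta-hat entries estimate the mean shifts.
W = np.hstack([X, Z])
coef, *_ = np.linalg.lstsq(W, y, rcond=None)
delta_hat = coef[-len(candidates):]
print(delta_hat)                  # each entry should be near the injected shift of 6
```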
Park and Kim (2018b) propose an outlier-detection approach that uses the properties of an intercept estimator in the difference-based regression model. This method uses only the estimated intercepts; it does not require estimating the other parameters in the DBRM. To identify whether the observations are outliers, the DBRM uses a mean-shift outlier model.
In this paper, we describe the DBRM defined by Park and Kim (2018b) using
where −
where
Park and Kim (2018b) estimate intercepts, −
where the
where the
The
Estimate intercepts, −
Sort them in ascending order,
We assume that
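The candidate-determination step described above reduces, schematically, to ordering observations by the magnitude of their estimated intercepts and keeping the largest ones. A minimal sketch with hypothetical intercept estimates (the number of candidates m is a tuning choice, not a value from the paper):

```python
import numpy as np

# Hypothetical absolute intercept estimates from a fitted DBRM, one per
# observation; large values suggest mean shifts (potential outliers).
abs_intercepts = np.array([0.1, 0.3, 4.2, 0.2, 3.8, 0.4, 0.15, 5.1])

# Sort observations by |intercept estimate| in descending order and keep
# the top m as outlier candidates.
m = 3
order = np.argsort(abs_intercepts)[::-1]
candidates = sorted(order[:m].tolist())
print(candidates)   # -> [2, 4, 7]
```

These candidate indices are then the columns of the indicator block in the mean-shift outlier model of Section 2.1.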
In this section, we briefly describe Bayesian variable selection and introduce stochastic search variable selection (George and McCulloch, 1993). Finally, we explain how to simultaneously detect outliers and select variables for the regression model
George and McCulloch (1993, 1997) assumed that the prior distributions of the regression coefficients are independent and expressed the prior distribution of each coefficient as a mixture of two normal distributions, both centered at 0, one with a very small variance and the other with a very large variance.
Then we can describe stochastic search variable selection using
The value of
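Under this two-normal mixture prior, the full conditional probability that a coefficient belongs to the wide ("included") component follows from comparing the two normal densities at the current coefficient value. A minimal sketch; the hyperparameter names tau, c, and p_j are ours, chosen for illustration, and the numerical values are arbitrary:

```python
import math

def gamma_update_prob(beta_j, tau=0.1, c=10.0, p_j=0.5):
    """P(gamma_j = 1 | beta_j) under the two-normal mixture prior:
    beta_j ~ (1 - gamma_j) N(0, tau^2) + gamma_j N(0, (c*tau)^2)."""
    def normal_pdf(x, sd):
        return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    a = p_j * normal_pdf(beta_j, c * tau)        # wide ("in the model") component
    b = (1 - p_j) * normal_pdf(beta_j, tau)      # narrow ("negligible") component
    return a / (a + b)

# A coefficient far from 0 is almost surely from the wide component...
print(gamma_update_prob(1.5))    # close to 1
# ...while one near 0 favours the narrow component (probability well below 0.5).
print(gamma_update_prob(0.01))
```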
Our approach to simultaneous variable selection and outlier detection consists of two steps: the first step is to determine a set of outlier candidates using the properties of an intercept estimator in the DBRM described in Section 2.2. The second step is to perform SSVS on the mean-shift outlier model with the outlier candidates detected in step 1. Our approach is similar to that of Hoeting
To simultaneously perform outlier detection and variable selection, we use the hierarchical model of
where
where
Using conjugate priors, it is easy to obtain the posterior distributions; thus, Gibbs sampling procedures are easily implemented for calculating them. Accordingly, the posterior distributions are as follows:
where
Therefore the best subset of variables is selected according to the information contained in the
We conduct simulations to evaluate the performance of our approach against that of an existing alternative, BayesVarSel. We therefore introduce BayesVarSel before presenting the simulations.
García-Donato and Forte (2017) introduced the R package BayesVarSel, which implements Bayesian methodology for hypothesis testing and variable selection in linear models. To compare with our method in the simulations, we use the variable selection functions in this package: Bvs, PBvs, and GibbsBvs. Except for a few arguments, the usage of the three functions is very similar.
This package implements the criteria-based priors for the regression coefficients proposed by Bayarri
We consider two cases of multiple regression:
We apply SSVS using the hyperparameter settings: prior correlation matrix
In all cases, 30,000 samples from the MCMC simulation are used to estimate the parameters, with the first 10,000 samples discarded as burn-in. In addition, we confirm the convergence of the Markov chains using the Gelman–Rubin diagnostic (Gelman and Rubin, 1992); all values are close to one.
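For reference, the basic Gelman–Rubin statistic compares within-chain and between-chain variability; a minimal sketch on synthetic chains (not our actual MCMC output):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (Gelman and Rubin, 1992) for
    several chains of equal length; values near 1 indicate convergence."""
    chains = np.asarray(chains, dtype=float)   # shape: (m chains, n draws)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

# Two well-mixed chains sampling the same target give a value near 1.
rng = np.random.default_rng(1)
chains = rng.normal(size=(2, 5000))
print(round(gelman_rubin(chains), 3))
```

Chains stuck at different locations would instead give a value well above 1, signalling non-convergence.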
For each case, we compare our approach with Bvs (
The performance of our proposed procedure is evaluated in two parts: outlier detection and variable selection. In the first part, we use three criteria proposed by Choi
In the second part, to select variables, we use two criteria (Choi,
Also, to compare the performance of our approach with that of BayesVarSel, we consider the following two models: the highest probability model (HPM) and the median probability model (MPM), which consists of the variables with inclusion probability greater than 0.5 (Barbieri and Berger, 2004).
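Both summaries can be computed directly from the MCMC draws of the inclusion vector: the HPM is the most frequently visited model, and the MPM keeps every variable whose marginal inclusion probability exceeds 0.5. A sketch on simulated draws (the inclusion probabilities below are hypothetical, not results from our study):

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical MCMC draws of the inclusion vector gamma (rows = draws,
# columns = predictors); here variables 0 and 1 are usually included.
probs = np.array([0.95, 0.9, 0.1, 0.05])
gamma_draws = (rng.random((4000, 4)) < probs).astype(int)

# HPM: the single model (gamma pattern) visited most often.
counts = Counter(map(tuple, gamma_draws))
hpm = max(counts, key=counts.get)

# MPM: keep each variable whose marginal inclusion probability > 0.5.
incl_prob = gamma_draws.mean(axis=0)
mpm = tuple((incl_prob > 0.5).astype(int))

print(hpm, mpm)   # both should be (1, 1, 0, 0) here
```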
The assignment of prior probabilities
With regard to variable selection, Table 3 (
We now examine one data set in detail in order to illustrate the procedure and the results it produces. Consider the following model:
Accordingly, the dataset (
This section illustrates the performance of our method on the Scottish hill racing data (Atkinson, 1986). This data set is used by Hoeting
To identify outlier candidates, we calculate the absolute values of the intercept estimators (
Accordingly, the dataset is sorted by the order of
High-frequency models and estimation results for our example data are summarized in Table 7 and Table 8. BayesVarSel yields the same results; therefore,
In this paper, we have adopted the mean-shift outlier model in order to include information on outliers. This approach to modeling outliers is used by Kim
Accordingly, we suggest an alternative approach for simultaneous outlier detection and variable selection. First, using the properties of an intercept estimator in the DBRM (Park and Kim, 2018b), outlier candidates are determined and the information on outliers is reflected in the multiple regression model. Second, we select the best model from the model containing all variables, including the outlier candidates, by using SSVS.
As shown in the simulation results and the real data analysis, the proposed method performs well under proper conditions. Furthermore, it has the advantage that the relative sizes of the outliers can be read off from the resulting statistics. However, our method is affected by constant values such as
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2018R1D1A1B070).
Priors used in the BayesVarSel package

Setting | Options
---|---
(a) Prior probabilities for models | prior.models = "ScottBerger" (default); prior.models = "Constant"; prior.models = "User", priorprobs
(b) Prior probabilities for the coefficients | prior.betas = "Robust" (default); prior.betas = "ZellnerSiow"; prior.betas = "gZellner"; prior.betas = "FLS"; prior.betas = "Liangetal"
(c) Null model contains just the intercept | fixed.cov = c("Intercept") (default); fixed.cov = NULL
Criteria of outlier detection

 | True: outlier | True: non-outlier | Sum
---|---|---|---
Detection: outlier | | |
Detection: non-outlier | | |
Sum | | |
Results on simulated data with
(Columns CS–Mode report variable selection; columns PD–Mode report outlier detection.)

n | Method | CS | AN | Max | Mode | PD | PD + OS | AN | Max | Mode
---|---|---|---|---|---|---|---|---|---|---
30 | SSVS | 0.90 | 2.13 | 4 | 2 | 0.88 | 0.99 | 3.12 | 5 | 3 | ||
0.88 | 2.15 | 4 | 2 | 0.76 | 0.98 | 3.26 | 5 | 3 | ||||
0.91 | 2.07 | 4 | 2 | 0.93 | 0.99 | 3.08 | 6 | 3 | ||||
0.86 | 2.20 | 4 | 2 | 0.88 | 0.99 | 3.16 | 6 | 3 | ||||
0.93 | 2.07 | 4 | 2 | 0.98 | 0.99 | 3.00 | 4 | 3 | ||||
0.90 | 2.13 | 4 | 2 | 0.93 | 0.99 | 3.08 | 6 | 3 | ||||
BayesVarSel | FLS | 0.91 | 2.10 | 4 | 2 | 0.79 | 0.99 | 3.22 | 5 | 3 | ||
gZellner | 0.93 | 2.08 | 4 | 2 | 0.82 | 0.99 | 3.18 | 5 | 3 | |||
Liangetal | 0.91 | 2.10 | 4 | 2 | 0.80 | 0.99 | 3.21 | 5 | 3 | |||
Robust (default) | 0.91 | 2.10 | 4 | 2 | 0.76 | 0.99 | 3.26 | 5 | 3 | |||
ZellnerSiow | 0.91 | 2.10 | 4 | 2 | 0.80 | 0.99 | 3.21 | 5 | 3 | |||
50 | SSVS | 0.94 | 2.07 | 4 | 2 | 0.69 | 0.95 | 5.30 | 8 | 5 | ||
0.90 | 2.13 | 4 | 2 | 0.53 | 0.95 | 5.58 | 9 | 5 | ||||
0.95 | 2.05 | 3 | 2 | 0.96 | 1.00 | 5.04 | 6 | 5 | ||||
0.93 | 2.08 | 4 | 2 | 0.91 | 0.99 | 5.10 | 7 | 5 | ||||
0.97 | 2.02 | 4 | 2 | 0.95 | 0.95 | 4.95 | 5 | 5 | ||||
0.95 | 2.05 | 3 | 2 | 0.92 | 0.95 | 4.98 | 6 | 5 | ||||
BayesVarSel | FLS | 0.95 | 2.05 | 3 | 2 | 0.49 | 0.95 | 5.64 | 9 | 5 | ||
gZellner | 0.94 | 2.07 | 4 | 2 | 0.49 | 0.95 | 5.64 | 9 | 5 | |||
Liangetal | 0.94 | 2.07 | 4 | 2 | 0.48 | 0.95 | 5.66 | 9 | 5 | |||
Robust (default) | 0.94 | 2.07 | 4 | 2 | 0.46 | 0.95 | 5.70 | 9 | 5 | |||
ZellnerSiow | 0.94 | 2.07 | 4 | 2 | 0.47 | 0.95 | 5.67 | 9 | 5 | |||
100 | SSVS | 0.93 | 2.08 | 4 | 2 | 0.70 | 0.99 | 10.31 | 12 | 10 | ||
0.96 | 2.04 | 3 | 2 | 0.65 | 0.99 | 10.41 | 12 | 10 | ||||
0.97 | 2.03 | 3 | 2 | 0.93 | 0.97 | 10.02 | 11 | 10 | ||||
0.99 | 2.01 | 3 | 2 | 0.92 | 0.97 | 10.03 | 11 | 10 | ||||
0.96 | 2.04 | 3 | 2 | 0.99 | 0.99 | 9.99 | 10 | 10 | ||||
0.97 | 2.04 | 4 | 2 | 0.98 | 0.99 | 10.09 | 20 | 10 | ||||
BayesVarSel | FLS | 0.97 | 2.03 | 3 | 2 | 0.41 | 0.99 | 10.95 | 14 | 10 | ||
gZellner | 0.89 | 2.11 | 3 | 2 | 0.21 | 0.99 | 11.60 | 16 | 11 | |||
Liangetal | 0.90 | 2.10 | 3 | 2 | 0.27 | 0.99 | 11.42 | 16 | 10 | |||
Robust (default) | 0.90 | 2.10 | 3 | 2 | 0.25 | 0.99 | 11.51 | 16 | 11 | |||
ZellnerSiow | 0.90 | 2.10 | 3 | 2 | 0.27 | 0.99 | 11.42 | 16 | 10 |
Results on simulated data with
(Columns CS–Mode report variable selection; columns PD–Mode report outlier detection.)

n | Method | CS | AN | Max | Mode | PD | PD + OS | AN | Max | Mode
---|---|---|---|---|---|---|---|---|---|---
30 | SSVS | 0.73 | 3.44 | 7 | 3 | 0.73 | 0.91 | 3.14 | 6 | 3 | ||
0.71 | 3.47 | 6 | 3 | 0.55 | 0.91 | 3.40 | 6 | 3 | ||||
0.70 | 3.27 | 7 | 3 | 0.81 | 0.87 | 2.94 | 4 | 3 | ||||
0.66 | 3.47 | 6 | 3 | 0.76 | 0.87 | 3.09 | 6 | 3 | ||||
0.78 | 3.20 | 6 | 3 | 0.89 | 0.91 | 2.94 | 4 | 3 | ||||
0.70 | 3.48 | 9 | 3 | 0.85 | 0.90 | 3.00 | 4 | 3 | ||||
BayesVarSel | FLS | 0.82 | 3.12 | 7 | 3 | 0.66 | 0.91 | 3.20 | 6 | 3 | ||
gZellner | 0.88 | 3.01 | 4 | 3 | 0.82 | 0.91 | 3.02 | 5 | 3 | |||
Liangetal | 0.84 | 3.06 | 5 | 3 | 0.68 | 0.91 | 3.18 | 6 | 3 | |||
Robust (default) | 0.81 | 3.14 | 7 | 3 | 0.66 | 0.91 | 3.22 | 6 | 3 | |||
ZellnerSiow | 0.84 | 3.06 | 5 | 3 | 0.68 | 0.91 | 3.19 | 6 | 3 | |||
50 | SSVS | 0.87 | 3.20 | 6 | 3 | 0.79 | 0.94 | 5.10 | 7 | 5 | ||
0.77 | 3.37 | 8 | 3 | 0.57 | 0.94 | 5.43 | 7 | 5 | ||||
0.87 | 3.17 | 6 | 3 | 0.92 | 0.93 | 4.95 | 6 | 5 | ||||
0.79 | 3.28 | 6 | 3 | 0.90 | 0.93 | 5.00 | 7 | 5 | ||||
0.89 | 3.16 | 6 | 3 | 0.94 | 0.94 | 4.93 | 5 | 5 | ||||
0.87 | 3.23 | 7 | 3 | 0.92 | 0.94 | 4.95 | 6 | 5 | ||||
BayesVarSel | FLS | 0.88 | 3.10 | 5 | 3 | 0.64 | 0.94 | 5.27 | 7 | 5 | ||
gZellner | 0.90 | 3.10 | 6 | 3 | 0.66 | 0.94 | 5.26 | 7 | 5 | |||
Liangetal | 0.88 | 3.12 | 6 | 3 | 0.62 | 0.94 | 5.31 | 7 | 5 | |||
Robust (default) | 0.85 | 3.15 | 6 | 3 | 0.62 | 0.94 | 5.32 | 7 | 5 | |||
ZellnerSiow | 0.88 | 3.12 | 6 | 3 | 0.62 | 0.94 | 5.31 | 7 | 5 | |||
100 | SSVS | 0.89 | 3.14 | 5 | 3 | 0.73 | 0.96 | 10.21 | 13 | 10 | ||
0.82 | 3.29 | 9 | 3 | 0.58 | 0.95 | 10.35 | 12 | 10 | ||||
0.87 | 3.20 | 6 | 3 | 0.95 | 0.97 | 9.99 | 11 | 10 | ||||
0.91 | 3.13 | 6 | 3 | 0.95 | 0.97 | 9.99 | 11 | 10 | ||||
0.92 | 3.09 | 5 | 3 | 0.96 | 0.96 | 9.96 | 10 | 10 | ||||
0.94 | 3.07 | 5 | 3 | 0.95 | 0.96 | 9.97 | 11 | 10 | ||||
BayesVarSel | FLS | 0.92 | 3.08 | 4 | 3 | 0.45 | 0.96 | 10.73 | 15 | 10 | ||
gZellner | 0.85 | 3.17 | 5 | 3 | 0.31 | 0.96 | 11.08 | 15 | 11 | |||
Liangetal | 0.85 | 3.17 | 5 | 3 | 0.31 | 0.96 | 11.09 | 15 | 11 | |||
Robust (default) | 0.85 | 3.17 | 5 | 3 | 0.31 | 0.96 | 11.10 | 15 | 11 | |||
ZellnerSiow | 0.85 | 3.17 | 5 | 3 | 0.31 | 0.96 | 11.09 | 15 | 11 |
High-frequency models for simulated data using SSVS

Model | Index set (model selection) | Index set (outlier detection) | Probability
---|---|---|---
1 | { | { | 0.99965
2 | { | { | 0.00005
3 | { | { | 0.00010
4 | { | { | 0.00010
5 | { | { | 0.00005
6 | { | { | 0.00005
Estimation results for simulated data using SSVS
Parameter | Quantile (95%) | Median | Mean | sd | Inclusion probability | ||
---|---|---|---|---|---|---|---|
2.5% | 97.5% | ||||||
Variable selection | 0.816 | 1.589 | 1.201 | 1.202 | 0.196 | 0.9999 | |
1.175 | 2.067 | 1.623 | 1.623 | 0.227 | 0.99995 | ||
−0.071 | 0.820 | 0.354 | 0.359 | 0.227 | 0 | ||
−0.487 | 0.227 | −0.121 | −0.122 | 0.182 | 0 | ||
Outlier detection | −1.149 | −0.639 | −0.896 | −0.895 | 0.129 | 0.99995 | |
−1.180 | −0.681 | −0.932 | −0.932 | 0.125 | 1 | ||
−1.173 | −0.682 | −0.927 | −0.928 | 0.125 | 0.9999 | ||
−1.612 | −1.123 | −1.368 | −1.368 | 0.124 | 1 | ||
1.140 | 1.691 | 1.418 | 1.418 | 0.140 | 0.99995 | ||
−0.151 | 0.316 | 0.085 | 0.084 | 0.119 | 0 | ||
−0.097 | 0.397 | 0.154 | 0.153 | 0.126 | 0 | ||
−0.433 | 0.068 | −0.185 | −0.183 | 0.127 | 0 | ||
0.000 | 0.480 | 0.239 | 0.240 | 0.122 | 0 | ||
0.090 | 0.614 | 0.353 | 0.352 | 0.134 | 0 |
High-frequency models for our example data using SSVS

Model | Index set (model selection) | Index set (outlier detection) | Probability
---|---|---|---
1 | { | { | 0.9996
2 | { | { } | 0.0003
3 | { | { | 0.0001
Estimation results for our example data using SSVS
Parameter | Quantile (95%) | Median | Mean | sd | Inclusion probability | |
---|---|---|---|---|---|---|---|
2.5% | 97.5% | ||||||
Variable selection | 21.045 | 35.307 | 28.155 | 28.177 | 3.621 | 0.9999 | |
18.938 | 29.414 | 24.186 | 24.178 | 2.643 | 1 | ||
Outlier detection | −5.043 | 0.273 | −2.429 | −2.414 | 1.354 | 0 | |
−2.277 | 3.543 | 0.620 | 0.609 | 1.474 | |||
−1.299 | 4.622 | 1.697 | 1.684 | 1.502 | 0 | ||
−5.038 | 0.324 | −2.361 | −2.362 | 1.365 | 0 | ||
−6.267 | −0.638 | −3.527 | −3.512 | 1.419 | 0 | ||
8.725 | 14.084 | 11.377 | 11.386 | 1.358 | 0.9997 | ||
−0.476 | 9.577 | 4.590 | 4.574 | 2.558 | 0 |