
We consider a regression model
$$y = X\beta + \varepsilon,$$
where $y$ is an $n \times 1$ vector of responses, $X$ is an $n \times p$ matrix of predictors, $\beta$ is a $p \times 1$ vector of unknown parameters, and $\varepsilon$ is an $n \times 1$ vector of errors.
Many methods resistant to the masking and swamping effects have been suggested. Marasinghe (1985) suggested a multistage procedure to reduce the computation required to obtain the most likely outlier subset. A generalized extreme studentized residual (GESR) procedure was proposed by Paul and Fung (1991). Kianifard and Swallow (1989, 1990) proposed test statistics based on recursive residuals for identifying outliers. Atkinson (1994) suggested a fast method for the detection of multiple outliers using forward search. Peña and Yohai (1999) proposed a method which is fast and computationally feasible even when the number of observations is very large.
One general approach to outlier detection separates the data into a clean subset that is presumably free of outliers and a subset that contains all potential outliers. This approach usually consists of several stages: construction of a basic clean subset, calculation of residuals, a test for outlyingness, and construction of a new clean subset. Hadi and Simonoff's procedures (Hadi and Simonoff, 1993) adopt this approach and are summarized as follows.
A basic clean subset $M$ of size $p+1$ is constructed first. The model is then fitted to $M$ and scaled residuals are computed for all observations. Let $|d|_{(1)} \le \cdots \le |d|_{(n)}$ denote the ordered absolute scaled residuals. The least outlying observation not in the current clean subset is tested; if it is not declared an outlier, the clean subset is enlarged by one observation and the steps are repeated, otherwise all remaining observations are declared outliers.
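To make the structure of this clean-subset search concrete, the following is a minimal sketch in Python (numpy/scipy). It is only an illustration of the general approach: the function name `forward_search`, the use of a plain scaled residual, and the Bonferroni-type $t$ cutoff are our assumptions, not Hadi and Simonoff's exact statistic or critical value.

```python
import numpy as np
from scipy import stats

def forward_search(X, y, clean=None, alpha=0.05):
    """Generic clean-subset forward search (illustrative sketch only).

    Repeatedly fits LS to the current clean subset, tests the least
    outlying remaining observation, and either absorbs it into the clean
    subset or declares all remaining observations to be outliers.
    """
    n, p = X.shape
    if clean is None:
        # default basic clean subset: the p+1 smallest squared LS residuals
        beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
        clean = np.argsort((y - X @ beta0) ** 2)[: p + 1]
    clean = np.asarray(clean, dtype=int)
    while clean.size < n:
        beta = np.linalg.lstsq(X[clean], y[clean], rcond=None)[0]
        resid = y - X @ beta
        dof = max(clean.size - p, 1)
        s = np.sqrt(np.sum(resid[clean] ** 2) / dof)
        outside = np.setdiff1d(np.arange(n), clean)
        d = np.abs(resid[outside]) / s                    # scaled residuals
        cutoff = stats.t.ppf(1 - alpha / (2 * (clean.size + 1)), dof)
        if d.min() > cutoff:
            return np.sort(outside)                       # remainder declared outliers
        clean = np.append(clean, outside[np.argmin(d)])   # grow the clean subset
    return np.array([], dtype=int)                        # no outliers declared
```

Here `X` is assumed to contain a column of ones for the intercept, and the routine returns the indices of the observations finally declared outliers.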
The main purpose of this paper is to develop outlier detection methods that are resistant to masking and swamping effects. The newly suggested methods use least quantile squares (LQS) estimates to divide the data into two groups, a clean set and an outlier set. LQS estimates are a generalization of LMS estimates. They have not been used as much as LMS because their breakdown points become small as the quantile value increases. But if the number of outliers is assumed to be fixed, LQS estimates yield a good fit to the majority of the data, and residuals calculated from LQS estimates can be a reliable tool for detecting outliers. In Section 2 the proposed methods are described with some examples. In Section 3 a summary and concluding comments are presented.
For detecting outliers it is important to estimate the model in a way that is immune to the influence of the outliers. The most widely used estimator of the parameters in the regression model is the LS estimator, which minimizes the sum of squared residuals. The drawback of the LS estimator is that it is sensitive to outliers in the data. A useful measure of the robustness of an estimator is its breakdown point. Roughly speaking, this is the smallest fraction of arbitrary contamination that can carry the estimate arbitrarily far from the true value. For least squares the breakdown point is $1/n$, which tends to zero as the sample size grows, so a single outlier can ruin the fit.
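As a quick numerical illustration (a toy example of ours, not taken from the paper), shifting a single response in a simulated data set is enough to move the LS slope far from the true value of 2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 25)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)    # true slope 2

X = np.column_stack([np.ones_like(x), x])
slope_clean = np.linalg.lstsq(X, y, rcond=None)[0][1]

y_bad = y.copy()
y_bad[-1] += 100.0                                   # a single gross outlier
slope_bad = np.linalg.lstsq(X, y_bad, rcond=None)[0][1]

print(slope_clean, slope_bad)   # the contaminated LS slope is far from 2
```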
Many robust estimators have been suggested from the viewpoint of the breakdown point. The most widely used high-breakdown estimate is the LMS estimator, given by
$$\hat{\beta}_{\mathrm{LMS}} = \arg\min_{\beta} \, \operatorname{med}_{i} \, r_i^2(\beta),$$
where $r_i(\beta) = y_i - x_i^{\mathrm T}\beta$ denotes the $i$th residual.
We define the LQS estimator as
$$\hat{\beta}_{\mathrm{LQS}} = \arg\min_{\beta} \, r^2_{(q)}(\beta),$$
where $r^2_{(1)}(\beta) \le \cdots \le r^2_{(n)}(\beta)$ are the ordered squared residuals and $q$ is a chosen quantile; choosing $q$ near $n/2$ recovers the LMS estimator.
The computation of LQS estimates proceeds in the same way as that of LMS estimates; for large problems the subsampling algorithm of Rousseeuw and Leroy (1987) can be used to approximate them.
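The following Python sketch illustrates this subsampling idea for the LQS objective. The function name `lqs_fit`, the number of elemental subsets, and the default choice of $q$ are our assumptions for illustration, not a prescription from the paper.

```python
import numpy as np

def lqs_fit(X, y, q, n_subsets=3000, rng=None):
    """Approximate LQS fit by elemental-subset sampling (sketch).

    Minimizes the q-th ordered squared residual over candidate fits,
    each obtained from a randomly drawn subset of p observations.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])    # exact fit to the p-subset
        except np.linalg.LinAlgError:
            continue                                  # skip singular subsets
        crit = np.sort((y - X @ beta) ** 2)[q - 1]    # q-th ordered squared residual
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit
```

With $q = n - k$ for a presumed maximum of $k$ outliers, the residuals from such a fit give a ranking of the observations that is largely unaffected by the $k$ most deviant points.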
We now propose new procedures for the identification of outliers in regression using LQS estimates.
We suggest using LQS residuals to construct a clean subset $M$, and then using LS residuals from the fit to $M$ to test the outlyingness of the remaining observations.
The three newly suggested procedures, denoted S1, S2, and S3, are described as follows. Let M1 denote Hadi and Simonoff's procedure using their first method for building a basic clean subset.
S1: This method differs from M1 only in the construction of the basic clean subset. The basic clean subset is the set of observations with the smallest squared residuals calculated from the LQS fit; the remaining steps are those of M1.
The second method uses LQS estimates for constructing both a basic clean subset and new clean subsets.
S2: The basic clean subset is obtained by the same method used in S1. After the outlyingness test, if the search is to continue, a new clean subset of the next larger size is formed from the observations with the smallest squared residuals from the LQS fit, and the procedure is repeated.
As forward-searching methods, neither LS, used in M1, nor LQS is guaranteed to avoid masking and swamping. The third method is designed to be relatively resistant to both masking and swamping effects, which may occur in the process of S2. Let the residuals of all observations be computed and ordered at each step of the search.
S3: Find a basic clean subset by using LQS estimates, then compute and order the residuals at each step, carrying out the outlyingness test before constructing the next clean subset.
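As an illustration, the sketch below gives one possible reading of S1, reusing the hypothetical `lqs_fit` and `forward_search` functions from the earlier sketches: the basic clean subset is taken to be the observations with the smallest squared LQS residuals, and the remaining steps follow the generic forward search. The choice of $q$ and the initial subset size $p+1$ are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

def s1_outliers(X, y, q=None, alpha=0.05, rng=0):
    """One reading of procedure S1 (sketch): the basic clean subset is the
    set of observations with the smallest squared LQS residuals; the
    search then proceeds as in the generic forward search above."""
    n, p = X.shape
    if q is None:
        q = n // 2 + 1                               # assumed default quantile
    beta_lqs, _ = lqs_fit(X, y, q, rng=rng)
    order = np.argsort((y - X @ beta_lqs) ** 2)      # most trustworthy points first
    return forward_search(X, y, clean=order[: p + 1], alpha=alpha)
```

Under the same reading, S2 would differ by recomputing each new clean subset from the LQS residuals rather than from the LS fit.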
We now show an example to illustrate the proposed procedures. Table 1 lists an artificial data set of 25 observations; for each observation, two slightly different versions of the predictor and response values are given.
When the first version of the data is used, the results of M1, S1 and S2 are summarized in Table 2; Table 3 gives the results of M1 for the second version.
A Monte Carlo experiment comparing the power of M1, S1, S2, and S3 was performed. We use the same outlier patterns and simulation scheme as used in Hadi and Simonoff (1993). The simulation results are limited in scope because the pattern of the predictor, the positions of the outliers and the sample size are fixed. The data sets are generated from a simple linear regression model with normal errors.
The first column of Table 4 indicates the model; the remaining entries are the estimated probabilities for each method.
The first row of Table 4 shows that all methods control the size of the Type I error at approximately 0.05. S1, S2, and S3 provided better overall performance than M1.
S3 gave the best results among the examined methods. S2 and S3 had similar performance. The comparison of H3 and H3(s) shows that the differences in performance among M1, S1, S2 and S3 become larger when the ordinary observations are generated from a model with small variance. The model H3h2(s) used a data set containing a group of 3 high-leverage outliers and an additional group of 2 outliers positioned at the lower right corner. The results for this model show that S1 and S3 are considerably less affected by the additional group of outliers than M1 and S2.
Hadi and Simonoff's procedure and the suggested methods have low breakdown points when the fraction of outliers is large. We therefore examine their behaviour on two well-known data sets containing groups of outliers.
The modified wood gravity data set (Rousseeuw and Leroy, 1987) contains 20 observations on five predictors and includes four outliers; M1, S1, S2 and S3 all succeed in identifying the true outliers. The Hawkins, Bradu and Kass data set (Hawkins et al., 1984) contains 75 observations on three predictors, of which the first ten observations are outliers.
In this article we have proposed three procedures for the identification of outliers using LQS estimates. The example and the Monte Carlo results indicate that the proposed procedures, in particular S3, are more resistant to masking and swamping effects than Hadi and Simonoff's procedure.
Compared with the other methods, the LQS approach requires more computation, but in large regression problems a subsampling algorithm (Rousseeuw and Leroy, 1987) for approximating LQS estimates can be used to avoid excessive computation.
Table 1. Artificial data
No. | | | | | No. | | | | |
---|---|---|---|---|---|---|---|---|---|
1 | −4.00 | −4.00 | 0.00 | 0.00 | 14 | 9.87 | 9.87 | 10.11 | 10.11 |
2 | 20.00 | 20.00 | 26.00 | 24.00 | 15 | 2.55 | 2.55 | 3.03 | 3.03 |
3 | 19.80 | 19.90 | 25.90 | 23.90 | 16 | 7.51 | 7.51 | 6.86 | 6.86 |
4 | 19.60 | 19.80 | 25.80 | 23.80 | 17 | 2.67 | 2.67 | 2.10 | 2.10 |
5 | −5.00 | −5.00 | −11.00 | −9.00 | 18 | 4.40 | 4.40 | 3.74 | 3.74 |
6 | −4.80 | −4.90 | −10.90 | −8.90 | 19 | 7.65 | 7.65 | 7.57 | 7.57 |
7 | −4.60 | −4.80 | −10.80 | −8.80 | 20 | 7.01 | 7.01 | 6.40 | 6.40 |
8 | 11.36 | 11.36 | 11.10 | 11.10 | 21 | 1.28 | 1.28 | 1.05 | 1.05 |
9 | 11.66 | 11.66 | 11.92 | 11.92 | 22 | 4.48 | 4.48 | 4.72 | 4.72 |
10 | 0.20 | 0.20 | −0.27 | −0.27 | 23 | 8.73 | 8.73 | 9.39 | 9.39 |
11 | 5.27 | 5.27 | 4.95 | 4.95 | 24 | 4.36 | 4.36 | 4.63 | 4.63 |
12 | 10.52 | 10.52 | 11.83 | 11.83 | 25 | 5.47 | 5.47 | 6.04 | 6.04 |
13 | 6.16 | 6.16 | 6.34 | 6.34 |
Table 2. Results of M1, S1 and S2 for the first version of the data in Table 1
No. of potential outliers | Size of clean subset | Subset of potential outliers (M1, S1) | Test | Subset of potential outliers (S2) | Test |
---|---|---|---|---|---|
12 | 13 | (1 8 9 10 14 15 16 17 21 22 24 25) | N | (1 8 9 10 13 15 16 17 21 22 24 25) | N |
11 | 14 | (1 8 9 10 15 16 17 21 22 24 25) | N | (1 2 3 4 5 6 7 12 15 23 25) | N |
10 | 15 | (1 8 9 10 15 17 21 22 24 25) | N | (1 2 3 4 5 6 7 12 15 23) | N |
9 | 16 | (1 8 9 10 15 17 21 22 24 25) | N | (1 2 3 4 5 6 7 12 16) | N |
8 | 17 | (1 8 9 10 15 21 22 24 25) | N | (1 2 3 4 5 6 7 12) | N |
7 | 18 | (1 8 9 10 15 21 22) | N | (1 8 9 10 15 21 24) | N |
6 | 19 | (1 8 9 10 15 21) | N | (1 8 9 10 15 21) | N |
5 | 20 | (1 8 10 15 21) | N | (1 8 10 15 21) | N |
4 | 21 | (1 8 15 21) | N | (1 8 10 15) | N |
3 | 22 | (1 8 15) | N | (1 8 15) | N |
2 | 23 | (1 8) | N | (1 8) | N |
1 | 24 | (1) | Y | (1) | Y |
Table 3. Results of M1 for the second version of the data in Table 1
No. of potential outliers | Size of clean subset | Subset of potential outliers | Test |
---|---|---|---|
11 | 14 | (1 2 3 4 5 6 7 12 15 23 25) | N |
10 | 15 | (1 2 3 4 5 6 7 12 15 23) | N |
9 | 16 | (1 2 3 4 5 6 7 12 25) | N |
8 | 17 | (1 2 3 4 5 6 7 12) | N |
7 | 18 | (1 2 3 4 5 6 7) | N |
Table 4. Summary of the Monte Carlo simulation
Model | S1 | S2 | S3 | M1 | Model | S1 | S2 | S3 | M1 |
---|---|---|---|---|---|---|---|---|---|
Null | 0.057 | 0.049 | 0.049 | 0.056 | Null | 0.057 | 0.049 | 0.049 | 0.056 |
H1 | 0.599 | 0.595 | 0.610 | 0.570 | H5 | 0.352 | 0.363 | 0.417 | 0.347 |
 | 0.663 | 0.662 | 0.671 | 0.621 | | 0.396 | 0.402 | 0.452 | 0.383 |
 | 0.069 | 0.074 | 0.063 | 0.054 | | 0.139 | 0.116 | 0.089 | 0.125 |
L1 | 0.762 | 0.772 | 0.782 | 0.756 | L5 | 0.688 | 0.681 | 0.696 | 0.675 |
 | 0.834 | 0.831 | 0.839 | 0.834 | | 0.755 | 0.741 | 0.756 | 0.746 |
 | 0.072 | 0.059 | 0.057 | 0.078 | | 0.071 | 0.063 | 0.063 | 0.089 |
H3 | 0.540 | 0.512 | 0.558 | 0.490 | H3(s) | 0.849 | 0.755 | 0.866 | 0.751 |
 | 0.600 | 0.575 | 0.598 | 0.539 | | 0.923 | 0.817 | 0.927 | 0.811 |
 | 0.080 | 0.065 | 0.051 | 0.066 | | 0.078 | 0.067 | 0.061 | 0.061 |
L3 | 0.717 | 0.723 | 0.727 | 0.716 | H3h2(s) | 0.671 | 0.483 | 0.734 | 0.476 |
 | 0.788 | 0.779 | 0.781 | 0.786 | | 0.763 | 0.554 | 0.809 | 0.542 |
 | 0.069 | 0.056 | 0.054 | 0.074 | | 0.107 | 0.083 | 0.086 | 0.080 |