Least quantile squares method for the detection of outliers

Han Son Seo (a), Min Yoon (1,b)

(a) Department of Applied Statistics, Konkuk University, Korea;
(b) Department of Applied Mathematics, Pukyong National University, Korea
Correspondence to: (1) Department of Applied Mathematics, Pukyong National University, 45, Yongso-ro, Nam-Gu, Busan 48513, Korea.
E-mail: myoon@pknu.ac.kr

Received October 7, 2020; Revised November 4, 2020; Accepted November 9, 2020.
Abstract
k-least quantile of squares (k-LQS) estimates are a generalization of least median of squares (LMS) estimates. They have not been used as much as LMS because their breakdown points become small as k increases. But if the number of outliers is assumed to be fixed, LQS estimates yield a good fit to the majority of the data, and residuals calculated from LQS estimates can be a reliable tool for detecting outliers. We propose using LQS estimates to separate a clean set from the data based on the outlyingness of the cases. Three procedures are suggested for the identification of outliers using LQS estimates. Examples are provided to illustrate the methods. A Monte Carlo study shows that the proposed methods are effective.
Keywords : influential case, least quantile squares, outliers, masking and swamping effects, regression diagnostics
1. Introduction

We consider a regression model with p independent variables (including the constant if there is an intercept)

$Y=Xβ+ɛ,$

where Y is an n × 1 vector of responses, β is a p × 1 vector of unknown parameters, X is an n × p matrix of predictors, and ɛ is a vector of random errors with mean zero and variance matrix $\sigma^2 I_n$. Linear regression is commonly used to analyze data because of its simplicity and applicability. However, the methodology is susceptible to the disturbing effects of outliers contained in the data. Therefore, the detection and examination of outliers or influential cases are important parts of data analysis. In detecting multiple outliers, masking or swamping phenomena usually occur and make it difficult to find the true outliers.

Many methods resistant to masking and swamping effects have been suggested. Marasinghe (1985) suggested a multistage procedure that reduces the computation required to obtain the most likely outlier subset. A generalized extreme studentized residual (GESR) procedure was proposed by Paul and Fung (1991). Kianifard and Swallow (1989, 1990) proposed test statistics based on recursive residuals for identifying outliers. Atkinson (1994) suggested a fast method for the detection of multiple outliers using a forward search. Pena and Yohai (1999) proposed a method that is fast and computationally feasible even for a very large number of predictors p. Jajo (2005) reviewed and compared outlier detection methods based on residuals.

One general approach to outlier detection is to separate the data into a clean subset that is presumably free of outliers and a subset that contains all potential outliers. This approach is usually composed of several stages, such as construction of a basic clean subset, calculation of residuals, testing for outliers, and construction of a new clean subset. Hadi and Simonoff's procedures (Hadi and Simonoff, 1993) adopt this approach and are summarized as follows.

A clean subset M of size s is constructed, the basic clean subset having size s = int[(n + p − 1)/2]. Residuals $d_i$ are calculated after fitting the model to the subset M; they correspond to the internally studentized residuals for the cases included in M and to the scaled prediction errors for the others:

$d_i = \begin{cases} \dfrac{y_i - x_i^T \hat{\beta}_M}{\hat{\sigma}_M \sqrt{1 - x_i^T (X_M^T X_M)^{-1} x_i}}, & \text{if } i \in M, \\[1.5ex] \dfrac{y_i - x_i^T \hat{\beta}_M}{\hat{\sigma}_M \sqrt{1 + x_i^T (X_M^T X_M)^{-1} x_i}}, & \text{if } i \notin M. \end{cases} \qquad (1.1)$

Let $|d|_{(j)}$ be the jth order statistic of the $|d_i|$ in ascending order. If $|d|_{(s+1)} \ge t_{(\alpha/2(s+1),\, s-k)}$, then the last (n − s) ordered cases are declared outliers and the procedure stops. Otherwise, the procedure is repeated with a new clean subset consisting of the first (s + 1) ordered observations. If s + 1 = n, no outlier is declared. Two methods for constructing the basic clean subset were suggested. The first initially fits a model to the full data by least squares (LS), orders the observations by their standardized residuals, and takes the first p ordered observations as the early basic subset. After fitting a model to the early basic subset, a new basic subset is formed by taking the first (p + 1) observations ordered by the residuals corresponding to (1.1). The basic subset is increased systematically until its size equals int[(n + p − 1)/2]. The second method takes the backward-stepping approach (Rosner, 1975; Simonoff, 1988) based on single linkage clustering (Hartigan, 1981). A simulation study showed that the first method was more effective. Hadi and Simonoff (1993) demonstrated that their forward searching procedures are very effective and less affected by masking or swamping effects in comparison with other methods such as Marasinghe's (1985) multistage procedure, Paul and Fung's (1991) GESR procedure, Kianifard and Swallow's (1989, 1990) recursive-residual procedures, OLS, and high-breakdown methods such as least median of squares (LMS) (Rousseeuw, 1984) and MM estimation (Yohai, 1987).
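As a concrete sketch of the residuals in (1.1), the following computes the $d_i$ for a given clean subset M. This is our own illustrative code, not Hadi and Simonoff's implementation; the function and variable names are assumptions.

```python
import numpy as np

def subset_residuals(X, y, clean):
    """d_i of (1.1): internally studentized residuals inside M,
    scaled prediction errors outside M. `clean` is a boolean mask for M."""
    XM, yM = X[clean], y[clean]
    s, p = XM.shape
    beta, *_ = np.linalg.lstsq(XM, yM, rcond=None)   # LS fit on M
    resid_M = yM - XM @ beta
    sigma = np.sqrt(resid_M @ resid_M / (s - p))     # sigma_hat_M
    G = np.linalg.inv(XM.T @ XM)
    h = np.einsum('ij,jk,ik->i', X, G, X)            # x_i^T (X_M^T X_M)^{-1} x_i
    r = y - X @ beta
    d = np.empty_like(r)
    d[clean] = r[clean] / (sigma * np.sqrt(1 - h[clean]))
    d[~clean] = r[~clean] / (sigma * np.sqrt(1 + h[~clean]))
    return d
```

Sorting the $|d_i|$ and comparing $|d|_{(s+1)}$ with the t critical value then reproduces the test step of the procedure.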

The main purpose of this paper is to develop outlier detection methods that are invulnerable to masking and swamping effects. The newly suggested methods use least quantile of squares (LQS) estimates to divide the data into two groups, a clean set and an outlier set. LQS estimates are a generalization of LMS estimates. They have not been used as much as LMS because their breakdown points become small as the quantile value increases. But if the number of outliers is assumed to be fixed, LQS estimates yield a good fit to the majority of the data, and residuals calculated from LQS estimates can be a reliable tool for detecting outliers. In Section 2 the proposed methods are described with some examples. Section 3 presents a summary and concluding comments.

2. k-least quantile of squares estimates and outlier detection procedures

For detecting outliers it is important to estimate a model immune from the influence of the outliers. The most widely used estimator of the parameters in the regression model is the LS estimator, which minimizes the sum of squared residuals. The drawback of the LS estimator is that it is sensitive to outliers in the data. A useful measure of the robustness of an estimator is its breakdown point: roughly speaking, the smallest fraction of arbitrary contamination that can carry the estimator beyond all bounds. For least squares, the breakdown point is 1/n, which approaches zero as the sample size n increases.

Many robust estimators have been suggested in the context of the breakdown point. The most widely used high-breakdown estimator is the LMS estimator, given by

$\arg\min_\theta\, \operatorname*{med}_i\, r_i^2(\theta),$

where $r_i(\theta)$ is the residual of the ith observation. A generalization of the least median of squares estimator is the LQS estimator, which minimizes the kth smallest squared residual.

We define k-LQS as

$\arg\min_\theta\, r_{(k)}^2(\theta),$

where $r_{(i)}^2(\theta)$ is the ith order statistic of the $r_i^2(\theta)$'s in ascending order. Geometrically, this corresponds to finding the narrowest strip covering k observations. The breakdown point of k-LQS decreases as k increases.
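The k-LQS objective is easy to state in code: evaluate the squared residuals of a candidate coefficient vector and take the kth smallest. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def lqs_objective(theta, X, y, k):
    """kth smallest squared residual of the fit theta (1-based k)."""
    r2 = (y - X @ theta) ** 2
    return np.sort(r2)[k - 1]
```

Taking k at a median position (roughly int(n/2) + 1) recovers an LMS-type objective as a special case.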

The computation of k-LQS estimates is complex. Several algorithms for LQS estimates have been proposed (Stromberg, 1993; Carrizosa and Plastria, 1995; Watson, 1998). An LQS estimate involves determining Chebyshev solutions on subsets of the observations. Stromberg (1993) used least squares and developed an algorithm to compute exact LQS estimates by considering the $\hat{\theta}$'s based on all $\binom{n}{p+1}$ possible subsets of size p + 1. Watson (1998) provided alternative algorithms using Gaussian elimination and modified linear programming.
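The subset-enumeration idea can be sketched as follows: fit each subset of size p + 1 by least squares and keep the coefficient vector whose kth ordered squared residual is smallest. This is a hedged illustration of the enumeration described above, not Stromberg's actual algorithm (which exact methods refine with Chebyshev fits); for larger n a random sample of subsets replaces the exhaustive loop.

```python
import itertools
import numpy as np

def lqs_fit(X, y, k):
    """Elemental-subset search for a k-LQS fit (illustrative, exhaustive)."""
    n, p = X.shape
    best_theta, best_obj = None, np.inf
    for idx in itertools.combinations(range(n), p + 1):
        theta, *_ = np.linalg.lstsq(X[list(idx)], y[list(idx)], rcond=None)
        obj = np.sort((y - X @ theta) ** 2)[k - 1]  # kth smallest squared residual
        if obj < best_obj:
            best_theta, best_obj = theta, obj
    return best_theta, best_obj
```

With k set to the size of the presumed clean majority, the returned fit tracks the majority trend even when a cluster of outliers would ruin an ordinary LS fit.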

We now propose new procedures for the identification of outliers in regression using k-LQS estimates. Residuals calculated from LQS estimates are used to separate the data into a clean subset and an outlier subset. The processes considered in this paper, including Hadi and Simonoff's, perform an identification procedure twice: once to organize a clean subset and once to detect the candidate outliers to be tested. Hadi and Simonoff used LS residuals in both procedures, but the two procedures need not be the same.

We suggest using LQS residuals for constructing a clean set M and then using the LS residuals $d_i$ in (1.1) for detecting and testing outliers. The test is performed using a t-distribution in this procedure. If LQS residuals were used in computing the $d_i$ in (1.1), the distribution of the $d_i$ under the null hypothesis would not be clear, and a test procedure using LQS residuals would need p-values calculated by Monte Carlo simulation. The proposed procedures use k-LQS estimates on all observations when building a new clean subset, so each new clean subset is formed independently of the previous clean subsets, including the basic subset. Hadi and Simonoff's procedure, however, must organize a good basic subset because it builds each new clean subset from the LS residuals of the model fitted to the previous clean subset. The proposed methods are independent of the results of the previous stages and are expected to be more robust than Hadi and Simonoff's procedure in building a clean set.

Three newly suggested procedures, denoted S1, S2, and S3, are described as follows. Let “M1” denote Hadi and Simonoff's procedure using the first method for building a basic clean subset.

• S1: This method differs from M1 only in finding the basic clean subset. The basic clean subset is the set of observations with the smallest squared residuals calculated from the k-LQS regression with k = int[(n + p − 1)/2].

The second method uses LQS estimates for constructing both a basic clean subset and new clean subsets.

• S2: The basic clean subset is obtained by the same method used in S1. After the outlyingness test, if needed, a new clean subset of size k is constructed using k-LQS estimates.

As forward searching methods, neither LS (used in M1) nor LQS is perfect at avoiding masking and swamping. The third method is designed to be relatively resistant to both masking and swamping effects that may occur in the process of S2. Let $M_s$ denote the clean set of size s and $M_s^c$ be its complement, the set of potential outliers of size (n − s). We calculate $\gamma_s = n(M_s^c \cap M_{s+1}^c)/n(M_{s+1}^c)$, where $n(M_i)$ is the size of $M_i$. If $\gamma_s$ is small, it is highly probable that masking or swamping has occurred, and both subsets $M_s$ and $M_{s+1}$ are then used as clean sets. Procedure S3 is described as follows.

• S3: Find a basic clean subset by using LQS estimates, then compute and order the residuals according to (1.1). If $|d|_{(s+1)} \ge t_{(\alpha/2(s+1),\, s-k)}$, or if $|d|_{(s+1)} < t_{(\alpha/2(s+1),\, s-k)}$ and $\gamma_s \ge \delta$ for some fixed value of δ (we use δ = 0.5), then form a new subset $M_{s+1}$ by taking the (s + 1) smallest-residual observations calculated from the (s + 1)-LQS regression. Otherwise, continue the S2 procedure with the new subset $M_{s+1}$ and also perform the M1 procedure with $M_s$ as a basic clean subset. The final outliers are determined as the union of the outliers detected by the two procedures.
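The overlap ratio $\gamma_s$ amounts to one line of code. As a check, the two potential-outlier sets below are the $M_{13}^c$ and $M_{14}^c$ that appear in the worked example of this section; the function name is our own.

```python
# gamma_s = n(M_s^c ∩ M_{s+1}^c) / n(M_{s+1}^c): the fraction of the new
# potential-outlier set shared with the previous one. Sets hold case numbers.
def gamma(prev_comp, next_comp):
    return len(prev_comp & next_comp) / len(next_comp)

# The two consecutive potential-outlier sets from the example (s = 13, 14):
M13c = {1, 8, 9, 10, 13, 15, 16, 17, 21, 22, 24, 25}
M14c = {1, 2, 3, 4, 5, 6, 7, 12, 15, 23, 25}
```

Here the shared cases are {1, 15, 25}, so $\gamma = 3/11 \approx 0.27 < \delta = 0.5$, which triggers the extra M1 run in S3.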

We now give an example to illustrate the proposed procedures. Columns x1 and y1 in Table 1 contain 25 observations. The regular observations are generated from the model $Y_i = X_i + ɛ_i$, i = 8, . . . , 25, where $X_i \sim U(0, 15)$ and $ɛ_i \sim N(0, 0.5^2)$. Seven outliers, cases 1–7, are perturbed at $x_i = −4$, $x_i = −5 + 0.2(i − 1)$, and $X_i = 20 − 0.1(i − 1)$ to have $Y_i = X_i + 4$ or $Y_i = X_i − 6 − 0.1(i − 1)$, respectively. The outlyingness of the 7 cases is obvious, as seen in Figure 1. Columns x2 and y2 are obtained by moving the 7 outliers closer to the regular trend.

When columns x1 and y1 are used, S1, S2, and S3 correctly detected cases 1 to 7 as outliers, but M1 identified only case 1 as an outlier. M1 fails because its basic clean subset contains some of the true outliers. When columns x2 and y2 are used, M1 and S1 detected only case 1 as an outlier, whereas S2 and S3 succeeded in detecting the 7 cases as outliers correctly. Table 2 and Table 3 contain the subsets of outliers detected from the data of columns x2 and y2. The first column of the tables gives the size of the subsets of potential outliers, decreasing from half of the data size down the table, and the second column gives the size of the clean subsets. The last column shows the outcome of the outlier test, in which “Y” means significant. M1 and S1 show the same results because they form the same basic clean subset of size 12. They failed to find the true outliers because the basic clean subset contains some of the true outliers. S2 starts with a wrongly identified basic clean subset but chooses subsets of potential outliers that contain all of the true outliers in the early stages. It eventually fails to detect the true outliers, however, because swamping occurred at s = 18. By applying S3 to the data, S2 is executed and M1 is also executed with the basic clean subset $M_{14}$, where $M_{14}^c$ = {1 2 3 4 5 6 7 12 15 23 25}, because $M_{14}^c$ is quite different from $M_{13}^c$ = {1 8 9 10 13 15 16 17 21 22 24 25}. As seen in Table 3, M1 with the basic clean subset $M_{14}$ detects {1 2 3 4 5 6 7} as outliers; consequently, the final detection of outliers using S3 is {1 2 3 4 5 6 7}, the union of {1 2 3 4 5 6 7} and {1}.

A Monte Carlo experiment comparing the power of M1, S1, S2, and S3 was performed. We use the same outlier patterns and simulation scheme as Hadi and Simonoff (1993). The simulation results are limited in that the pattern of the predictor, the positions of the outliers, and the sample size are fixed. The data sets are generated from the model $Y_i = X_i + ɛ_i$, where $X_i \sim U(0, 15)$ and $ɛ_i \sim N(0, 1^2)$, except for the two models H3(s) and H3h2(s), which use $ɛ_i \sim N(0, 0.7^2)$. Outliers are planted at $X_i = 20 − 0.05(i − 1)$ for high leverage cases and at $X_i = −7.5 + 0.05(i − 1)$ for low leverage cases. The y values are given by $Y_i = X_i + 4$. Table 4 shows the simulation results.

The first column of Table 4 indicates the model: H and L refer to high and low leverage outliers respectively, a number indicates the number of outliers, and “s” indicates a model using $ɛ_i \sim N(0, 0.7^2)$. The notation used in the second column is defined as follows: p1 is the probability of identifying exactly the true outliers; p2 is the probability of identifying at least one true outlier; p3 is the probability of identifying a clean observation as an outlier. The measures related to the two negative effects of outliers are Pr(masking) = 1 − p2 and Pr(swamping) = p3.
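The probabilities p1, p2, and p3 are estimated from repeated simulation runs as simple frequencies. A sketch under our own naming, where each run is recorded as the set of declared outlier case indices:

```python
def summarize(runs, true_outliers, n):
    """Estimate p1, p2, p3 from a list of detected-outlier sets."""
    clean = set(range(n)) - true_outliers
    m = len(runs)
    p1 = sum(d == true_outliers for d in runs) / m       # exactly the true outliers
    p2 = sum(bool(d & true_outliers) for d in runs) / m  # at least one true outlier
    p3 = sum(bool(d & clean) for d in runs) / m          # some clean case declared
    return p1, p2, p3
```

Pr(masking) is then obtained as 1 − p2.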

The first row of Table 4 shows that all methods control the size of the Type I error at approximately 0.05. S1, S2, and S3 provided better overall performance than M1.

S3 gave the best results among the examined methods, with S2 performing similarly. The comparison of H3 and H3(s) shows that the difference in performance among M1, S1, S2, and S3 becomes larger when the ordinary observations are generated from a model with a small variance. The model H3h2(s) used a data set containing a group of 3 high leverage outliers and an additional group of 2 outliers positioned at the lower right corner. The y values of the two additional outliers are given by $Y_i = X_i − 4$ for $x_i = −5 + 0.05(i − 1)$. The results for model H3h2(s) show that the general patterns do not change when there is more than one group of outliers.

Hadi and Simonoff’s procedure and the suggested methods have low breakdown points as n and p increase. But as the number of outliers grows and n becomes smaller, we can expect the benefit of LQS residuals over LS residuals in selecting a clean set to increase. The proposed methods were also applied to data sets having multiple predictors; we summarize the results.

The modified wood gravity data set (Rousseeuw and Leroy, 1987) contains 20 observations on five predictors, including four outliers. M1, S1, S2, and S3 all succeed in identifying the true outliers. The Hawkins, Bradu, and Kass data set (Hawkins et al., 1984) consists of 75 observations on three predictors; the first ten observations are high leverage outliers. Hadi and Simonoff reported that their method succeeded in detecting cases 1–10 as outliers. But if the LMS approach is used in constructing a basic clean set, M1 identified cases 11, 12, 13, and 14 as outliers. S1 gave the same results, whereas S2 and S3 correctly detected the outliers.

3. Summary and concluding remarks

In this article we have proposed three procedures for the identification of outliers using k-LQS estimates. All three methods are related to the method proposed by Hadi and Simonoff (1993). The first method finds the basic clean subset using LQS estimates. The second adopts the k-LQS approach, forming a clean subset of size k for the detection of outliers of size (n − k). The third is designed to be more resistant to swamping or masking problems by combining the first and second methods. The procedures search for a fixed number of outliers using k-LQS estimates; however, they do not require presetting the number of outliers, which is determined by t-tests of the no-outlier hypothesis.

Compared with other methods, the LQS approach needs more computation. But in large regression problems a subsampling algorithm (Rousseeuw and Leroy, 1987) for approximating LQS estimates can be used to avoid excessive computation.

Acknowledgement
This paper was supported by Konkuk University in 2020.
Figures
Fig. 1. Scatter plot of x1 and y1 in Table 1.
TABLES

### Table 1

Artificial data

| No. | x1 | x2 | y1 | y2 | No. | x1 | x2 | y1 | y2 |
|----|-------|-------|--------|-------|----|------|------|-------|-------|
| 1 | −4.00 | −4.00 | 0.00 | 0.00 | 14 | 9.87 | 9.87 | 10.11 | 10.11 |
| 2 | 20.00 | 20.00 | 26.00 | 24.00 | 15 | 2.55 | 2.55 | 3.03 | 3.03 |
| 3 | 19.80 | 19.90 | 25.90 | 23.90 | 16 | 7.51 | 7.51 | 6.86 | 6.86 |
| 4 | 19.60 | 19.80 | 25.80 | 23.80 | 17 | 2.67 | 2.67 | 2.10 | 2.10 |
| 5 | −5.00 | −5.00 | −11.00 | −9.00 | 18 | 4.40 | 4.40 | 3.74 | 3.74 |
| 6 | −4.80 | −4.90 | −10.90 | −8.90 | 19 | 7.65 | 7.65 | 7.57 | 7.57 |
| 7 | −4.60 | −4.80 | −10.80 | −8.80 | 20 | 7.01 | 7.01 | 6.40 | 6.40 |
| 8 | 11.36 | 11.36 | 11.10 | 11.10 | 21 | 1.28 | 1.28 | 1.05 | 1.05 |
| 9 | 11.66 | 11.66 | 11.92 | 11.92 | 22 | 4.48 | 4.48 | 4.72 | 4.72 |
| 10 | 0.20 | 0.20 | −0.27 | −0.27 | 23 | 8.73 | 8.73 | 9.39 | 9.39 |
| 11 | 5.27 | 5.27 | 4.95 | 4.95 | 24 | 4.36 | 4.36 | 4.63 | 4.63 |
| 12 | 10.52 | 10.52 | 11.83 | 11.83 | 25 | 5.47 | 5.47 | 6.04 | 6.04 |
| 13 | 6.16 | 6.16 | 6.34 | 6.34 |  |  |  |  |  |

### Table 2

Results of M1, S1 and S2 for the dataset of x2 and y2 in Table 1

| m | s | M1, S1: subset of potential outliers ($M_s^c$) | Test | S2: subset of potential outliers ($M_s^c$) | Test |
|----|----|------------------------------------------|------|--------------------------------------|------|
| 12 | 13 | (1 8 9 10 14 15 16 17 21 22 24 25) | N | (1 8 9 10 13 15 16 17 21 22 24 25) | N |
| 11 | 14 | (1 8 9 10 15 16 17 21 22 24 25) | N | (1 2 3 4 5 6 7 12 15 23 25) | N |
| 10 | 15 | (1 8 9 10 15 17 21 22 24 25) | N | (1 2 3 4 5 6 7 12 15 23) | N |
| 9 | 16 | (1 8 9 10 15 17 21 22 24 25) | N | (1 2 3 4 5 6 7 12 16) | N |
| 8 | 17 | (1 8 9 10 15 21 22 24 25) | N | (1 2 3 4 5 6 7 12) | N |
| 7 | 18 | (1 8 9 10 15 21 22) | N | (1 8 9 10 15 21 24) | N |
| 6 | 19 | (1 8 9 10 15 21) | N | (1 8 9 10 15 21) | N |
| 5 | 20 | (1 8 10 15 21) | N | (1 8 10 15 21) | N |
| 4 | 21 | (1 8 15 21) | N | (1 8 10 15) | N |
| 3 | 22 | (1 8 15) | N | (1 8 15) | N |
| 2 | 23 | (1 8) | N | (1 8) | N |
| 1 | 24 | (1) | Y | (1) | Y |

### Table 3

Results of M1 using the clean subset $M_{14}$ of S2 in Table 2 as a basic clean subset

| m | s | Subset of potential outliers ($M_s^c$) | Test |
|----|----|------------------------------------|------|
| 11 | 14 | (1 2 3 4 5 6 7 12 15 23 25) | N |
| 10 | 15 | (1 2 3 4 5 6 7 12 15 23) | N |
| 9 | 16 | (1 2 3 4 5 6 7 12 25) | N |
| 8 | 17 | (1 2 3 4 5 6 7 12) | N |
| 7 | 18 | (1 2 3 4 5 6 7) | N |

### Table 4

Summary of Monte Carlo simulation

| Model | Prob. | S1 | S2 | S3 | M1 |
|---------|------|-------|-------|-------|-------|
| Null |  | 0.057 | 0.049 | 0.049 | 0.056 |
| H1 | p1 | 0.599 | 0.595 | 0.610 | 0.570 |
|  | p2 | 0.663 | 0.662 | 0.671 | 0.621 |
|  | p3 | 0.069 | 0.074 | 0.063 | 0.054 |
| L1 | p1 | 0.762 | 0.772 | 0.782 | 0.756 |
|  | p2 | 0.834 | 0.831 | 0.839 | 0.834 |
|  | p3 | 0.072 | 0.059 | 0.057 | 0.078 |
| H3 | p1 | 0.540 | 0.512 | 0.558 | 0.490 |
|  | p2 | 0.600 | 0.575 | 0.598 | 0.539 |
|  | p3 | 0.080 | 0.065 | 0.051 | 0.066 |
| L3 | p1 | 0.717 | 0.723 | 0.727 | 0.716 |
|  | p2 | 0.788 | 0.779 | 0.781 | 0.786 |
|  | p3 | 0.069 | 0.056 | 0.054 | 0.074 |
| H5 | p1 | 0.352 | 0.363 | 0.417 | 0.347 |
|  | p2 | 0.396 | 0.402 | 0.452 | 0.383 |
|  | p3 | 0.139 | 0.116 | 0.089 | 0.125 |
| L5 | p1 | 0.688 | 0.681 | 0.696 | 0.675 |
|  | p2 | 0.755 | 0.741 | 0.756 | 0.746 |
|  | p3 | 0.071 | 0.063 | 0.063 | 0.089 |
| H3(s) | p1 | 0.849 | 0.755 | 0.866 | 0.751 |
|  | p2 | 0.923 | 0.817 | 0.927 | 0.811 |
|  | p3 | 0.078 | 0.067 | 0.061 | 0.061 |
| H3h2(s) | p1 | 0.671 | 0.483 | 0.734 | 0.476 |
|  | p2 | 0.763 | 0.554 | 0.809 | 0.542 |
|  | p3 | 0.107 | 0.083 | 0.086 | 0.080 |

References
1. Atkinson AC (1994). Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association, 89, 1329-1339.
2. Carrizosa E and Plastria F (1995). The determination of a least quantile of squares regression line for all quantiles. Computational Statistics & Data Analysis, 20, 467-479.
3. Hadi AS and Simonoff JS (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272.
4. Hartigan JA (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76, 388-394.
5. Hawkins DM, Bradu D, and Kass GV (1984). Location of several outliers in multiple regression data using elemental sets. Technometrics, 26, 197-208.
6. Jajo NK (2005). A review of robust regression and diagnostic procedures in linear regression. Acta Mathematicae Applicatae Sinica, 21, 209-224.
7. Kianifard F and Swallow WH (1989). Using recursive residuals, calculated on adaptive-ordered observations, to identify outliers in linear regression. Biometrics, 45, 571-585.
8. Kianifard F and Swallow WH (1990). A Monte Carlo comparison of five procedures for identifying outliers in linear regression. Communications in Statistics - Theory and Methods, 19, 1913-1938.
9. Marasinghe MG (1985). A multistage procedure for detecting several outliers in linear regression. Technometrics, 27, 395-399.
10. Paul SR and Fung KY (1991). A generalized extreme studentized residual multiple-outlier-detection procedure in linear regression. Technometrics, 33, 339-348.
11. Pena D and Yohai VJ (1999). A fast procedure for outlier diagnostics in linear regression problems. Journal of the American Statistical Association, 94, 434-445.
12. Rosner B (1975). On the detection of many outliers. Technometrics, 17, 217-227.
13. Rousseeuw PJ (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-880.
14. Rousseeuw PJ and Leroy AM (1987). Robust Regression and Outlier Detection, New York, John Wiley & Sons.
15. Simonoff JS (1988). Detecting outlying cells in two-way contingency tables via backwards-stepping. Technometrics, 30, 339-345.
16. Stromberg AJ (1993). Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression. SIAM Journal on Scientific Computing, 14, 1289-1299.
17. Watson GA (1998). On computing the least quantile of squares estimate. SIAM Journal on Scientific Computing, 19, 1125-1138.
18. Yohai VJ (1987). High breakdown-point and high efficiency robust estimates for regression. Annals of Statistics, 15, 642-656.