We present a survey of contributions that have defined the nature and extent of robust statistics over the last 50 years. Starting from the pioneering work of Tukey, Huber, and Hampel on robust estimation of location parameters, we present various generalizations of these estimation procedures that cover a wide variety of models and data analysis methods. Among these extensions, we discuss linear models, clustered and dependent observations, time series data, binary and discrete data, models for spatial data, nonparametric methods, and forward search methods for outliers. We also describe current interest in robust statistics and conclude with suggestions on possible future directions of this area of statistical science.
A customer with a basket of commodities queues at the counter to pay. A bar code reader scans the profile of each product (color, length, size, volume, and price), and the same process follows for all other commodities in the basket. The customer then decides whether to pay using cash, credit card, debit card, cash coupons, or another instrument; presents a loyalty card (which contains information on typical market behavior over the past five years); and then presents a discount card (which also contains information on the activities that led to its award). The customer checks out, and the system links information from this transaction to other transactions in the customer's historical profile. The process is repeated for another customer. Compilation of this information leads to a database of humongous size: heterogeneous, drawn from complex sources, with unknown or purposive samples, self-selection bias, and other features.
Are the observations random, identically distributed, normally distributed? Are there enough observations to warrant the central limit theorem? Are there enough observations relative to the number of variables to ensure positive error degrees of freedom? These questions are asked of the kind of data extracted from huge compiled databases. Data analysis depends on the nature of the data available, on how well the classical assumptions hold, and on whether the regularity conditions of asymptotic theory fairly hold. What if the data come from a different family of distributions? What if extreme values, outliers, or small far-away clusters of interesting subjects are present? What if these atypical observations can serve as anchors of innovation, e.g., customers who exhibit unique behavior ideal for cross-selling a specific package of goods and services? Should one trade statistical optimality for computational efficiency, robustness, and interpretability of the resulting information?
This explosion of information could have been the stimulus for the pioneering work of Tukey, Huber, and Hampel more than five decades ago, when robust statistics was introduced. Tukey (1962) noted the unrealistically large data assumption of asymptotic theory, e.g., Mann and Wald (1942), which serves as the foundation of statistical inference. At a time when data were generated mostly under controlled conditions, Tukey identified the growth areas of data analysis (which he claimed to be much broader than statistical inference) to include the treatment of "outliers", "wild shots", "blunders", and "large deviations". He further provided guidelines on how the new data analysis could be initiated: seek out wholly new questions to be answered; tackle old problems in more realistic frameworks (with reference to regularity conditions in asymptotic theory); find unfamiliar summaries of observational material and establish their useful properties; and find, rather than evade, constraints. He further advised the use of mathematical argument and mathematical results as bases for judgment rather than as bases for proof or stamps of validity.
There have been several definitions of robustness of estimators. Hampel (1971) defined (qualitative) robustness as follows: a sequence of estimators $\{T_n\}$ is robust at a distribution $F$ if small perturbations of $F$ induce only small perturbations in the sampling distribution of $T_n$, uniformly in $n$; formally, for every $\varepsilon > 0$ there exists $\delta > 0$ such that, for all $n$ and all distributions $G$, $d_P(F, G) < \delta$ implies $d_P\big(\mathcal{L}_F(T_n), \mathcal{L}_G(T_n)\big) < \varepsilon$, where $d_P$ denotes the Prohorov distance and $\mathcal{L}_F(T_n)$ the distribution of $T_n$ under $F$.
There have been many survey papers on robust statistics. Huber (1972) summarized the accomplishments in robust estimation of parameters and substantially covered the main methods of robust estimation of location (M-estimators, L-estimators, and R-estimators). The survey also included related issues such as order-statistic estimation of scale parameters (by taking logarithms to convert scale estimation into a location problem) and multivariate location parameters. He also briefly discussed the implications for robust estimation in regression and the analysis of variance, and pointed to robust estimation for stationary time series data. Hampel (1973) also wrote a survey paper on robust estimation, but focused on stimulating discussion, especially in bridging understanding and cooperation between the pure mathematician and the data analyst.
A relatively more recent survey by Huber (2002) summarizes the contributions of Tukey to robust statistics. Less has been published on Tukey's contributions to robust statistics, since many of them appeared only in lectures, technical reports, and informal discussions with colleagues and students. His 1962 paper 'The Future of Data Analysis', of course, summarizes his thoughts on robust statistics. Huber (2002) presented Tukey's contributions as gleaned from the philosophical issues dominating his work on data analysis, with views often expressed in a balanced fashion. Huber (2002) noted that Tukey showed that inference in a sample-to-population sense is only part, not the whole, of statistics and data analysis. Tukey also encouraged simulation rather than rigorous proof, a principle that later fed the growth of computational statistics. Huber (2002) concluded that Tukey wanted the data analyst to be more of a scientist than a pure mathematician.
Some books cover a wide range of materials on robust statistics. Rieder (1996) edited a collection of various papers, while Hampel et al. (1986) provided a book-length treatment of the approach based on influence functions.
This paper is organized as follows: definitions and basic robust estimators of location are presented in the next section; various issues in robust modeling in Section 3; robustness in spatial analysis in Section 4; robustness and nonparametric statistics in Section 5; the forward search method in Section 6; applications in Section 7; and recent developments and future directions in Section 8.
Robustness can be defined either qualitatively or quantitatively. Let $x_1, x_2, \dots, x_n$ be observations from a distribution $F$ and let $T_n = T_n(x_1, \dots, x_n)$ denote a sequence of estimators; the definitions that follow are stated in terms of the behavior of $T_n$ under small departures from $F$.
Huber (1972) discussed three basic estimators of location: M-estimators, L-estimators, and R-estimators.
M-estimators (maximum likelihood-type)
Let $x_1, \dots, x_n$ be a sample from a distribution $F(x - \theta)$. An M-estimator $T_n$ is any solution of the minimization problem $\sum_{i=1}^{n} \rho(x_i - T_n) = \min$ or, when $\psi = \rho'$ exists, of the implicit equation $\sum_{i=1}^{n} \psi(x_i - T_n) = 0$. The choice $\rho(x) = -\log f(x)$, with $f$ the density of $F$, yields the maximum likelihood estimator; Huber's proposal takes $\rho$ quadratic in the center and linear in the tails.
L-Estimator (linear combination)
Let $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ denote the order statistics of the sample. An L-estimator is a linear combination $T_n = \sum_{i=1}^{n} a_{ni} x_{(i)}$ for fixed weights $a_{ni}$; the sample mean, the median, and the $\alpha$-trimmed means are special cases.
R-Estimators (rank-based)
Consider a two-sample rank test for shift applied to the sample $x_1, \dots, x_n$ and its mirror image $2T - x_1, \dots, 2T - x_n$. The R-estimator $T_n$ is the value of $T$ for which the test statistic comes closest to its value under no shift. The Hodges-Lehmann estimator $\operatorname{med}\{(x_i + x_j)/2\}$, derived from the Wilcoxon test, is the best-known example.
The extent of influence of aberrant observations on a statistic is measured through the influence curve (IC), which accounts for the scaled differential influence of one additional observation $x$ on a statistic viewed as a functional $T$ of the distribution $F$: $$IC(x; F, T) = \lim_{\varepsilon \downarrow 0} \frac{T\big((1-\varepsilon)F + \varepsilon \delta_x\big) - T(F)}{\varepsilon},$$ where $\delta_x$ is the point mass at $x$. For some common statistics the influence curves are as follows.
Mean: for $T(F) = \int y \, dF(y) = \mu$, $IC(x; F, T) = x - \mu$, which is unbounded in $x$.
Variance: for $T(F) = \int (y - \mu)^2 \, dF(y) = \sigma^2$, $IC(x; F, T) = (x - \mu)^2 - \sigma^2$, also unbounded.
M-estimator: $IC(x; F, T) = \psi\big(x - T(F)\big) \big/ \int \psi'\big(y - T(F)\big) \, dF(y)$, so a bounded $\psi$ function produces a bounded influence curve.
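To make the bounded-influence idea concrete, the following minimal Python sketch (our illustration, not code from any of the surveyed papers) computes a Huber M-estimate of location by iteratively reweighted averaging, using the median and the MAD as robust starting values; the capped weights are exactly the bounded $\psi$ at work.

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    # Solve sum(psi((x_i - T)/s)) = 0 with Huber's psi via
    # iteratively reweighted averaging; s is the (scaled) MAD.
    x = np.asarray(x, dtype=float)
    s = 1.4826 * np.median(np.abs(x - np.median(x)))  # robust scale
    T = np.median(x)                                  # robust start
    for _ in range(max_iter):
        u = (x - T) / s
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))  # Huber weights
        T_new = np.sum(w * x) / np.sum(w)
        if abs(T_new - T) < tol:
            break
        T = T_new
    return T

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(50, 1, 5)])  # 5% contamination
print(np.mean(x))          # dragged toward the outliers (around 2.5)
print(huber_location(x))   # stays near 0
```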
There are many themes among the various contributions to robust statistics. Some focus on outlier detection (single and multiple outliers). Once an outlier is detected, it is of interest to measure its influence on the statistic or on the method in general. Upon confirmation of its influence, the challenge is how to mitigate the influence of such observations on statistics, models, or methods.
Initially, robust methods focused on the estimation of location parameters, with Huber (1964) providing the early illustration of robust estimation of a location parameter. He developed a new approach to estimation under a contaminated normal distribution using the asymptotic theory of estimators. The estimator is a hybrid between the mean and the median and is asymptotically robust among all translation-invariant estimators. Sacks and Ylvisaker (1972) showed that the Huber estimator works for a more general class of symmetric distributions.
Many robust methods were subsequently introduced across various themes of data analysis, including variance estimation, hypothesis testing, outlier detection versus robust estimation of linear models, generalized linear models, time series models, and, more recently, high dimensional data. Another, more data-dependent approach to robustness, called the forward search algorithm, was introduced in Atkinson (1994).
Given the linear model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, $i = 1, \dots, n$, the ordinary least squares (OLS) estimator minimizes $\sum_{i=1}^{n} (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2$.
The estimate of $\boldsymbol{\beta}$ is optimal under the classical assumptions, but the quadratic loss allows a single aberrant observation to exert unbounded influence on the fit.
Huber (1973) generalized the estimator for location to the parameters of a regression model. The M-estimator of the regression coefficients, $\hat{\boldsymbol{\beta}}$, minimizes $\sum_{i=1}^{n} \rho(y_i - \mathbf{x}_i'\boldsymbol{\beta})$ or, equivalently, solves $\sum_{i=1}^{n} \psi(y_i - \mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i = \mathbf{0}$ for a suitably chosen $\rho$ with derivative $\psi$.
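Regression M-estimators of this kind are widely implemented; for instance, statsmodels fits Huber's $\rho$ by iteratively reweighted least squares. A small sketch on simulated data of our own choosing:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
y[:10] += 25                                           # contaminate a few responses

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                               # quadratic loss
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()   # Huber's rho via IRLS
print(ols.params)   # intercept inflated by the outliers
print(rlm.params)   # close to the true (2.0, 0.5)
```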
Contributions to robust estimation of the linear model vary in focus: peculiar behaviors of the model, variable selection, hypothesis testing, and hybrid procedures. We present these contributions to robust linear modeling in this section.
A linear model with variances parameterized by the mean and another parameter
The advantages of OLS cannot be discounted, but the impact of aberrant observations also needs to be mitigated. Using a connected double truncated gamma distribution for the error distribution of a linear model, Nassiri and Loris (2013) estimated a generalized quantile regression; the resulting estimates inherit the advantages of both OLS (a differentiable loss function) and quantile regression (robustness to outliers). This illustrates the potential gains from hybrids of two or more methods, which can benefit from the advantages carried by each individual method.
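To illustrate the robustness side of such hybrids, median (quantile) regression can be compared with OLS on contaminated data. The example below is our own toy comparison using statsmodels, not the estimator of Nassiri and Loris (2013).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[:15] += 40                                 # contaminated responses

X = sm.add_constant(x)
print(sm.OLS(y, X).fit().params)             # distorted by the contamination
print(sm.QuantReg(y, X).fit(q=0.5).params)   # median regression resists it
```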
A linear model assumes that the covariates are predetermined. If instead measurement errors are present in the covariate data, OLS is known to be inconsistent, and even some robust estimates can suffer, e.g., from bias. Wei and Carroll (2009) proposed to correct measurement error in the covariates of a quantile regression model through an EM-type estimation algorithm that jointly estimates the equations for all quantile levels.
Variable selection is another big topic in modeling, especially for linear models. Using the least absolute shrinkage and selection operator (LASSO), Li and Zhu (2008) illustrated the advantage of simultaneously controlling the variance of the fitted coefficients and performing automatic variable selection. An efficient algorithm that computes the entire solution path was provided, along with a method for selecting the regularization (penalty) parameter. Alhamzawi (2015) proposed a variable selection method for quantile regression, also based on the LASSO penalty. Simulated and actual data sets provide empirical evidence of the advantage of the method over other procedures.
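The sketch below illustrates LASSO-type automatic variable selection generically, using scikit-learn with the penalty chosen by cross-validation; it shows the mechanics only and does not reproduce the solution-path algorithm of Li and Zhu (2008) or the quantile LASSO of Alhamzawi (2015).

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]        # only three active covariates
y = X @ beta + rng.normal(size=n)

# Cross-validation selects the regularization (penalty) parameter on a path.
fit = LassoCV(cv=5).fit(X, y)
print(fit.alpha_)                   # selected penalty
print(np.nonzero(fit.coef_)[0])     # indices of the selected variables
```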
Furno (2004) developed a test for conditional heteroskedasticity in quantile regression in which neither the distributional assumption on the error term nor the pattern characterizing the heteroskedasticity needs to be specified. A lack-of-fit test for quantile regression (linear or nonlinear) was developed by He and Zhu (2003). The test does not involve nonparametric smoothing, yet it is consistent against all nonparametric alternatives without any moment conditions on the regression error.
There are many other stylized facts about the linear model that have been incorporated into robust estimation procedures. Many other tests have also been proposed, with themes that revolve around nonparametric approaches and asymptotic methods.
Clustering and (cross-sectional) dependence of observations often occur together, leading to extreme values among a few members of a cluster, among members of some clusters, or across all observations in a few clusters. This typically occurs during epidemics, where the response variable is a count of the number of infected individuals.
Given independent but not identically distributed data, e.g., clustered observations, Beran (1982) developed an asymptotically robust estimator within a reasonable choice of contamination neighborhood, defined as the region where the parametric model does not hold. Moscone and Tosetti (2015) proposed an estimator of the covariance matrix of the fixed-effects and mean-group estimators in panel data that are cross-sectionally dependent and serially correlated, possibly as a result of clustering.
Heteroskedastic linear models are postulated as a result of clustering at the least, and of a general dependence structure in the worst case. For testing hypotheses on heteroskedastic linear models, Zhao and Wang (2009) developed three classes of testing procedures (Wald-type, score-type, and drop-in dispersion tests). As a robustness criterion, the maximum asymptotic bias of the level of the test for distributions in a shrinking contamination neighborhood is used, and the most-efficient robust test is derived.
Suppose that data are naturally distributed into groups and that a fixed group-effect regression is postulated; Pérez
A semiparametric approach using empirical likelihood was adopted by Kim and Yang (2011) in estimating a random-effects quantile regression for clustered data. Assuming independence between the common mean of the random regression coefficients (which corresponds to the population-average effects of the explanatory variables on the conditional quantile) and the random coefficients representing cluster-specific deviations in the covariate effects, the estimator of the population-level parameter is asymptotically normal, and the estimators of the random coefficients are shrunk toward the population-level parameter in the first-order asymptotic sense.
He
In lieu of robust methods, Field
The autocorrelation structure of time series models is a source of vulnerability to outliers. In autoregressive models, for instance, Campano and Barrios (2011) noted that parameter estimates based on conditional least squares are affected by outliers occurring among the more recent observations.
There are recommendations, though, on how to robustify estimation of time series models. For example, de Luna and Genton (2001) proposed a robust estimation method for ARMA models. The observed time series is used to estimate an auxiliary model; data are then simulated from an ARMA model whose parameters are chosen so that the observation-based and simulation-based auxiliary parameters are comparable. The influence of outlying observations on the parameters is properly mitigated, and robustness is achieved because the simulation-based estimation procedure inherits the properties of the auxiliary model estimator.
Chang
While the influence of outliers on estimation needs to be addressed, similar concerns about outliers are just as important in hypothesis testing. Xiao (2012) presented a robust inference strategy for unit root and cointegration models, e.g., testing for cointegration based on the residuals of models whose parameters are estimated through M-estimators. The test is fairly robust even when the normality assumption does not hold.
In binary or discrete response models, the likelihood function is often at risk of separation when the data are unbalanced across the different values of the response or when extreme values are present. Methods that mitigate the impact of the nature of binary/discrete data are available through sample selection (see, for example, Santos and Barrios (2015)) or through robust estimation, some of which are presented next.
Given continuous and binary regressors, Hubert and Rousseeuw (1997) provided a method for down-weighting leverage points by computing robust distances in the space of the continuous regressors. They compute a weighted least absolute deviations (LAD) fit as a function of both the continuous and binary regressors, with derived robust estimates of the error scale.
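A rough sketch in the spirit of this proposal (not their exact estimator, which also derives a robust error scale): compute robust distances on the continuous regressors from a minimum covariance determinant fit, convert them to weights, and solve the weighted LAD problem as a linear program. All tuning choices below are our own.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def weighted_lad(X, y, w):
    # Weighted LAD as a linear program: minimize sum_i w_i |y_i - x_i' beta|,
    # writing each residual as u_i - v_i with u_i, v_i >= 0.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), w, w])        # beta costs nothing; u, v carry w
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])   # X beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

rng = np.random.default_rng(3)
n = 150
xc = rng.normal(size=(n, 2))                        # continuous regressors
xc[:5] += 8                                         # leverage points
xb = rng.integers(0, 2, size=(n, 1)).astype(float)  # binary regressor
y = 1 + xc @ np.array([1.0, -1.0]) + 0.5 * xb[:, 0] + rng.normal(scale=0.5, size=n)

# Robust distances in the space of the continuous regressors only.
mcd = MinCovDet(random_state=0).fit(xc)
rd2 = mcd.mahalanobis(xc)                           # squared robust distances
cutoff = chi2.ppf(0.975, df=xc.shape[1])
w = np.minimum(1.0, cutoff / rd2)                   # down-weight leverage points

X = np.column_stack([np.ones(n), xc, xb])
print(weighted_lad(X, y, w))
```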
Heteroskedasticity and contaminated data (outliers) can easily influence the resulting estimates since the MLE is based on the quadratic norm. Čížek (2008) developed a robust estimation method for binary-choice regression through a maximum symmetrically trimmed likelihood estimator (MSTLE) and designed a parameter-free adaptive procedure for choosing the amount of trimming. The adaptive MSTLE preserves the robust properties of the original MSTLE, significantly improves its finite-sample behavior, and ensures the asymptotic equivalence of the MSTLE and the maximum likelihood estimator when there is no contamination.
Using the quasi-likelihood of a generalized linear model, Cantoni and Ronchetti (2001) defined a robust deviance measure that can be used for model selection within the classical modeling framework. The asymptotic distributions of the robust deviance-based tests were derived, along with the stability of the tests under contamination.
There are many themes in robust statistical analysis besides location estimation, model estimation, and hypothesis testing. For example, models that account for spatial interdependencies of adjacent units are estimated in spatial analysis. The variogram has been used to quantify such dependencies and is a crucial input in estimating spatial models. Estimating the variogram has always been a challenge because data points are limited, measurements being too costly, especially in applications in mining, geology, and the other geosciences. When outliers are present in an already very small dataset, estimation of spatial models becomes extremely challenging. Applying a fourth-root transformation to the data, Cressie and Hawkins (1980) assumed a normal-like central region and possibly heavy tails in the periphery in constructing a stable and robust estimator of the variogram.
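The Cressie-Hawkins estimator averages square-root absolute differences before raising the average to the fourth power, which tempers the effect of outlying pairs. A compact Python sketch (the distance binning choices are ours) follows.

```python
import numpy as np

def cressie_hawkins_variogram(coords, z, lags, tol):
    # Cressie-Hawkins robust estimator:
    # 2*gamma(h) = (mean |Z(s_i)-Z(s_j)|^(1/2))^4 / (0.457 + 0.494/N(h))
    diff = np.abs(z[:, None] - z[None, :])
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(z), k=1)       # each pair counted once
    d, r = dist[iu], np.sqrt(diff[iu])      # |Z_i - Z_j|^(1/2)
    gamma = []
    for h in lags:
        sel = np.abs(d - h) <= tol          # pairs whose distance is near lag h
        n_h = sel.sum()
        if n_h == 0:
            gamma.append(np.nan)
            continue
        two_gamma = r[sel].mean() ** 4 / (0.457 + 0.494 / n_h)
        gamma.append(0.5 * two_gamma)       # semivariogram
    return np.array(gamma)

rng = np.random.default_rng(4)
coords = rng.uniform(0, 10, size=(80, 2))
z = np.sin(coords[:, 0]) + rng.normal(scale=0.2, size=80)
z[:3] += 10                                 # a few outliers
print(cressie_hawkins_variogram(coords, z, lags=np.arange(1, 6), tol=0.5))
```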
Spatial autoregressive (SAR) models, a special parametrization of spatial models, generally yield inconsistent estimates when the error term exhibits heteroskedasticity of unknown form. Doğan and Taşpınar (2014) developed robust estimators based on the generalized method of moments (GMM) and Bayesian Markov chain Monte Carlo (MCMC) frameworks. Monte Carlo simulations indicate greater bias in the spatial autoregressive parameter when there is negative spatial dependence in the model; the finite-sample efficiency of MCMC, however, is better than that of the robust GMM estimator.
The natural linkage between robust and nonparametric methods is clearly established in the literature. In fact, Hettmansperger and McKean (1988) summarized these methods in a book on robust nonparametric statistical methods.
Combining kernel methods for density estimation with robust estimation of location, Härdle (1984) developed estimators that are weakly and strongly consistent and asymptotically normal under mild conditions on the kernel sequence. The estimator is also minimax robust in the sense of Huber (1964).
Reviewing various robust nonparametric methods, Hettmansperger
Truncated and censored regressions estimated via MLE are naturally sensitive to data contamination such as outliers, since the MLE is based on the quadratic norm. Čížek (2012) therefore proposed a semiparametric general trimmed estimator (GTE), which is imprecise but robust. It is further improved through data-adaptive and one-step trimmed estimators such as the one-step symmetrically censored least squares (SCLS). The improved estimator is as robust as the GTE and asymptotically equivalent to the original estimator (e.g., SCLS).
Hastie and Tibshirani (1990) introduced generalized additive models; consequently, many variants have been introduced, such as the combination of parametric and nonparametric components into a semiparametric model. Hoshino (2014) introduced a partially linear additive quantile regression model in which the conditional quantile function comprises a linear parametric component and a nonparametric additive component, and proposed a two-step estimation approach. The first step approximates the conditional quantile function using a series estimation method. The second step estimates the nonparametric additive component using either a local polynomial estimator or a weighted Nadaraya-Watson estimator. Consistency and asymptotic normality were established.
Wong
From a different framework, forward search methods diagnose the presence of outliers through a highly data-dependent algorithm. The nature of the existing data dictates how the search progresses. The algorithm starts with the selection of a 'clean' set of observations, a subset of the original dataset. A criterion is used to ensure that the initial dataset does not contain observations that may influence the estimates of the parameters or the method of analysis to be conducted. The initial data and the corresponding estimates or models then serve as a benchmark on which the decision of whether to include the next observation in the dataset is based. An observation outside the 'clean' set is included if, among the excluded observations, it is closest to the fitted model (e.g., it has the smallest residual or distance), and the monitored statistics are updated at each step.
There are issues associated with implementing the forward search, e.g., the selection of the initial dataset, the criterion used to decide whether an observation is allowed to enter the dataset, and the stopping rule, i.e., how to measure the influence of observations that are included in the dataset.
Prior to Atkinson (1994), who formally introduced the forward search, similar data-dependent robust methods were introduced in which the 'clean' dataset is identified via random searches. Atkinson (1994) considered a few repeats of a simple forward search from random starting points in order to reveal masked multiple outliers. The method generates observations that suffice to produce robust parameter estimates. Atkinson (1994) also suggested that parallel computing provides an appreciable reduction in computational time for large datasets. Since then, various contributions that modify the forward search to address specific estimation or modeling problems have appeared in the literature.
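A minimal sketch of a forward search for regression follows, under our own simplifications: a least-median-of-squares criterion over random subsets picks the 'clean' start, the subset grows by keeping the observations with the smallest squared residuals, and the smallest residual outside the subset is monitored (published versions monitor deletion residuals and related quantities).

```python
import numpy as np

def forward_search(X, y, n_start=500, seed=0):
    # Forward search for regression (in the spirit of Atkinson, 1994).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best, best_crit = None, np.inf
    for _ in range(n_start):                  # random search for a 'clean' start
        idx = rng.choice(n, size=p + 1, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        crit = np.median((y - X @ beta) ** 2)  # LMS criterion on all observations
        if crit < best_crit:
            best, best_crit = idx, crit
    subset, path = best, []
    for m in range(p + 1, n + 1):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        r2 = (y - X @ beta) ** 2
        order = np.argsort(r2)
        subset = order[:m]                     # next subset: m smallest residuals
        outside = order[m:]
        path.append((m, np.sqrt(r2[outside].min()) if outside.size else 0.0))
    return path                                # sharp jumps flag outliers entering late

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[-6:] += 15                                   # masked cluster of outliers
for m, stat in forward_search(X, y)[-10:]:
    print(m, round(stat, 2))                   # statistic jumps as outliers join
```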
Atkinson and Cheng (2000) extended a very robust regression method for missing data to data with several outliers, i.e., outliers are imputed as if they were missing. The method, a hybrid of the forward search and the EM algorithm, was able to generate robust estimates in the presence of outliers.
The forward search was also used by Bertaccini and Varriale (2007) to detect atypical observations and analyze their effect within the ANOVA framework. In clustering multivariate normal data, Atkinson and Riani (2007a) introduced a forward search method based on the Mahalanobis distances of the observations. A new forward plot was introduced, providing an efficient monitoring device for keeping track of the robustness of the resulting clusters.
Atkinson and Riani (2007b) also introduced the forward search for variable selection in regression modeling. The forward search is embedded in the backward elimination procedure (variable selection) while monitoring the influence of observations that might exhibit aberrant behavior localized to a particular variable.
Highlighting the importance of forward plots in visualizing the inferential contribution of each observation, Atkinson (2009) introduced the forward search into the robust analysis of econometric data. Mavridis and Moustaki (2009) implemented a forward search algorithm for identifying atypical subjects/observations in factor analysis models for binary data.
There are many works that use the forward search algorithm either alone or in tandem with another method to estimate various models. Riani (2004) modified the forward search for the analysis of structural time series data. The forward plots provided a good monitoring tool, showing how the forward search can detect the main underlying features of a time series with masked multiple outliers, level shifts, or transitory changes.
Time series and spatio-temporal models are often characterized by temporary structural change that can occur when confounding events cause the random shocks to behave wildly. While temporary, such change can influence the estimates of models that fail to track the behavior of the data-generating process under 'normal' circumstances. Campano and Barrios (2011) proposed an estimation procedure based on a hybrid of block bootstrap and a modified forward search algorithm. The procedure yields robust estimates of a time series model whose data exhibit temporary structural change, provided the time series is fairly long. Similarly, Bastero and Barrios (2011) postulated a spatio-temporal model and estimated it using a procedure that infuses the forward search algorithm and maximum likelihood estimation into the backfitting framework. The forward search algorithm filters out the effect of temporary structural change in the estimation of the covariate and spatial parameters.
While many results in robust statistics were generated in the course of solving specific real-life problems, we discuss some applications of robust statistics in various areas in this section.
To measure effective dose (ED), estimators based on parametric models (e.g., probit models) are generally efficient but not robust to model misspecification, while nonparametric estimators are robust to model misspecification but less efficient. Karunamuni
Dang
Stimulated by irregular return patterns (e.g., volatility) in financial assets, Huang (2012) proposed using a uniformly spaced series of regression quantiles in volatility forecasting. For six of seven stock indices with 16 years of daily return data, this yielded better volatility forecasts than existing methods. However, Gaglianone
For remote sensing data, Cao and Wang (2015) developed a probabilistic robust learning algorithm for neural networks with random weights (NNRWs) to improve modeling performance.
Kitromilidou and Fokianos (2015) were interested in robust estimation of a log-linear Poisson model for count time series analysis. The MLE was robustified through the inclusion of intervention terms in the model, which was then estimated with a bounded-influence estimator.
In image interpolation, Huang and Siu (2015) proposed a robust and precise k-nearest-neighbors (k-NN) searching scheme to form an accurate AR model of the local statistics. They make use of both low-resolution (LR) and high-resolution (HR) information obtained from large amounts of training data to form a coherent soft-decision estimation of both the AR parameters and the high-resolution pixels.
There are several contributions on robustness in the context of high dimensional data (HDD). For high-breakdown-point estimates of multivariate location and scatter, computing time increases rapidly with the number of variables, which is impractical for high dimensional data (Maronna and Zamar, 2002). They proposed an estimator based on a modified Gnanadesikan-Kettenring robust covariance estimate that is much faster, especially in large dimensions. Wang
Where is robust statistics going? As Prof. Huber himself noted during the 58
There will be more heterogeneity, more complicated data structures, and space-time dependency as the size of the data becomes really large; aberrant observations may no longer be rare but non-ignorable in number. The role of additive models and bootstrap methods in better understanding the data-generating process is indispensable. Ironically, as data become very large, analysis will gradually become less dependent on asymptotic theory, and methods will depend more on empirical evidence instead.