Gaussian error distributions are a common choice for the maximum likelihood (ML) method in traditional regression models. However, this distributional assumption is often questionable, especially when the error distribution is skewed or heavy tailed. In both cases, the ML method under normality can break down or lose efficiency. In this paper, we consider log-concave and Gaussian scale mixture distributions for the errors. For log-concave errors, we propose to use a smoothed maximum likelihood estimator for stable and faster computation. Based on this, we perform comparative simulation studies of the coefficient estimates under normal, Gaussian scale mixture, and log-concave errors. In addition, we present real data analyses using the Stack Loss Plant data and the Korean Labor and Income Panel Study data.
In the traditional regression model
In the ML framework, if the error distribution is assumed to be a family of heavy tailed distributions, the MLE of the regression parameters is known to be robust (Lange
A skewed error distribution may seem to be of little interest in its own right, since a suitable transformation might make the error distribution symmetric. However, such a transformation is not always available, and transforming the response can blur the statistical inference for the mean function. In addition, in some cases the error distribution itself is of great interest. For example, the quality of a Value-at-Risk estimator in a time series model is severely affected by the quality of the estimated error distribution.
We consider a family of log-concave distributions for the error distribution because of its many attractive features. It contains many well-known parametric distributions such as the normal, Laplace, and logistic distributions, as well as the gamma distribution with shape parameter at least one. It is also known that marginal distributions, convolutions, and product measures of log-concave distributions preserve log-concavity; see Dharmadhikari and Joag-Dev (1988). Dümbgen
In this paper, we investigate the performance of estimators under normal, log-concave, and Gaussian scale mixture errors with some real data examples. In addition, we propose the use of a smoothed log-concave distribution to estimate the regression parameters, which makes the computation more reliable and faster than using the raw estimator. We also derive an iterative reweighted least squares formula based on the smoothed log-concave density. This paper is organized as follows. In Section 2, we review some literature on continuous Gaussian scale mixtures and log-concave densities, together with their computational aspects. In Section 3, we propose an estimation method in which the error distribution is assumed to be log-concave. In Section 4, we adopt the smoothed log-concave MLE for the inference of regression parameters. Numerical studies, including real data analyses, are given in Section 5. We end the paper with some concluding remarks in Section 6.
A family of Gaussian scale mixture densities is given as

f(x) = ∫₀^∞ φ_σ(x) dG(σ),

where φ_σ denotes the N(0, σ²) density and G is a mixing distribution on (0, ∞).
The advantage of this family is that a smooth density estimator is obtained without any tuning parameter. If we assume that the error distribution in the regression analysis belongs to this family, the MLE of the regression parameters is known to be robust to outliers (Seo
The existence and uniqueness of the MLE of
If
Based on this observation, the algorithm finds
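As a concrete illustration (not part of the original derivation), the Student t distribution is itself a Gaussian scale mixture: if σ² = ν/W with W ~ χ²_ν, then averaging N(0, σ²) densities recovers the t_ν density. A short numerical check, with illustrative choices of ν and the number of mixing draws:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu = 3.0        # degrees of freedom (illustrative choice)
m = 200_000     # number of Monte Carlo mixing draws

# Scale mixture representation: sigma^2 = nu / W with W ~ chi2(nu)
# makes the average of N(0, sigma^2) densities equal the t_nu density.
sigma2 = nu / rng.chisquare(nu, size=m)

x = np.linspace(-6, 6, 25)
mixture_pdf = stats.norm.pdf(x[:, None], scale=np.sqrt(sigma2)).mean(axis=1)
t_pdf = stats.t.pdf(x, df=nu)

# The two curves agree up to Monte Carlo error.
print(np.max(np.abs(mixture_pdf - t_pdf)))
```

With a continuous mixing distribution G, the same averaging yields a smooth density without any bandwidth choice, which is the point made above.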
A family of log-concave densities can be expressed as

f(x) = exp{ψ(x)},

where ψ is a concave function.
Walther (2009) showed that the NPMLE of a log-concave density exists and explained how one can develop an algorithm to compute it. Suppose that
where is the empirical cumulative distribution function (CDF) based on the sample. Since
see Silverman (1982, Theorem 3.1). It was proven that the NPMLE
That is,
where
That is, we can turn our estimation problem into a linearly constrained optimization problem. There are several algorithms to solve this linearly constrained optimization problem such as the Iterative Convex Minorant Algorithm (ICMA) and the active set algorithm. It was shown that the active set algorithm is generally more efficient than other existing algorithms, see Rufibach (2007).
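To make the log-concavity constraint concrete, one can check the sign of second differences of a log-density on a grid: they are nonpositive everywhere for a log-concave density such as the normal, but become positive in the tails of the t distribution, which is not log-concave. A small numerical sketch:

```python
import numpy as np
from scipy import stats

x = np.linspace(-10, 10, 2001)  # evaluation grid (illustrative)

def second_diff(logpdf):
    # Discrete analogue of the second derivative of the log-density.
    return np.diff(logpdf, n=2)

norm_curv = second_diff(stats.norm.logpdf(x))
t_curv = second_diff(stats.t.logpdf(x, df=3))

print(norm_curv.max() <= 1e-12)  # normal: concave log-density everywhere
print(t_curv.max() > 0)          # t(3): log-density convex in the tails
```

This is why heavy-tailed t errors are covered by the Gaussian scale mixture family of Section 2 but not by the log-concave family.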
In the linear regression model,
where
There is no direct way to find the maximizers
(a) Determine
(b) Adjust the parameter
(c) Determine
In (a), if
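The alternating structure above can be sketched generically. The skeleton below is a minimal sketch, not the paper's estimator: the density-update and coefficient-update steps are user-supplied placeholders, and the toy instantiation simply assumes Gaussian errors, for which the coefficient step reduces to ordinary least squares.

```python
import numpy as np

def fit_alternating(X, y, fit_log_density, update_beta, n_iter=20, tol=1e-8):
    """Generic alternating scheme: (a) given beta, estimate the error
    log-density from residuals; (b) given the log-density, update beta;
    repeat until the coefficients stabilize."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting value
    for _ in range(n_iter):
        resid = y - X @ beta
        logf = fit_log_density(resid)
        beta_new = update_beta(X, y, logf, beta)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy instantiation (an assumption for illustration): Gaussian errors,
# where the density step fits a location-scale and the beta step is OLS.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

gauss_logf = lambda r: (r.mean(), r.std())
ols_step = lambda X, y, logf, beta: np.linalg.lstsq(X, y, rcond=None)[0]
beta_hat = fit_alternating(X, y, gauss_logf, ols_step)
print(beta_hat)  # close to (1, 2)
```

In the actual method, `fit_log_density` would be the log-concave NPMLE of the residuals and `update_beta` the maximization described in steps (a)–(c).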
As mentioned in Section 3, a characteristic feature of the MLEs of log-concave densities is that they are not smooth and not differentiable at each knot. Dümbgen
where
for some real numbers
where Φ(·) is the CDF of the standard normal distribution.
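The role of Φ can be checked numerically: on each knot interval the log-concave MLE has the form exp(a + bt), and convolving such a piece with an N(0, h²) kernel has a closed form involving Φ. A sketch with illustrative constants a, b, knot interval [c, d], bandwidth h, and evaluation point y (all assumed values):

```python
import numpy as np
from scipy import integrate, stats

a, b, c, d, h, y = 0.2, -0.7, -1.0, 2.0, 0.5, 0.3  # illustrative values

def piece(t):
    # exp-linear piece of the log-concave MLE times the N(0, h^2) kernel
    return np.exp(a + b * t) * stats.norm.pdf(y - t, scale=h)

numeric, _ = integrate.quad(piece, c, d)

# Closed form: completing the square in the Gaussian kernel gives
# exp(a + b*y + b^2 h^2 / 2) * [Phi((d-y)/h - b*h) - Phi((c-y)/h - b*h)].
closed = np.exp(a + b * y + 0.5 * b**2 * h**2) * (
    stats.norm.cdf((d - y) / h - b * h) - stats.norm.cdf((c - y) / h - b * h)
)
print(abs(numeric - closed))  # agreement up to quadrature precision
```

Summing these closed-form pieces over the knot intervals gives a smooth, everywhere-differentiable density, which is what makes the estimator convenient for gradient-based computation.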
Now, for our regression problem, we first compute the residuals
where
Further, if we denote
and
which is the form of iterative reweighted least squares.
We summarize the above as:
(1) For given
(2) Compute
(3) For given
Repeat (1)–(3) until a stopping rule is satisfied.
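To illustrate the reweighted-least-squares form, here is a generic iterative reweighted least squares sketch for a smooth log-concave error density. The logistic density is used as a stand-in (an assumption for illustration, not the paper's smoothed log-concave estimator); its score ψ(r) = tanh(r/2) gives the weight w(r) = tanh(r/2)/r.

```python
import numpy as np

def irls(X, y, weight_fn, n_iter=100, tol=1e-10):
    """Iteratively reweighted least squares for a location regression
    M-estimator with weights w(r) = psi(r)/r (a generic sketch)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting value
    for _ in range(n_iter):
        r = y - X @ beta
        w = weight_fn(r)
        # Weighted normal equations: (X'WX) beta = X'Wy
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Logistic errors: log f(r) = -r - 2 log(1 + exp(-r)), psi(r) = tanh(r/2).
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
y = X @ beta_true + rng.logistic(size=n)

w_logistic = lambda r: np.where(np.abs(r) < 1e-8, 0.5, np.tanh(r / 2) / r)
beta_hat = irls(X, y, w_logistic)
print(beta_hat)  # near (0.5, -1.0)
```

In the proposed method, the weights would instead come from the smoothed log-concave density estimated in step (1), but the update has the same weighted least squares form.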
In this section, we conduct Monte Carlo simulation studies to compare the performance of the different methods. We generate 200 simulated samples from the model
When the true error distribution is
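A minimal version of such a Monte Carlo comparison can be sketched as follows; the design below (coefficients, covariates, and the normal and t(3) error choices) is illustrative rather than the paper's exact setup, and only OLS is fitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 250, 200  # sample size and number of replications (illustrative)
beta_true = np.array([1.0, 2.0, -1.0])

def one_rep(err_sampler):
    # One simulated dataset and its OLS fit.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ beta_true + err_sampler(n)
    return np.linalg.lstsq(X, y, rcond=None)[0]

for name, sampler in [("normal", rng.standard_normal),
                      ("t(3)", lambda m: rng.standard_t(3, size=m))]:
    est = np.array([one_rep(sampler) for _ in range(reps)])
    mse = np.mean((est - beta_true) ** 2, axis=0)
    print(name, np.round(100 * mse, 2))  # empirical MSE x 100, as in Table 1
```

In the actual study, SMMLE and LCMLE fits would be added alongside OLS for each replication, producing rows like those of Table 1.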
For the first real data example, we consider the Stack Loss Plant data (Brownlee, 1960). This dataset has been used widely in the robust regression literature because it contains some severe outliers. Bellio and Ventura (2005) identified observations 1, 3, 4, and 21 as those outliers. The dataset contains four variables (Air Flow, Water Temp, Acid Conc, and Stack Loss). For this dataset, we consider a linear regression model with Stack Loss as the response and the other variables as covariates.
For model fitting, we use OLS, SMMLE, and LCMLE. Tables 2 and 3 show the estimated regression coefficients and the corresponding standard errors (in parentheses) for the data excluding outliers and for the original data, respectively. The standard errors of SMMLE and LCMLE were obtained by the bootstrap. As OLS is well known to be very sensitive to outliers, it shows large differences between the estimates in the two cases. LCMLE also shows large differences in the regression parameters: unlike SMMLE, LCMLE is not robust to severe outliers in a small sample. In particular, when the outliers are highly skewed, the estimated log-concave density also becomes skewed, which results in non-robust regression parameter estimates. Figure 2 shows the histograms of residuals and the corresponding estimated error distributions for each method. In this figure, the solid line represents the error distribution estimated from the original data and the dashed line the one estimated from the data without outliers. Both OLS and LCMLE show large differences between the two estimated distributions, whereas SMMLE shows only a small difference. That is, outliers have a great influence on the estimation of both the parameters and the error distribution under OLS and LCMLE.
Second, we consider the 19th wave of the Korean Labor & Income Panel Study (KLIPS) data. KLIPS is a longitudinal survey of the labor market and income activities of households and individuals residing in urban areas. With the KLIPS data, we analyze the effect of personal characteristics on income. We consider a linear regression model with pre-tax annual income as the response and gender, age, educational background, and residence area as covariates; the covariates were selected by referring to previous research on the determinants of income. The age variable was divided into five categories: under 30, 30 to 40, 40 to 50, 50 to 60, and over 60. That is, there are seven independent variables, including four dummy variables.
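The dummy coding of the age bands can be reproduced with pandas; the age values below are hypothetical, and dropping a reference category turns the five bands into four dummy variables.

```python
import pandas as pd

# Hypothetical ages; the paper's five bands become four dummies once a
# reference category is dropped.
age = pd.Series([25, 34, 47, 55, 63, 41])
bands = pd.cut(age, bins=[0, 30, 40, 50, 60, 120],
               labels=["<30", "30-40", "40-50", "50-60", "60+"])
dummies = pd.get_dummies(bands, prefix="age", drop_first=True)
print(dummies.shape[1])  # 4 dummy columns for 5 categories
```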
Figure 3 shows that the distribution of income in the KLIPS data is severely skewed, as income distributions are generally known to be. The left panel is the histogram of the original income and the right panel is the histogram of log-transformed income. We use the log-transformed income in our model, as is common practice for skewed data. Table 4 shows the estimated regression coefficients and the corresponding standard errors (in parentheses). The standard errors of LCMLE are smaller than those of any other method. Figure 4 shows the histograms of residuals and the corresponding estimated error distributions for each method. Under OLS, there is a large gap between the estimated error distribution and the histogram of residuals. Under SMMLE, the gap is reduced but still noticeable. Under LCMLE, however, they match very well. Table 5 shows 95% bootstrap confidence intervals for the parameters calculated for each method. The confidence intervals obtained from LCMLE tend to be shorter than the others.
A regression model based on Gaussian scale mixture errors has comparable or superior performance to other robust regression estimators. One potential limitation of this model is that the estimation may be unreliable when the true error distribution is not symmetric. The family of log-concave densities contains many skewed distributions, so a regression model based on log-concave errors can estimate the regression parameters flexibly even when the true error distribution is skewed. In this paper, we studied the estimation of regression parameters and error distributions under Gaussian scale mixture and log-concave densities, and compared them using numerical examples.
The estimation with log-concave densities can be conducted by a three-step alternating algorithm. In the first step, we proposed finding the MLEs of the regression parameters with a smoothed version of the log-concave MLE, which yields an iterative reweighted least squares expression. We find that the proposed method is stable and efficient for estimating regression coefficients in multiple linear regression, even with a large sample size.
Simulation results show that the estimation with log-concave densities is as good as the other methods in the normal and heavy-tailed cases, and it has a remarkable performance in the skewed case. However, the estimator under log-concave errors is sensitive to severe outliers in a small sample. Our proposed method can still be robust when the outliers occur in a symmetric fashion; however, when there are only extremely large (or small) outliers, the proposed method produces highly skewed log-concave density estimates, which is why the proposed estimator is not robust in general. On the other hand, since the SMMLE assumes that the error distribution is symmetric, its density estimate is not heavily affected by skewed outliers regardless of whether the outliers are large or small; in this case, the SMMLE produces a symmetric but heavy-tailed density estimate. This also explains why the proposed method works well in the simulation studies even though it is not robust in the real data analysis with a small sample containing some large outliers.
Empirical MSE × 100 (empirical bias × 100) for
n | Error | Method | | | |
---|---|---|---|---|---|
250 | (I) | OLS | 0.81 (1.74) | 1.74 (−1.41) | 0.39 (−0.24) |
 | | SMMLE | 0.82 (1.66) | 1.84 (−1.26) | 0.40 (−0.27) |
 | | LCMLE | 0.85 (1.55) | 2.00 (−1.03) | 0.42 (−0.27) |
 | (II) | OLS | 0.77 (0.28) | 1.72 (−0.94) | 0.40 (−0.16) |
 | | SMMLE | 0.71 (0.21) | 1.30 (−0.80) | 0.28 (−0.02) |
 | | LCMLE | 0.71 (0.16) | 1.40 (−0.72) | 0.30 (−0.11) |
 | (III) | OLS | 0.76 (−0.01) | 1.49 (−0.12) | 0.35 (0.44) |
 | | SMMLE | 0.67 (0.07) | 1.21 (−0.23) | 0.26 (0.39) |
 | | LCMLE | 0.51 (0.30) | 0.50 (−0.74) | 0.11 (0.22) |
500 | (I) | OLS | 0.34 (−0.19) | 0.81 (0.40) | 0.19 (−0.83) |
 | | SMMLE | 0.35 (−0.18) | 0.84 (0.38) | 0.19 (−0.72) |
 | | LCMLE | 0.35 (−0.11) | 0.88 (0.25) | 0.21 (−0.87) |
 | (II) | OLS | 0.40 (0.20) | 0.69 (0.13) | 0.18 (−0.02) |
 | | SMMLE | 0.33 (0.31) | 0.54 (−0.11) | 0.13 (0.10) |
 | | LCMLE | 0.34 (0.27) | 0.55 (−0.02) | 0.14 (0.07) |
 | (III) | OLS | 0.40 (0.49) | 0.83 (−0.54) | 0.21 (0.46) |
 | | SMMLE | 0.34 (0.62) | 0.67 (−0.81) | 0.14 (0.49) |
 | | LCMLE | 0.27 (0.51) | 0.28 (−0.56) | 0.06 (0.22) |
MSE = mean squared error; OLS = ordinary least squares; SMMLE = smoothed MLE; LCMLE = log-concave MLE; MLE = maximum likelihood estimator.
Estimated parameters and standard errors for the Stack Loss Plant data excluding outliers
Method | Intercept | Air Flow | Water Temp | Acid Conc |
---|---|---|---|---|
OLS | −37.6525 (4.7321) | 0.7977 (0.0674) | 0.5773 (0.1660) | −0.0671 (0.0616) |
SMMLE | −37.5401 (5.2553) | 0.8055 (0.1006) | 0.5558 (0.1970) | −0.0682 (0.0807) |
LCMLE | −39.0334 (5.1879) | 0.7958 (0.0993) | 0.6017 (0.1819) | −0.0555 (0.0734) |
OLS = ordinary least squares; SMMLE = smoothed MLE; LCMLE = log-concave MLE; MLE = maximum likelihood estimator.
Estimated parameters and standard errors for the Stack Loss Plant data including outliers
Method | Intercept | Air Flow | Water Temp | Acid Conc |
---|---|---|---|---|
OLS | −39.9197 (11.8960) | 0.7156 (0.1349) | 1.2953 (0.3680) | −0.1521 (0.1563) |
SMMLE | −36.0085 (9.7895) | 0.8416 (0.2043) | 0.4838 (0.6130) | −0.0872 (0.1433) |
LCMLE | −37.5101 (10.7084) | 0.6759 (0.1879) | 1.3638 (0.5470) | −0.1690 (0.1404) |
OLS = ordinary least squares; SMMLE = smoothed MLE; LCMLE = log-concave MLE; MLE = maximum likelihood estimator.
Estimated parameters and standard errors for 19th KLIPS data
Method | OLS | SMMLE | LCMLE |
---|---|---|---|
 | 6.9362 (0.0629) | 7.0528 (0.0879) | 7.0013 (0.0551) |
 | −0.5377 (0.0171) | −0.5083 (0.0192) | −0.4827 (0.0141) |
 | 0.5938 (0.0331) | 0.4451 (0.0504) | 0.3417 (0.0262) |
 | 0.7220 (0.0324) | 0.5695 (0.0516) | 0.4903 (0.0263) |
 | 0.7051 (0.0324) | 0.5388 (0.0527) | 0.5049 (0.0308) |
 | 0.1368 (0.0371) | 0.0358 (0.0556) | 0.0605 (0.0333) |
 | 0.1750 (0.0073) | 0.1716 (0.0091) | 0.1840 (0.0068) |
 | 0.0590 (0.0166) | 0.0418 (0.0155) | 0.0279 (0.0128) |
OLS = ordinary least squares; SMMLE = smoothed MLE; LCMLE = log-concave MLE; MLE = maximum likelihood estimator.
95% confidence intervals for estimated parameters for 19th KLIPS data
 | OLS | | SMMLE | | LCMLE | |
---|---|---|---|---|---|---|
 | 2.5 % | 97.5 % | 2.5 % | 97.5 % | 2.5 % | 97.5 % |
 | 6.8130 | 7.0595 | 6.8241 | 7.1792 | 6.8942 | 7.1073 |
 | −0.5713 | −0.5041 | −0.5370 | −0.4592 | −0.5095 | −0.4570 |
 | 0.5290 | 0.6586 | 0.3224 | 0.5354 | 0.2918 | 0.3918 |
 | 0.6585 | 0.7855 | 0.4522 | 0.6619 | 0.4386 | 0.5404 |
 | 0.6383 | 0.7719 | 0.4217 | 0.6372 | 0.4415 | 0.5619 |
 | 0.0640 | 0.2097 | −0.0799 | 0.1373 | −0.0031 | 0.1256 |
 | 0.1606 | 0.1893 | 0.1629 | 0.1979 | 0.1712 | 0.1971 |
 | 0.0265 | 0.0916 | 0.0079 | 0.0684 | 0.0043 | 0.0552 |
OLS = ordinary least squares; SMMLE = smoothed MLE; LCMLE = log-concave MLE; MLE = maximum likelihood estimator.