A statistical model is often characterized by its mean function. In practice, however, the correct form of the mean function often does not match the assumed one, and a transformation of variables is needed so that the assumed mean function holds in the transformed scale. We consider this transformation problem for a partial linear model. The standard linear regression model is a basic tool for analyzing statistical data and is widely used because of its simplicity. In some problems, however, the linear relationship between the response and all covariates is not known. A partial linear model is more flexible because it incorporates a nonlinear functional relationship into a general linear model. The partial linear model with a response transformation is given as,
where
In this paper we deal with response transformations in partial linear models in the presence of outliers. Outliers are a common problem in statistical analysis. Many approaches for detecting multiple outliers have been suggested for linear models, for example sequential procedures (Hadi and Simonoff, 1993), high-breakdown methods (Rousseeuw, 1984; Yohai, 1987) and forward searches (Atkinson, 1994). In discussing outliers, the difference between outliers and influential observations in estimating transformations should be understood. An influential observation is one whose deletion has a large effect on the transformation estimates. Cheng (2005) suggested a method for response transformation in a linear model that is robust against influential observations, using the least trimmed squares estimator and the trimmed likelihood estimator. An outlier is an observation that diverges from the overall pattern formed by the transformed data. Seo
This paper presents a graphical procedure for a robust transformation using an outlier detection method and dynamic plots. Section 2 suggests a dynamic graphical method for transformation and outlier detection in a partial linear model. The method involves an augmented partial residual plot for specifying the curvature and a sequential procedure for detecting outliers. Section 3 provides several examples with artificial and real data to illustrate the suggested method. Section 4 contains some concluding remarks.
Determining the coefficient of the optimal response transformation in a partial linear model is difficult analytically. We suggest an exploratory procedure in which related plots are observed for many transformations. The plots for each transformation require the estimation of a partial linear model and the detection of outliers. Graphical methods are used for the estimation of the partial linear model. Many graphical methods have been suggested for specifying the curvature in a partial linear model, including the added variable plot (Chamber
With a fixed transformation and the estimated curvature in the model (1.1), an outlier detection method designed for linear models can be applied. In this paper the sequential procedure proposed by Hadi and Simonoff (1993) is used. It consists of three steps: constructing a clean set, calculating residuals, and testing for outliers. For the construction of the initial clean subset, Hadi and Simonoff (1993) suggested two methods. The first method fits the model of
The set of outlier candidates is determined by the absolute value of
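The three steps named above (constructing a clean set, calculating residuals, testing for outliers) can be illustrated with a minimal sketch. It is a simplified forward-growing search, not the exact Hadi and Simonoff (1993) procedure: the function name `clean_subset_search`, the half-sample starting size, the fixed `cutoff`, and the use of plain scaled residuals without leverage adjustment or a t-quantile test are all assumptions made for illustration, and the demo data are synthetic.

```python
import numpy as np

def clean_subset_search(X, y, cutoff=3.0):
    """Simplified sequential outlier search (illustrative only).

    Assumptions: plain scaled residuals, no leverage correction,
    and a fixed cutoff instead of a t-based critical value.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    k = A.shape[1]

    # Step 1: initial clean subset = cases with the smallest absolute
    # residuals from an ordinary least squares fit to all the data.
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    clean = np.argsort(np.abs(y - A @ beta))[: (n + k) // 2]

    while True:
        s = len(clean)
        # Step 2: refit on the clean subset, then compute scaled
        # residuals for every case.
        beta, *_ = np.linalg.lstsq(A[clean], y[clean], rcond=None)
        resid = y - A @ beta
        sigma = np.sqrt(np.sum(resid[clean] ** 2) / (s - k))
        scaled = np.abs(resid) / sigma
        order = np.argsort(scaled)

        # Step 3: test the next candidate; stop when it looks outlying
        # or when every case has entered the clean subset.
        if s == n:
            break
        if scaled[order[s]] > cutoff:
            clean = order[:s]
            break
        clean = order[: s + 1]

    outliers = np.setdiff1d(np.arange(n), clean)
    return np.sort(clean), outliers

# Synthetic demonstration with three planted outliers.
rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)
planted = np.array([3, 17, 30])
y[planted] += 10.0
clean, outliers = clean_subset_search(X, y)
print("flagged cases:", outliers.tolist())
```

Because the fixed cutoff is cruder than a formal test, the sketch may flag a few clean cases in addition to the planted ones; a faithful implementation would follow the test statistic and critical values given in the original reference.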
For an exploratory analysis that examines the transformation, the curvature and the outliers simultaneously, animation techniques in plotting are commonly used (Seo and Yoon, 2009). We use augmented partial residual plots and forward response plots which are animated as
- Fix a value of
- Use Hadi-Simonoff's procedure with
- Calculate the fitted values
- Draw an augmented partial residual plot and a forward response plot (
- Change
- Stop when the clean cases in the forward response plot show a linear trend.
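The steps above can be mimicked numerically as a rough companion to the animated plots. The sketch below rests on several assumptions that are not taken from the text: the transformation is the Box-Cox power family, the curvature in the nonlinear covariate is approximated by a quadratic term (the usual augmentation behind an augmented partial residual plot), and the visual judgment of a linear trend is replaced by the correlation between fitted values and transformed responses over the clean cases. The names `box_cox`, `augmented_fit` and `scan_lambda` are hypothetical, and the data are synthetic.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox power transform (assumed family); y must be positive."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam

def augmented_fit(X, z, y_t, clean):
    """Least squares fit of y_t on X, z and z^2 using clean cases only.

    Returns fitted values and augmented partial residuals for all cases.
    """
    A = np.column_stack([np.ones(len(z)), X, z, z ** 2])
    beta, *_ = np.linalg.lstsq(A[clean], y_t[clean], rcond=None)
    fitted = A @ beta
    resid = y_t - fitted
    # Augmented partial residual: residual plus the estimated z and z^2 parts.
    apr = resid + beta[-2] * z + beta[-1] * z ** 2
    return fitted, apr

def scan_lambda(X, z, y, lambdas, clean=None):
    """Score each lambda by the linearity of response versus fitted values."""
    n = len(y)
    clean = np.ones(n, dtype=bool) if clean is None else np.asarray(clean, dtype=bool)
    scores = {}
    for lam in lambdas:
        y_t = box_cox(y, lam)
        fitted, _ = augmented_fit(X, z, y_t, clean)
        # Numeric stand-in for inspecting the forward response plot.
        scores[float(lam)] = np.corrcoef(fitted[clean], y_t[clean])[0, 1]
    return scores

# Synthetic data whose logarithm is linear in X and quadratic in z.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 2))
z = np.linspace(-1.0, 1.0, n)
y = np.exp(2.0 + 1.0 * X[:, 0] + 0.5 * X[:, 1] + z ** 2 + 0.05 * rng.normal(size=n))

scores = scan_lambda(X, z, y, [-1.0, -0.5, 0.0, 0.5, 1.0])
best = max(scores, key=scores.get)
print("best lambda on this grid:", best)
```

In an interactive setting, the `clean` mask would come from the outlier detection step at each value of the transformation parameter, and the plots themselves, rather than a single correlation, would guide the choice.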
During the procedure the curve
Figure 1 shows an augmented partial residual plot, a forward response plot, and a slider for changing
Artificial data without outliers
The dataset shown in Table 1 is artificially generated according to the model,
where variables
For example, when
Artificial data with outliers
Some observations in Table 1 are modified to include outliers. Three outliers are planted at the 10
Real data (nitrogen in lakes data)
The nitrogen in lakes data (Atkinson and Riani, 2000, p.297) include 29 observations on the amount of nitrogen in US lakes with the variables,
Stromberg (1993) fit the model using the least median of squares estimate and the MM estimate. The fitted model yielded
We fit the model (1.1) and conducted the dynamic graphical procedure. Augmented partial residual plots and forward response plots are shown in Figure 4. Judging from the forward response plots, the candidates for the optimal estimate of
Artificial data (for exploratory analysis)
This example illustrates the effectiveness of the exploratory procedure using dynamic plots. Eighteen observations are generated from the model,
where covariates
From the augmented partial residual plots for many values of
Considering this information, observations 19 and 20 are deleted from the data and the optimal value of
The problem of outlier detection and response transformation in a partial linear model is difficult to handle analytically. An exploratory procedure is suggested as a unified method for this problem. The procedure combines outlier detection methods and graphical techniques to provide an appropriate variable transformation that is robust to outliers. Diagnostic measures are calculated from the data excluding outliers and are plotted. The examples show that the dynamic plots make it possible to examine the role of individual observations from a diagnostic point of view.
The suggested procedure uses a sequential method to detect outliers, an augmented plot to estimate the curvature, and the forward response plot to assess the fit. The performance of the proposed procedure depends on the outlier detection method and the curvature estimation. If a dataset is large and has enough repeated observations to estimate
This paper was supported by Konkuk University in 2018.
Dynamic plots: An augmented partial residual plot, a forward response plot and a lambda control-slider.
Dynamic plots with
Dynamic plots with
Dynamic plots with
Dynamic plots with
Augmented partial residual plots with
Dynamic plots with
Table 1. Generated data from the model (3.1)
Case # | ||||
---|---|---|---|---|
1 | 5.54 | 5.11 | −1.00 | 123059 |
2 | 5.05 | 5.98 | −0.96 | 154554 |
3 | 4.12 | 6.38 | −0.92 | 85194 |
4 | 4.55 | 7.61 | −0.88 | 367573 |
5 | 4.44 | 4.88 | −0.84 | 24570 |
6 | 3.99 | 4.55 | −0.80 | 9388 |
7 | 5.57 | 5.13 | −0.76 | 76244 |
8 | 6.21 | 5.31 | −0.71 | 172704 |
9 | 4.56 | 4.66 | −0.67 | 15879 |
10 | 5.57 | 3.41 | −0.63 | 11464 |
11 | 4.50 | 5.84 | −0.59 | 44631 |
12 | 4.39 | 4.95 | −0.55 | 15226 |
13 | 4.45 | 3.73 | −0.51 | 4719 |
14 | 6.04 | 6.61 | −0.47 | 392211 |
15 | 5.03 | 3.70 | −0.43 | 7701 |
16 | 4.21 | 5.70 | −0.39 | 24558 |
17 | 4.77 | 5.06 | −0.35 | 21159 |
18 | 4.05 | 5.67 | −0.31 | 17971 |
19 | 6.42 | 4.14 | −0.27 | 44608 |
20 | 4.97 | 4.19 | −0.22 | 10798 |
21 | 4.18 | 5.89 | −0.18 | 24561 |
22 | 5.13 | 5.29 | −0.14 | 30737 |
23 | 3.62 | 5.39 | −0.10 | 7731 |
24 | 6.09 | 5.68 | −0.06 | 132488 |
25 | 5.93 | 3.47 | −0.02 | 11718 |
26 | 5.30 | 4.35 | 0.02 | 16334 |
27 | 5.37 | 4.34 | 0.06 | 17862 |
28 | 5.15 | 4.56 | 0.10 | 17025 |
29 | 5.39 | 4.46 | 0.14 | 18966 |
30 | 5.01 | 5.03 | 0.18 | 24532 |
31 | 8.06 | 4.76 | 0.22 | 368547 |
32 | 4.25 | 4.31 | 0.27 | 5744 |
33 | 6.00 | 4.98 | 0.31 | 65102 |
34 | 4.22 | 5.55 | 0.35 | 19864 |
35 | 6.77 | 3.95 | 0.39 | 55872 |
36 | 5.00 | 5.00 | 0.43 | 24839 |
37 | 4.49 | 5.33 | 0.47 | 22878 |
38 | 3.76 | 6.51 | 0.51 | 38361 |
39 | 4.81 | 5.19 | 0.55 | 29470 |
40 | 4.52 | 4.45 | 0.59 | 10586 |
41 | 5.64 | 5.51 | 0.63 | 100145 |
42 | 5.74 | 1.80 | 0.67 | 2926 |
43 | 4.89 | 5.64 | 0.71 | 60870 |
44 | 5.42 | 5.67 | 0.76 | 101969 |
45 | 6.70 | 6.43 | 0.80 | 984019 |
46 | 4.20 | 4.48 | 0.84 | 12227 |
47 | 3.38 | 4.32 | 0.88 | 4709 |
48 | 6.27 | 4.70 | 0.92 | 133900 |
49 | 4.28 | 3.73 | 0.96 | 7554 |
50 | 4.45 | 6.24 | 1.00 | 122214 |
Table 2. Generated data from the model (3.2)
Case # | ||||
---|---|---|---|---|
1 | 7.3 | 4.4 | 1.0 | 138.9 |
2 | 4.7 | 5.7 | 1.1 | 117.1 |
3 | 4.2 | 6.0 | 1.1 | 100.6 |
4 | 5.6 | 7.1 | 1.2 | 157.8 |
5 | 5.3 | 4.5 | 1.2 | 79.0 |
6 | 4.2 | 3.9 | 1.3 | 71.9 |
7 | 5.5 | 5.3 | 1.4 | 118.3 |
8 | 6.3 | 5.2 | 1.4 | 132.3 |
9 | 7.2 | 4.2 | 1.5 | 150.1 |
10 | 4.5 | 4.5 | 1.5 | 98.6 |
11 | 4.6 | 4.7 | 1.6 | 97.5 |
12 | 6.0 | 5.8 | 1.6 | 131.7 |
13 | 3.4 | 4.8 | 1.7 | 80.9 |
14 | 3.2 | 3.4 | 1.8 | 49.3 |
15 | 4.0 | 6.0 | 1.8 | 107.4 |
16 | 6.2 | 4.7 | 1.9 | 128.3 |
17 | 5.5 | 6.0 | 1.9 | 151.9 |
18 | 5.3 | 5.8 | 2.0 | 139.9 |
19 | 4.7 | 4.0 | 2.0 | 128.9 |
20 | 5.2 | 5.3 | 2.0 | 174.9 |