On deletion diagnostics in regression
Communications for Statistical Applications and Methods 2025;32:81-89
Published online January 31, 2025
© 2025 Korean Statistical Society.

Myung Geun Kim^a, Kang-Mo Jung^{1,b}

^a Seowon University, Korea; ^b Department of Mathematics, Kunsan National University, Korea
Correspondence to: ^1 Department of Mathematics, Kunsan National University, 558 Daehakro, Kunsan 54150, Korea.
Email: mgkim@seowon.ac.kr, kmjung@kunsan.ac.kr
Received September 8, 2024; Revised October 22, 2024; Accepted October 22, 2024.
Abstract
The change in the least squares estimator (LSE) of a vector of regression coefficients due to a single case deletion has often been used for investigating the influence of an observation on the LSE. A normalization of the change in the LSE using the Moore-Penrose inverse of the covariance matrix of the change in the LSE is derived. This normalization turns out to be a square of the internally Studentized residual. It is shown that the numerator part of Cook’s distance does not in general have a chi-squared distribution. An elaborate explanation for the inappropriateness of the choice of the scaling matrix defining Cook’s distance is given. A new diagnostic measure is suggested by reflecting a distributional property of the change in the LSE due to a single case deletion. Three numerical examples are given for illustration.
Keywords: case deletions, Cook’s distance, Moore-Penrose inverse, rank of a covariance matrix, support of a distribution
1. Introduction

In regression, many diagnostic methods for identifying outliers or influential observations have been suggested (Chatterjee and Hadi, 1988; Cook and Weisberg, 1982). Among them, our interest is confined to deletion diagnostics that investigate the influence of an observation on the LSE of the vector of regression coefficients. The change in the LSE due to a single case deletion is a vector quantity, so observations cannot be ordered according to these changes directly. The changes in the LSE due to case deletions are therefore usually normalized or scaled so that observations can be ordered based on the normalized or scaled changes.

We derive a normalization of the change in the LSE using the Moore-Penrose inverse of the covariance matrix of the change in the LSE in Section 3.1. This normalized change becomes a square of the internally Studentized residual. Cook (1977) introduced a scaled distance by scaling the change in the LSE due to a single case deletion. In Section 3.2 we show that the numerator part of Cook’s distance does not in general have a chi-squared distribution. The scaling matrix defining Cook’s distance does not reflect a distributional property of the change in the LSE due to a single case deletion, and we give an elaborate explanation for the inappropriateness of this choice of scaling matrix. In Section 4, by reflecting a distributional property of the change in the LSE due to a single case deletion, we suggest a new diagnostic measure. This influence measure enables observations to be ordered naturally according to their influence and avoids a normalizing or scaling process. In Section 5, three numerical examples are given for illustration.

2. Preliminaries

A linear regression model can be defined by

$$y = X\beta + \varepsilon,$$

where y is an n × 1 vector of response variables, X = (x_1, . . . , x_n)^T is an n × p matrix of full column rank consisting of n measurements on the p fixed independent variables, β is a p × 1 vector of unknown regression coefficients, and ε = (ε_1, . . . , ε_n)^T is an n × 1 vector of unobservable random errors in which ε_i and ε_j are uncorrelated for all i ≠ j. Further, each ε_i is assumed to have mean zero and variance σ².

The LSE of β is β̂ = (X^TX)^{-1}X^Ty, which is an unbiased estimator of β, and the covariance matrix of β̂ is cov(β̂) = σ²(X^TX)^{-1}. We write the hat matrix as H = (h_ij) = X(X^TX)^{-1}X^T. The residual vector is e = (e_1, . . . , e_n)^T = (I_n − H)y, where I_n is the identity matrix of order n. An unbiased estimator of σ² is σ̂² = e^Te/(n − p). More details can be found in Seber (1977).
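For concreteness, these quantities can be computed directly. The following is a minimal sketch in Python with numpy; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def ols_quantities(X, y):
    """Return the LSE, hat matrix, residual vector, and unbiased variance estimate."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y        # LSE: (X^T X)^{-1} X^T y
    H = X @ XtX_inv @ X.T               # hat matrix H = X (X^T X)^{-1} X^T
    e = y - H @ y                       # residual vector e = (I_n - H) y
    sigma2_hat = (e @ e) / (n - p)      # unbiased estimator of sigma^2
    return beta_hat, H, e, sigma2_hat
```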

3. Deletion diagnostic measures

The LSE of β computed without the ith observation is written as β̂(i). Then the change in β̂ due to a deletion of the ith observation is given by

$$\hat\beta - \hat\beta_{(i)} = \frac{(X^TX)^{-1} x_i e_i}{1 - h_{ii}}, \qquad (i = 1, \ldots, n),$$

whose derivation can be found in Miller (1974).
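This identity can be checked numerically by deleting a case and refitting. The sketch below uses simulated data of our own choosing and compares the direct refit with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
e = y - H @ y

i = 4  # an arbitrary case index
beta_del = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)[0]
change = XtX_inv @ X[i] * e[i] / (1 - H[i, i])  # closed form for beta_hat - beta_hat_(i)
assert np.allclose(beta_hat - beta_del, change)
```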

The mean vector of β̂ − β̂_(i) is zero and its covariance matrix is

$$\mathrm{cov}(\hat\beta - \hat\beta_{(i)}) = \frac{\sigma^2}{1 - h_{ii}} V_i,$$

where

$$V_i = (X^TX)^{-1} x_i x_i^T (X^TX)^{-1}.$$

An estimator of cov(β̂ − β̂_(i)) is

$$\widehat{\mathrm{cov}}(\hat\beta - \hat\beta_{(i)}) = \frac{\hat\sigma^2}{1 - h_{ii}} V_i.$$

The rank of V_i is one for a nonzero vector x_i. It is easily shown that x_i^T(X^TX)^{-2}x_i is the only nonzero eigenvalue of V_i and that its associated eigenvector is (X^TX)^{-1}x_i. The two vectors β̂ − β̂_(i) and the eigenvector (X^TX)^{-1}x_i of V_i lie on the same line. For more details, refer to Kim (2015).

The change β̂ − β̂_(i) is often used for investigating the influence of the ith observation on β̂, usually through a normalizing or scaling process of the form

$$(\hat\beta - \hat\beta_{(i)})^T M (\hat\beta - \hat\beta_{(i)}),$$

where M is an appropriately chosen matrix of order p. In the following two subsections we discuss two kinds of choices of M.

3.1. A normalization of β̂ − β̂_(i) using the Moore-Penrose inverse of the estimated covariance matrix of β̂ − β̂_(i)

The covariance matrix of β̂ − β̂_(i) is singular and hence not invertible. Following the line of the proof of Theorem 5.1 in Schott (1997), we obtain the Moore-Penrose inverse of V_i as

$$V_i^{+} = \left[x_i^T (X^TX)^{-2} x_i\right]^{-2} V_i.$$

Hence the Moore-Penrose inverse of the estimated covariance matrix of β̂ − β̂_(i) is computed as

$$\left[\widehat{\mathrm{cov}}(\hat\beta - \hat\beta_{(i)})\right]^{+} = \left(\frac{\hat\sigma^2}{1 - h_{ii}}\right)^{-1}\left[x_i^T (X^TX)^{-2} x_i\right]^{-2} V_i.$$

By using the Moore-Penrose inverse of the estimated covariance matrix, a normalized distance between β̂ and β̂_(i) is obtained as

$$(\hat\beta - \hat\beta_{(i)})^T \left[\widehat{\mathrm{cov}}(\hat\beta - \hat\beta_{(i)})\right]^{+} (\hat\beta - \hat\beta_{(i)}) = \frac{e_i^2}{\hat\sigma^2 (1 - h_{ii})}.$$

This normalization of β̂ − β̂_(i) is just the square of the ith internally Studentized residual (see equation (4.6) in Chatterjee and Hadi (1988) for the internally Studentized residuals).
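This equality is easy to verify numerically; continuing the simulated example above, numpy's pinv supplies the Moore-Penrose inverse.

```python
sigma2_hat = (e @ e) / (n - p)
v = XtX_inv @ X[i]                     # the eigenvector (X^T X)^{-1} x_i of V_i
Vi = np.outer(v, v)                    # V_i, a rank-one matrix
cov_hat = sigma2_hat / (1 - H[i, i]) * Vi
d = v * e[i] / (1 - H[i, i])           # beta_hat - beta_hat_(i)
lhs = d @ np.linalg.pinv(cov_hat) @ d  # Moore-Penrose normalized distance
rhs = e[i] ** 2 / (sigma2_hat * (1 - H[i, i]))  # squared internally Studentized residual
assert np.allclose(lhs, rhs)
```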

3.2. Cook’s distance

We will assume hereafter that the error terms have a normal distribution with mean zero and variance σ2. Based on a confidence ellipsoid for β, Cook (1977) introduced a diagnostic measure which can be expressed as

$$D_i = \frac{1}{p}(\hat\beta - \hat\beta_{(i)})^T \left[\widehat{\mathrm{cov}}(\hat\beta)\right]^{-1}(\hat\beta - \hat\beta_{(i)}) = \frac{1}{p\hat\sigma^2}(\hat\beta - \hat\beta_{(i)})^T (X^TX)(\hat\beta - \hat\beta_{(i)}) = \frac{1}{p}\,\frac{h_{ii}}{1 - h_{ii}}\,\frac{e_i^2}{\hat\sigma^2(1 - h_{ii})}.$$

Cook’s distance D_i is a scaled distance between β̂ and β̂_(i) using the inverse of the estimated covariance matrix of β̂.
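All n values of D_i follow from the hat diagonal and the residuals; a sketch continuing the same example:

```python
h = np.diag(H)
D = (h / (1 - h)) * e ** 2 / (p * sigma2_hat * (1 - h))  # Cook's distances D_1, ..., D_n
print(np.argsort(-D)[:3])  # the three cases with the largest D_i
```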

3.2.1. On comparing Di to the percentiles of the central F-distribution

The quantity

$$\frac{1}{\sigma^2}(\hat\beta - \beta)^T (X^TX)(\hat\beta - \beta)$$

has a chi-squared distribution with p degrees of freedom. However, the numerator part of Di

$$\frac{1}{\sigma^2}(\hat\beta - \hat\beta_{(i)})^T (X^TX)(\hat\beta - \hat\beta_{(i)})$$

does not in general have a chi-squared distribution, which will be explained in this subsection. To this end, we will use Theorem 9.10 in Schott (1997) restated in the following lemma for easy reference.

Lemma 1. Assume that a random vector x is distributed as a p-variate normal distribution N_p(0, Ω) with zero mean vector and positive semidefinite covariance matrix Ω. Let A be a p × p symmetric matrix. Then the quadratic form x^TAx has a chi-squared distribution with r degrees of freedom if and only if ΩAΩAΩ = ΩAΩ and tr(AΩ) = r.

Note that β̂ − β̂_(i) has a p-variate normal distribution with zero mean vector and covariance matrix V_i/(1 − h_ii). By putting

$$\Omega = \frac{1}{1 - h_{ii}} V_i \quad \text{and} \quad A = X^TX$$

in Lemma 1, we have

$$\Omega A \Omega A \Omega = \frac{h_{ii}^2}{(1 - h_{ii})^3} V_i \quad \text{and} \quad \Omega A \Omega = \frac{h_{ii}}{(1 - h_{ii})^2} V_i.$$

When h_ii = 0 or 1/2, the first condition, ΩAΩAΩ = ΩAΩ, holds. The second condition becomes

$$\mathrm{tr}(A\Omega) = \frac{h_{ii}}{1 - h_{ii}}.$$

Thus we have the following theorem.

Theorem 1. For each i, the numerator part of D_i,

$$\frac{1}{\sigma^2}(\hat\beta - \hat\beta_{(i)})^T (X^TX)(\hat\beta - \hat\beta_{(i)}),$$

has a chi-squared distribution with one degree of freedom only when h_ii = 1/2.

Cook (1977) suggests that each D_i be compared to the percentiles of the central F-distribution F(p, n − p). However, D_i does not strictly have an F-distribution (see p. 120 of Chatterjee and Hadi, 1988). Moreover, Theorem 1 shows that the numerator part of D_i does not have a chi-squared distribution except in the case h_ii = 1/2. Hence the blanket use of the F-distribution as a reference distribution for D_i is inappropriate.
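Both conditions of Lemma 1 can be inspected numerically. Continuing the simulated example, the first condition fails there because h_ii ≠ 1/2, while tr(AΩ) equals h_ii/(1 − h_ii) rather than 1:

```python
Omega = Vi / (1 - H[i, i])
A = X.T @ X
# first condition of Lemma 1: fails unless h_ii = 1/2
print(np.allclose(Omega @ A @ Omega @ A @ Omega, Omega @ A @ Omega))
# tr(A Omega) = h_ii / (1 - h_ii), which is not 1 in general
print(np.trace(A @ Omega), H[i, i] / (1 - H[i, i]))
```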

3.2.2. On the choice of XTX as a scaling matrix

First, in the following theorem we state a distributional property of a random vector with a singular covariance matrix helpful for easy understanding of the rest of this subsection.

Theorem 2. Assume that a random vector x is distributed as a p-variate normal distribution N_p(0, Ω), where the rank of the covariance matrix Ω is q with 1 ≤ q < p. Then x takes values in the column space of Ω with probability one.

Proof: Let the spectral decomposition of Ω be

Ω=ΓΛΓT,

where Γ is an orthogonal matrix with kth column γ_k (k = 1, . . . , p) and Λ is the diagonal matrix diag(λ_1, . . . , λ_q, 0, . . . , 0) with the positive eigenvalues λ_1, . . . , λ_q of Ω. The set {γ_1, . . . , γ_p} forms an orthonormal basis for ℝ^p.

Since Ω is symmetric, the row space of Ω is identical to the column space of Ω. Let R(Ω) be the column space of Ω and N(Ω) be the null space of Ω. The set {γ1, . . . , γq} is an orthonormal basis for R(Ω), while the set {γq+1, . . . , γp} is an orthonormal basis for N(Ω). Since R(Ω) is the orthogonal complement of N(Ω), we have

$$x \in R(\Omega) \quad \text{if and only if} \quad \gamma_j^T x = 0 \ \text{for all } j = q + 1, \ldots, p,$$

which yields

$$\{x \in R(\Omega)\} = \bigcap_{j=q+1}^{p} \{\gamma_j^T x = 0\}.$$

For each j = q + 1, . . . , p, the mean of γ_j^Tx is zero and its variance is γ_j^TΩγ_j = 0. Hence the probability that γ_j^Tx is equal to 0 is

$$P(\gamma_j^T x = 0) = 1.$$

Thus it follows that P(x ∈ R(Ω)) = 1; that is, x takes values in the column space of Ω with probability one.

In association with Theorem 2, it is well known that a p-variate normal random vector with a singular covariance matrix Ω does not have a probability density function with respect to the usual Lebesgue measure on ℝp. However, it has a probability density function with respect to the Lebesgue measure restricted to the column space of Ω (Khatri, 1968) and the support of this distribution is the column space of Ω.
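A small simulation illustrates Theorem 2: draws from a singular normal distribution have zero components along every null-space direction of Ω. The rank-one Ω below is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
u = np.array([1.0, 2.0, 3.0])
Omega = np.outer(u, u)                  # a rank-one, hence singular, covariance matrix
x = rng.multivariate_normal(np.zeros(3), Omega, size=1000)
w, Gamma = np.linalg.eigh(Omega)
null_basis = Gamma[:, w < 1e-10]        # orthonormal basis of the null space of Omega
print(np.abs(x @ null_basis).max())     # ~ 0: every draw lies in the column space of Omega
```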

We consider the spectral decomposition of XTX expressed as

$$X^TX = G L G^T,$$

where L = diag(l1, . . . , lp) is a p × p diagonal matrix consisting of the eigenvalues of XTX, G = (g1, . . . , gp) is a p × p orthogonal matrix, and gk is the standardized eigenvector of XTX associated with the eigenvalue lk. Then each Di can be expressed as

$$D_i = \frac{1}{p\hat\sigma^2} \sum_{k=1}^{p} l_k \left[(\hat\beta - \hat\beta_{(i)})^T g_k\right]^2.$$

The terms l_k and (β̂ − β̂_(i))^Tg_k play a specific role in determining the magnitude of D_i for each i.
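In the running simulated example, this spectral form reproduces D_i exactly:

```python
l, G = np.linalg.eigh(X.T @ X)  # eigenvalues l_k and eigenvectors g_k of X^T X
coords = G.T @ d                # coordinates of beta_hat - beta_hat_(i) along each g_k
assert np.allclose((l * coords ** 2).sum() / (p * sigma2_hat), D[i])
```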

Since the rank of cov(β̂ − β̂_(i)) is one, the column space of cov(β̂ − β̂_(i)) is the line generated by the eigenvector (X^TX)^{-1}x_i of V_i, which is a one-dimensional subspace of ℝ^p. Theorem 2 shows that β̂ − β̂_(i) is distributed entirely along the line generated by (X^TX)^{-1}x_i. This line is the support of the distribution of β̂ − β̂_(i).

The set {g_1, . . . , g_p} is an orthonormal basis for ℝ^p. The line generated by the basis vector g_k, denoted by L_k, forms one coordinate axis in ℝ^p for each k. If the coordinate axis L_k does not coincide with the line generated by (X^TX)^{-1}x_i, then L_k, excluding its origin, lies outside the support of the distribution of β̂ − β̂_(i). In this case the quantity (β̂ − β̂_(i))^Tg_k, which is just the coordinate of β̂ − β̂_(i) along the axis L_k, is probabilistically meaningless, and the corresponding component of D_i, namely l_k[(β̂ − β̂_(i))^Tg_k]²/(pσ̂²), becomes a partial source of distortion of the real influence of the ith observation on β̂.

Either all p coordinate axes or all but one of them (p − 1 axes) fail to coincide with the line generated by (X^TX)^{-1}x_i, the line in which the random vector β̂ − β̂_(i) takes its values with probability one. Hence the distance D_i inevitably includes the components l_k[(β̂ − β̂_(i))^Tg_k]²/(pσ̂²), for all k or for the p − 1 such values of k, associated with coordinate axes L_k different from this line. These components distort the real influence of the ith observation on β̂ because the coordinates (β̂ − β̂_(i))^Tg_k along axes other than the support line are probabilistically meaningless. Hence the adoption of X^TX as the matrix scaling the distance between β̂ and β̂_(i) is not reasonable, and in general Cook’s distance does not correctly measure the influence of observations on β̂.

4. A new diagnostic measure

A probabilistically meaningful diagnostic measure for investigating the influence of an observation on β̂ through the change β̂ − β̂_(i) should be defined inside the support of the distribution of β̂ − β̂_(i). Recall that the rank of cov(β̂ − β̂_(i)) is one and that (X^TX)^{-1}x_i is the only eigenvector of V_i associated with a nonzero eigenvalue. The line generated by (X^TX)^{-1}x_i, which is the support of the distribution of β̂ − β̂_(i), forms one coordinate axis in ℝ^p, while the other p − 1 coordinate axes, excluding their origins, lie outside this support whatever they are. Hence, by Theorem 2, only the coordinate of β̂ − β̂_(i) along the line generated by (X^TX)^{-1}x_i is probabilistically meaningful among the p coordinates of β̂ − β̂_(i) in ℝ^p. This coordinate (or its absolute value), as a scalar, naturally represents the influence of the ith observation on β̂ that the change β̂ − β̂_(i) reflects in ℝ^p, and it is computed as

$$K_i = \frac{e_i}{1 - h_{ii}} \left\| (X^TX)^{-1} x_i \right\|,$$

where ||a||² = a^Ta for a column vector a. It requires no normalizing or scaling process. It is reasonable to use the quantity K_i as a diagnostic measure of the influence of the ith observation on β̂. The quantities K_1, . . . , K_n are naturally ordered according to their magnitudes, and a relatively large absolute value of K_i implies that the ith observation is potentially influential.
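Computing K_1, . . . , K_n is straightforward; a vectorized sketch continuing the simulated example:

```python
V_cols = XtX_inv @ X.T                            # column i holds (X^T X)^{-1} x_i
K = e / (1 - h) * np.linalg.norm(V_cols, axis=0)  # K_i = e_i ||(X^T X)^{-1} x_i|| / (1 - h_ii)
print(np.argsort(-np.abs(K))[:3])                 # cases with the largest |K_i|
```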

5. Numerical examples

5.1. Hald data

The regression model with the intercept term is fitted to the Hald data (Draper and Smith, 1981) with 13 observations on a single response variable and four independent variables. Index plots of the Di values and the Ki values for the Hald data are shown in Figure 1.

For the Hald data, our discussion is confined to observations 3 and 8. Cook’s distances identify observation 8 as the most influential (D8 = 0.39) and observation 3 as the next (D3 = 0.30). However, the Ki values identify observation 3 as the most influential (K3 = −76.20) and observation 8 as the next (K8 = −25.17). An analysis of the sources of the Di values for observations 3 and 8 shows that D8 enlarges the real influence of observation 8 on β̂, while D3 reduces the real influence of observation 3 (Kim, 2017). Hence D3 fails to identify observation 3 as the most influential case even though K3 does, and D8 identifies observation 8 as the most influential even though, by the Ki values, it is not.

5.2. Body fat data

We fit the regression model with the intercept term to the body fat data (Neter et al., 1996, p.261) with 20 observations on a single response variable and three independent variables. Index plots of the Di values and the Ki values for the body fat data are shown in Figure 2.

An analysis of the body fat data is confined to observations 1 and 3. Based on the Di values, observation 3 is the most influential (D3 = 0.30) and observation 1 is the next (D1 = 0.28). However, among the Ki values, observation 1 has the largest absolute value (K1 = −72.92), while observation 3 (K3 = −37.47) does not even have the second largest; that belongs to observation 19 (K19 = −45.59). An investigation of the sources of the Di values for observations 1 and 3 shows that D3 enlarges the real influence of observation 3 on β̂, while D1 reduces the real influence of observation 1 (Kim, 2017). Hence D1 fails to identify observation 1 as the most influential case even though K1 does, and D3 identifies observation 3 as the most influential even though, by the Ki values, it does not have the largest absolute value.

5.3. Rat data

The regression model with the intercept term is fitted to the rat data (Cook and Weisberg, 1982) with 19 observations on a single response variable and three independent variables. Index plots of the Di values and the Ki values for the rat data are shown in Figure 3.

For the rat data, we confine our discussion to observation 3. The Di values show that observation 3 is the most influential (D3 = 0.93). For the Ki values, observation 3 has the largest absolute value (K3 = 2.69). Both diagnostic measures lead to the same conclusion that observation 3 is the most influential. The extent to which the D3 value reflects the real influence of observation 3 on β̂ is very high (Kim, 2017). Hence the D3 value gives the same result as the K3 value.

Figures
Fig. 1. Index plots of the Di values and the Ki values for the Hald data.
Fig. 2. Index plots of the Di values and the Ki values for the body fat data.
Fig. 3. Index plots of the Di values and the Ki values for the rat data.
References
  1. Chatterjee S and Hadi AS (1988). Sensitivity Analysis in Linear Regression, New York, Wiley.
  2. Cook RD (1977). Detection of influential observation in linear regression. Technometrics, 19, 15-18.
  3. Cook RD and Weisberg S (1982). Residuals and Influence in Regression, New York, Chapman and Hall.
  4. Draper NR and Smith H (1981). Applied Regression Analysis (2nd ed), New York, Wiley.
  5. Khatri CG (1968). Some results for the singular normal multivariate regression models. Sankhyā, Series A, 30, 267-280.
  6. Kim MG (2015). Influence measure based on probabilistic behavior of regression estimators. Computational Statistics, 30, 97-105.
  7. Kim MG (2017). A cautionary note on the use of Cook’s distance. Communications for Statistical Applications and Methods, 24, 317-324.
  8. Miller RG (1974). An unbiased jackknife. Annals of Statistics, 2, 880-891.
  9. Neter J, Kutner MH, Nachtsheim CJ, and Wasserman W (1996). Applied Linear Regression Models (3rd ed), Chicago, Irwin.
  10. Schott JR (1997). Matrix Analysis for Statistics, New York, Wiley.
  11. Seber GAF (1977). Linear Regression Analysis, New York, Wiley.