A cautionary note on the use of Cook’s distance

Myung Geun Kim

Department of Mathematics Education, Seowon University, Korea
Correspondence to: Department of Mathematics Education, Seowon University, 377-3, Musimseo-ro, Seowon-gu, Cheongju-si, Chungcheongbuk-do 28674, Korea. E-mail: mgkim@seowon.ac.kr
Received April 6, 2017; Revised May 6, 2017; Accepted May 8, 2017.
Abstract

An influence measure known as Cook’s distance has been used for judging the influence of each observation on the least squares estimate of the parameter vector. The distance does not reflect the distributional property of the change in the least squares estimator of the regression coefficients due to case deletions: the distribution has a covariance matrix of rank one and thus it has a support set determined by a line in the multidimensional Euclidean space. As a result, the use of Cook’s distance may fail to correctly provide information about influential observations, and we study some reasons for the failure. Three illustrative examples will be provided, in which the use of Cook’s distance fails to give the right information about influential observations or it provides the right information about the most influential observation. We will seek some reasons for the wrong or right provision of information.

Keywords: case deletion, Cook’s distance, influence, regression
1. Introduction

In a regression context, there are many measures of the influence of observations on the least squares estimate of the parameter vector. Some of them can be found in Chatterjee and Hadi (1988) and Cook and Weisberg (1982). Since the change in the least squares estimate of regression coefficients due to case deletions is a vector quantity, it is usually normalized or scaled so that observations can be ordered in a certain way. In this vein Cook (1977) introduced an influence measure based on confidence ellipsoids.

The distribution of the change in the least squares estimator of the regression coefficients due to case deletions has a support set determined by a line in the multidimensional Euclidean space. Cook’s distance does not reflect this kind of distributional property, and thus it can reduce or enlarge the influence of an observation on the least squares estimate. As a result, the use of Cook’s distance is likely to underestimate the influence of an observation on the least squares estimate or overestimate it, which will be studied in Section 2. Hence the use of Cook’s distance may lead to a wrong detection of influential observations. In Section 3 we consider three illustrative examples. For two examples, the use of Cook’s distance fails to give the right information about influential observations, and for one example, it provides the right information about the most influential observation. We will seek some reasons for the wrong or right provision of information.

2. On Cook’s distance

A linear regression model of our interest can be expressed as

$y = X\beta + \varepsilon,$

where $y$ is an $n \times 1$ vector of values of the response variable, $X = (x_1, \ldots, x_n)^T$ is an $n \times p$ matrix of full column rank consisting of $n$ measurements on the $p$ fixed explanatory variables, possibly including the intercept term, $\beta = (\beta_0, \beta_1, \ldots, \beta_{p-1})^T$ is a $p \times 1$ vector of unknown regression coefficients, and $\varepsilon$ is an $n \times 1$ vector of independent random errors, each of which has zero mean and unknown variance $\sigma^2$. We write the least squares estimator of $\beta$ as $\hat\beta = (X^TX)^{-1}X^Ty$, which is an unbiased estimator of $\beta$ with covariance matrix $\mathrm{cov}(\hat\beta) = \sigma^2(X^TX)^{-1}$. The $n \times n$ matrix $H = (h_{ij}) = X(X^TX)^{-1}X^T$ is the hat matrix, and $e = (e_1, \ldots, e_n)^T = (I - H)y$ is the residual vector. The mean of the residual vector $e$ is zero and its covariance matrix is $\mathrm{cov}(e) = \sigma^2(I - H)$. An unbiased estimator of $\sigma^2$ is given by $\hat\sigma^2 = e^Te/(n - p)$. More details can be found in Seber (1977).
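As a minimal sketch of the quantities just defined, the following NumPy code computes the least squares estimate, the hat matrix, the residual vector, and the variance estimate for a small synthetic data set. The data, the random seed, and all variable names are assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not taken from the paper).
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y      # least squares estimate of beta
H = X @ XtX_inv @ X.T             # hat matrix
e = (np.eye(n) - H) @ y           # residual vector e = (I - H) y
sigma2_hat = e @ e / (n - p)      # unbiased estimate of sigma^2

print(beta_hat, sigma2_hat)
```

The hat matrix is formed explicitly here only because the data set is tiny; for larger problems one would normally compute just its diagonal.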

The least squares estimator of $\beta$ computed without observation $r$ is written as $\hat\beta_{(r)}$. Miller (1974) showed that

$\hat\beta - \hat\beta_{(r)} = \frac{(X^TX)^{-1}x_r e_r}{1 - h_{rr}}, \quad r = 1, \ldots, n.$
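The case-deletion identity can be checked numerically. The sketch below compares the closed form with a brute-force refit after deleting each row in turn; the synthetic data and the helper names `change_by_formula` and `change_by_refit` are hypothetical and serve only as an illustration.

```python
import numpy as np

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(1)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # diagonal of the hat matrix
e = y - X @ beta_hat                          # residuals

def change_by_formula(r):
    """beta_hat - beta_hat_(r) from the closed form (r is a 0-based index)."""
    return (XtX_inv @ X[r]) * e[r] / (1.0 - h[r])

def change_by_refit(r):
    """The same quantity obtained by deleting row r and refitting."""
    keep = np.arange(n) != r
    beta_r = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta_hat - beta_r

max_gap = max(np.abs(change_by_formula(r) - change_by_refit(r)).max()
              for r in range(n))
print(max_gap)  # expected to be at rounding-error level
```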

The mean vector of $\hat\beta - \hat\beta_{(r)}$ is zero and its covariance matrix is

$\mathrm{cov}(\hat\beta - \hat\beta_{(r)}) = \frac{\sigma^2}{1 - h_{rr}}(X^TX)^{-1}x_r x_r^T(X^TX)^{-1}.$

The rank of $\mathrm{cov}(\hat\beta - \hat\beta_{(r)})$ is one for nonnull $x_r$. The only nonzero eigenvalue of $\mathrm{cov}(\hat\beta - \hat\beta_{(r)})$ is $\sigma^2 x_r^T(X^TX)^{-2}x_r/(1 - h_{rr})$ and its associated eigenvector is $(X^TX)^{-1}x_r$. Let $V_r$ denote the one-dimensional subspace of the $p$-dimensional Euclidean space generated by $(X^TX)^{-1}x_r$. The subspace $V_r$ is just the line along which the eigenvector $(X^TX)^{-1}x_r$ lies, and each $\hat\beta - \hat\beta_{(r)}$ takes values in the set $V_r$ with probability one. More details about the distribution of $\hat\beta - \hat\beta_{(r)}$ can be found in Kim (2015).
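One way to see where this covariance matrix and its rank-one structure come from is the short derivation below. It uses only the case-deletion formula and $\mathrm{cov}(e) = \sigma^2(I - H)$; the shorthand $v_r = (X^TX)^{-1}x_r$ is introduced here for convenience and is not notation from the paper.

$\mathrm{var}(e_r) = \sigma^2(1 - h_{rr}) \;\Rightarrow\; \mathrm{cov}(\hat\beta - \hat\beta_{(r)}) = \frac{v_r\,\mathrm{var}(e_r)\,v_r^T}{(1 - h_{rr})^2} = \frac{\sigma^2}{1 - h_{rr}}\,v_r v_r^T,$

$\mathrm{cov}(\hat\beta - \hat\beta_{(r)})\,v_r = \frac{\sigma^2}{1 - h_{rr}}\,(v_r^Tv_r)\,v_r, \qquad v_r^Tv_r = x_r^T(X^TX)^{-2}x_r.$

Since $v_r v_r^T$ maps every vector orthogonal to $v_r$ to zero, the eigenvalue above is the only nonzero one when $x_r$ is nonnull.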

In order to investigate the change in the value of $\hat\beta$ due to a deletion of observation $r$, Cook (1977) introduced an influence measure based on confidence ellipsoids as follows:

$D_r = \frac{1}{p\hat\sigma^2}(\hat\beta - \hat\beta_{(r)})^T(X^TX)(\hat\beta - \hat\beta_{(r)}).$
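The definition translates fairly directly into code. The following is a sketch under assumed synthetic data; the function name `cooks_distance` is hypothetical. As a cross-check it also evaluates the algebraically equivalent residual and leverage form $e_r^2 h_{rr}/\{p\hat\sigma^2(1 - h_{rr})^2\}$, which follows from substituting the case-deletion formula into the definition.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_r for every observation, computed from its definition."""
    n, p = X.shape
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # leverages h_rr
    sigma2_hat = e @ e / (n - p)
    # Row r of delta is beta_hat - beta_hat_(r), via the case-deletion formula.
    delta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]
    D = np.einsum("ri,ij,rj->r", delta, XtX, delta) / (p * sigma2_hat)
    return D, e, h, sigma2_hat

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

D, e, h, sigma2_hat = cooks_distance(X, y)
# Equivalent residual/leverage form, used here only as a cross-check.
D_alt = e**2 * h / (p * sigma2_hat * (1.0 - h) ** 2)
print(np.argmax(D) + 1, D.max())   # 1-based index of the largest D_r
```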

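Let $X^TX = GLG^T$ be the spectral decomposition of $X^TX$, where $L = \mathrm{diag}(l_1, \ldots, l_p)$ is a $p \times p$ diagonal matrix consisting of the eigenvalues of $X^TX$, $G = (g_1, \ldots, g_p)$ is a $p \times p$ orthogonal matrix, and $g_i$ is the eigenvector of $X^TX$ associated with the eigenvalue $l_i$. Then $X^TX = \sum_{i=1}^{p} l_i g_i g_i^T$. Hence $D_r$ can be expressed as

$D_r = \frac{1}{p\hat\sigma^2}\sum_{i=1}^{p} l_i\left[(\hat\beta - \hat\beta_{(r)})^Tg_i\right]^2 = \frac{\|\hat\beta - \hat\beta_{(r)}\|^2}{p\hat\sigma^2}\sum_{i=1}^{p} l_i\cos^2\theta_{ri}, \qquad (2.1)$

where $\theta_{ri}$ is the angle between $\hat\beta - \hat\beta_{(r)}$ and $g_i$, and $\|\hat\beta - \hat\beta_{(r)}\|$ is the Euclidean norm of $\hat\beta - \hat\beta_{(r)}$.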
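The two expressions for Cook's distance can be reproduced with an eigendecomposition. The sketch below uses synthetic data (an assumption) and hypothetical variable names; `np.linalg.eigh` returns eigenvalues in ascending order, so they are reversed to put the largest first, matching the ordering used in the tables.

```python
import numpy as np

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(3)
n, p = 25, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
e = y - X @ (XtX_inv @ X.T @ y)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
sigma2_hat = e @ e / (n - p)
delta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]   # row r: beta_hat - beta_hat_(r)

l, G = np.linalg.eigh(XtX)            # ascending eigenvalues, eigenvectors in columns
l, G = l[::-1], G[:, ::-1]            # reorder so that l_1 is the largest

coords = delta @ G                    # coords[r, i] = (beta_hat - beta_hat_(r))^T g_i
components = l * coords**2 / (p * sigma2_hat)   # per-eigenvector pieces of D_r
D = components.sum(axis=1)            # first expression for D_r

norms = np.linalg.norm(delta, axis=1)
cos_theta = coords / norms[:, None]   # cosines of the angles theta_ri
D_from_angles = norms**2 / (p * sigma2_hat) * (l * cos_theta**2).sum(axis=1)

print(np.abs(D - D_from_angles).max())
```

Each row of `components` splits one Cook's distance into the contributions attached to the individual eigenvectors, which is the breakdown examined in the examples.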

The set $\{g_1, \ldots, g_p\}$ forms an orthonormal basis for the $p$-dimensional Euclidean space. The coordinate of $\hat\beta - \hat\beta_{(r)}$ with respect to the $i$th eigenvector $g_i$ is $(\hat\beta - \hat\beta_{(r)})^Tg_i$. In the light of equation (2.1), the terms $l_i$ and $(\hat\beta - \hat\beta_{(r)})^Tg_i$ for each $r$ determine the magnitude of $D_r$. The adoption of $X^TX$ for scaling the Euclidean norm of $\hat\beta - \hat\beta_{(r)}$ is not reasonable, for the following reasons. In real data analyses, the line $V_r$ is not in general parallel to any of the eigenvectors $g_1, \ldots, g_p$. Moreover, even when the line $V_r$ is almost parallel to one of the $g_i$'s, it is nearly orthogonal to the other eigenvectors. For example, if the line $V_r$ is almost parallel to $g_p$, then the component of $D_r$ associated with the eigenvector $g_p$, namely $l_p[(\hat\beta - \hat\beta_{(r)})^Tg_p]^2/(p\hat\sigma^2)$, nearly makes a real contribution to the influence of observation $r$ on $\hat\beta$, while the component of $D_r$ associated with the remaining eigenvectors, $\sum_{i=1}^{p-1} l_i[(\hat\beta - \hat\beta_{(r)})^Tg_i]^2/(p\hat\sigma^2)$, is likely to distort the influence of observation $r$ on $\hat\beta$. Most or all of the terms $(\hat\beta - \hat\beta_{(r)})^Tg_i$ for each observation are computed along directions outside the set $V_r$ over which $\hat\beta - \hat\beta_{(r)}$ is distributed. As a result, these terms reduce or enlarge the value of $D_r$, depending on the values of $l_i$, relative to the real influence of observation $r$, and thereby distort that influence. Hence the use of $D_r$ is likely to underestimate or overestimate the influence of observation $r$ on $\hat\beta$, and the information about influential observations that Cook’s distance provides may not be reliable.

3. Three illustrative examples

We will apply the expressions of $D_r$ in equation (2.1) to three data sets: the Hald data set (Draper and Smith, 1981) and the rat data set (Cook and Weisberg, 1982), both of which were also analyzed by Cook (1977), and the body fat data set (Neter et al., 1996, p. 261). Using the probabilistic behavior of $\hat\beta - \hat\beta_{(r)}$ through the spectral decomposition of its covariance matrix $\mathrm{cov}(\hat\beta - \hat\beta_{(r)})$, Kim (2015) introduced an influence measure $M_r = x_r^T(X^TX)^{-2}x_r/(1 - h_{rr})$ to investigate the influence of deleting an observation on the least squares estimate $\hat\beta$; the problem of deleting multiple cases was considered by Kim (2016). For these three data sets, the result based on the $D_r$ values will be compared with that based on the $M_r$ values.
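A sketch of how the measure $M_r$ might be computed from the formula above is given below. The function name `influence_M` and the synthetic design matrix are assumptions for illustration. The code also confirms numerically that $M_r$ equals the single nonzero eigenvalue of $\mathrm{cov}(\hat\beta - \hat\beta_{(r)})/\sigma^2$.

```python
import numpy as np

def influence_M(X):
    """M_r = x_r^T (X^T X)^{-2} x_r / (1 - h_rr) for every row x_r of X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    V = X @ XtX_inv                      # row r is ((X^T X)^{-1} x_r)^T
    h = np.einsum("ij,ij->i", V, X)      # leverages h_rr
    M = np.einsum("ij,ij->i", V, V) / (1.0 - h)
    return M, V, h

# Synthetic design matrix (assumed). As written, the formula involves only X.
rng = np.random.default_rng(4)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

M, V, h = influence_M(X)
r = int(np.argmax(M))                    # 0-based index of the largest M_r
# cov(beta_hat - beta_hat_(r)) / sigma^2, built from its rank-one form
C = np.outer(V[r], V[r]) / (1.0 - h[r])
eigvals = np.linalg.eigvalsh(C)          # ascending order
print(r + 1, M[r], eigvals[-1])
```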

### 3.1. Hald data

The regression model with the intercept term $\beta_0$ is fitted to the Hald data set, which consists of 13 observations on a single dependent variable and four independent variables. The estimated regression coefficients are $\hat\beta_0 = 62.41$, $\hat\beta_1 = 1.55$, $\hat\beta_2 = 0.51$, $\hat\beta_3 = 0.10$, and $\hat\beta_4 = -0.14$.

Among the $D_r$ values, observation 8 has the largest value, $D_8 = 0.394$, and observation 3 has the second largest, $D_3 = 0.301$. Based on the $D_r$ values, observation 8 is thus identified as the most influential observation. However, the influence measure $M_r$ shows that observation 3 has the largest influence on the least squares estimate of $\beta$ ($M_3 = 879.74$), whereas observation 8 is not significantly influential ($M_8 = 37.16$). To find out why the two results contradict each other, we investigate the sources of the $D_r$ values for observations 3 and 8. The eigenvalues of $X^TX$ and their associated eigenvectors are given in Tables 1 and 2, respectively.

• (a) Each row in Table 3 shows the normalized vector of $\hat\beta - \hat\beta_{(r)}$. The values of $\cos\theta_{ri}$ shown in Table 4 can be considered a measure of the closeness of $\hat\beta - \hat\beta_{(r)}$ to the $i$th eigenvector $g_i$ of $X^TX$: as the direction of $\hat\beta - \hat\beta_{(r)}$ gets close to that of $g_i$, the absolute value of $\cos\theta_{ri}$ approaches one. We note from Tables 2 and 3 that both vectors $\hat\beta - \hat\beta_{(3)}$ and $\hat\beta - \hat\beta_{(8)}$ are almost parallel to the last eigenvector $g_5$ of $X^TX$, which is confirmed by the $\cos\theta_{ri}$ values in the last column of Table 4. The columns of Table 4 for $i = 1, \ldots, 4$ show that both vectors are almost orthogonal to each of the eigenvectors $g_1, \ldots, g_4$ of $X^TX$.

• (b) For observations 3 and 8, the ratios $\cos^2\theta_{8i}/\cos^2\theta_{3i}$ listed in Table 5 are much larger than one for all $i = 1, 2, 3, 4$. Hence $\hat\beta - \hat\beta_{(8)}$ lies closer to each of the four axes $g_1, \ldots, g_4$ than $\hat\beta - \hat\beta_{(3)}$ does.

• (c) Since the line $V_3$ over which $\hat\beta - \hat\beta_{(3)}$ is distributed is almost parallel to the eigenvector $g_5$ of $X^TX$, the component of $D_3$ associated with $g_5$, namely $l_5[(\hat\beta - \hat\beta_{(3)})^Tg_5]^2/(p\hat\sigma^2)$, nearly makes a real contribution to the influence of observation 3 on $\hat\beta$. These components of $D_r$ ($r = 3, 8$) are given in Table 6. The proportion of $l_5[(\hat\beta - \hat\beta_{(r)})^Tg_5]^2/(p\hat\sigma^2)$ to $D_r$ is about 79% for observation 3 and about 7% for observation 8. On the other hand, the line $V_3$ is almost orthogonal to all of the eigenvectors $g_1, \ldots, g_4$ of $X^TX$, and therefore the component of $D_3$ associated with $g_1, \ldots, g_4$, namely $\sum_{i=1}^{4} l_i[(\hat\beta - \hat\beta_{(3)})^Tg_i]^2/(p\hat\sigma^2)$, is likely to distort the influence of observation 3 on $\hat\beta$. Observation 8 can be interpreted similarly. The difference of $\sum_{i=1}^{4} l_i[(\hat\beta - \hat\beta_{(r)})^Tg_i]^2/(p\hat\sigma^2)$ between observations 3 and 8 is approximately −0.303, while the difference of $l_5[(\hat\beta - \hat\beta_{(r)})^Tg_5]^2/(p\hat\sigma^2)$ between observations 3 and 8 is approximately 0.210. The extent to which the distance $D_8$ distorts the influence of observation 8 on $\hat\beta$ is far more severe than that for $D_3$. Thus the component $\sum_{i=1}^{4} l_i[(\hat\beta - \hat\beta_{(r)})^Tg_i]^2/(p\hat\sigma^2)$ makes the value of $D_8$ large, while it makes the value of $D_3$ relatively small. Hence the distance $D_8$ enlarges the influence of observation 8 on $\hat\beta$, while the distance $D_3$ reduces the influence of observation 3.

This is why the $D_r$ values identify observation 8, which is not significantly influential, as the most influential one and fail to detect observation 3 as the most influential one.
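The arithmetic behind the comparison in (c) can be retraced from the rounded entries of Table 6. The short sketch below does so; the dictionary layout and key names are hypothetical, and because the inputs are rounded to three decimals the computed shares only approximate the percentages quoted in the text.

```python
# Values copied from Table 6 (rounded to three decimals in the source).
table6 = {
    3: {"D": 0.301, "others": 0.065, "g5": 0.236},
    8: {"D": 0.394, "others": 0.368, "g5": 0.026},
}

shares = {}
for r, row in table6.items():
    # Share of D_r carried by the component along g_5.
    shares[r] = row["g5"] / row["D"]
    print(r, round(row["others"] + row["g5"], 3), f"{shares[r]:.1%}")

# Differences between observations 3 and 8 for the two components.
diff_others = table6[3]["others"] - table6[8]["others"]
diff_g5 = table6[3]["g5"] - table6[8]["g5"]
print(round(diff_others, 3), round(diff_g5, 3))
```

The two components add up to the tabulated $D_r$ in each row, and the differences agree with the values −0.303 and 0.210 quoted above.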

Even though the $D_r$ value was introduced as an overall measure of the combined influence of observation $r$ on all of the estimated regression coefficients, it would be desirable for the $D_r$ values also to reveal influential observations for each regression coefficient; they do not. The $D_r$ values assert that deletion of observation 8 produces the largest change in $\hat\beta$. However, deletion of observation 8 does not bring about a significant change in any estimated regression coefficient, while deletion of observation 3 produces the largest change in all of the estimated regression coefficients, as shown in what follows. Numerical computations of the values $\hat\beta_k - \hat\beta_{k(r)}$, $k = 0, 1, \ldots, 4$; $r = 1, \ldots, 13$, show that deletion of observation 3 produces the largest change in $\hat\beta_k$ for all $k = 0, 1, \ldots, 4$. Table 7 shows the change in $\hat\beta_k$ due to deletion of each of observations 3 and 8. After removal of observation 8 from the sample, numerical computations based on the remaining sample of size 12 show that deletion of observation 3 still produces the largest change in $\hat\beta_k$ for all $k = 0, 1, \ldots, 4$, as listed in Table 8. After removal of observation 3 from the sample, numerical computations based on the remaining sample of size 12 show that deletion of observation 4 produces the largest change, −75.77, in $\hat\beta_0$; deletion of observation 11 produces the largest change, 0.77, in $\hat\beta_1$; deletion of observation 4 produces the largest change, 0.79, in $\hat\beta_2$; deletion of observation 11 produces the largest change, 0.90, in $\hat\beta_3$; and deletion of observation 4 produces the largest change, 0.75, in $\hat\beta_4$. We note that the $M_r$ values provide useful information about influential observations for each regression coefficient.
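The per-coefficient comparison described above amounts to tabulating $\hat\beta_k - \hat\beta_{k(r)}$ for every $k$ and $r$ and locating the largest absolute change in each column. The sketch below shows one way to do this with the case-deletion formula; the data are synthetic (an assumption, not the Hald data) and the variable names are hypothetical.

```python
import numpy as np

# Synthetic stand-in with the same shape as the Hald design (13 x 5); assumed.
rng = np.random.default_rng(5)
n, p = 13, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# delta[r, k] = beta_hat_k - beta_hat_k(r) for observation r+1 and coefficient k.
delta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]

# For each coefficient k, the 1-based observation whose deletion changes it most.
most_influential = np.abs(delta).argmax(axis=0) + 1
print(most_influential)
```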

### 3.2. Body fat data

We fit the regression model with the intercept term $\beta_0$ to the body fat data set, which has 20 measurements on a single dependent variable and three independent variables. The least squares estimates of the regression coefficients are $\hat\beta_0 = 117.08$, $\hat\beta_1 = 4.33$, $\hat\beta_2 = -2.86$, and $\hat\beta_3 = -2.19$.

Observation 3 has the largest value, $D_3 = 0.299$, and observation 1 has the second largest, $D_1 = 0.279$. The $D_r$ values assert that observation 3 is the most influential observation. However, for the $M_r$ values, observation 1 has the largest value, $M_1 = 401.19$, while observation 3 has $M_3 = 150.22$, which is not even the second largest value. We thus have contradictory results for the body fat data as well. We will seek some reasons for these contradictory results by investigating the sources of the $D_r$ values for observations 1 and 3. Detailed computations are not included here. The four eigenvalues of $X^TX$ are 81290.24, 294.25, 119.82, and 0.00062. The eigenvector corresponding to the last eigenvalue is (0.99909, 0.03012, −0.02583, −0.01592). The Euclidean norm $\|\hat\beta - \hat\beta_{(r)}\|$ is 72.92 for observation 1 and 37.47 for observation 3.

• (a) An investigation of the closeness between the normalized vector of each $\hat\beta - \hat\beta_{(r)}$ and each eigenvector of $X^TX$ shows that $\cos\theta_{r4}$ is −0.9999988 for observation 1 and 0.9999891 for observation 3, which implies that both vectors $\hat\beta - \hat\beta_{(1)}$ and $\hat\beta - \hat\beta_{(3)}$ are almost parallel to the last eigenvector $g_4$ of $X^TX$. Also, both vectors are almost orthogonal to each of the remaining eigenvectors of $X^TX$.

• (b) For observations 1 and 3, the ratio $\cos^2\theta_{3i}/\cos^2\theta_{1i}$ is 5.09, 5.18, and 15.26 for $i = 1, 2, 3$, respectively. Hence $\hat\beta - \hat\beta_{(3)}$ lies closer to each of the three axes $g_1, g_2, g_3$ than $\hat\beta - \hat\beta_{(1)}$ does.

• (c) In the light of the results in (a), among the components of $D_r$ ($r = 1, 3$) given in the first expression of equation (2.1), only the component

$\frac{l_4\left[(\hat\beta - \hat\beta_{(r)})^Tg_4\right]^2}{p\hat\sigma^2}$

nearly makes a real contribution to the influence of observation $r$ on $\hat\beta$, and its value is 0.133 for observation 1 and 0.035 for observation 3. The proportion of this component to $D_r$ is about 48% for observation 1 and about 12% for observation 3. On the other hand, since the line $V_r$ ($r = 1, 3$) is almost orthogonal to all of the remaining eigenvectors $g_1, g_2, g_3$ of $X^TX$, the component of $D_r$ associated with the eigenvectors $g_1, g_2, g_3$, namely

$\frac{1}{p\hat\sigma^2}\sum_{i=1}^{3} l_i\left[(\hat\beta - \hat\beta_{(r)})^Tg_i\right]^2,$

is likely to distort the influence of observation $r$ on $\hat\beta$. This component is 0.146 for observation 1 and 0.264 for observation 3. The extent to which the distance $D_3$ distorts the influence of observation 3 on $\hat\beta$ is more severe than that for $D_1$. The component makes the value of $D_3$ large, while it makes the value of $D_1$ relatively small. Hence the distance $D_3$ enlarges the influence of observation 3 on $\hat\beta$, while the distance $D_1$ reduces the influence of observation 1.

This is why observation 3 has the largest $D_r$ value, $D_3 = 0.299$, even though it is not identified as a significantly influential observation by the $M_r$ values.
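As a quick consistency check on the figures quoted in (c), the two components add up to the reported distances, and the quoted proportions follow from the rounded values:

$0.133 + 0.146 = 0.279 = D_1, \qquad 0.035 + 0.264 = 0.299 = D_3,$

$\frac{0.133}{0.279} \approx 0.48, \qquad \frac{0.035}{0.299} \approx 0.12.$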

Furthermore, the $D_r$ values do not provide useful information about influential observations for each regression coefficient, but the $M_r$ values do, as shown in what follows. Numerical computations of the values $\hat\beta_k - \hat\beta_{k(r)}$, $k = 0, 1, 2, 3$; $r = 1, \ldots, 20$, show that deletion of observation 1 produces the largest change in $\hat\beta_k$ for all $k = 0, 1, 2, 3$: $\hat\beta_0 - \hat\beta_{0(r)}$ is −72.86 for observation 1 and 37.44 for observation 3, $\hat\beta_1 - \hat\beta_{1(r)}$ is −2.12 for observation 1 and 1.02 for observation 3, $\hat\beta_2 - \hat\beta_{2(r)}$ is 1.88 for observation 1 and −0.87 for observation 3, and $\hat\beta_3 - \hat\beta_{3(r)}$ is 1.08 for observation 1 and −0.69 for observation 3. Observation 3 is identified as the most influential one by the $D_r$ values, but it does not have a significant influence on any estimate $\hat\beta_k$ ($k = 0, 1, 2, 3$).
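The coefficient changes listed above can be used to recover the Euclidean norms and the cosines with the last eigenvector reported earlier in this subsection. The sketch below does this with the rounded published figures only; the dictionary names are hypothetical, and because the inputs are rounded the cosines agree with the reported values only approximately (and may slightly exceed one in absolute value).

```python
import math

# Last eigenvector of X^T X as reported in the text (rounded).
g4 = (0.99909, 0.03012, -0.02583, -0.01592)

# Reported changes beta_hat_k - beta_hat_k(r), k = 0, ..., 3 (rounded).
changes = {
    1: (-72.86, -2.12, 1.88, 1.08),
    3: (37.44, 1.02, -0.87, -0.69),
}

results = {}
for r, d in changes.items():
    norm = math.sqrt(sum(c * c for c in d))               # Euclidean norm
    cos4 = sum(c * g for c, g in zip(d, g4)) / norm       # cosine with g4
    results[r] = (norm, cos4)
    print(r, round(norm, 2), round(cos4, 4))
```

The norms come out close to the reported 72.92 and 37.47, and the cosines are close to −1 and 1, consistent with both change vectors lying almost along the last eigenvector.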

### 3.3. Rat data

The regression model with the intercept term $\beta_0$ is fitted to the rat data set, which consists of 19 measurements on a single dependent variable and three independent variables. The least squares estimates of the regression coefficients are $\hat\beta_0 = 0.27$, $\hat\beta_1 = -0.02$, $\hat\beta_2 = 0.01$, and $\hat\beta_3 = 4.18$.

For the $D_r$ values, observation 3 has the largest value, $D_3 = 0.930$. For the $M_r$ values, observation 3 has the largest value, $M_3 = 1864.3$. Both influence measures lead to the same conclusion that observation 3 is the most influential one. We will briefly seek some reasons for the same conclusion. Detailed computations are not included here. The four eigenvalues of $X^TX$ are 565097.6, 20.5, 0.16, and 0.003. The eigenvector $g_4$ corresponding to the last eigenvalue is (0.0213, −0.0052, 0.0005, 0.9998). For observation 3, the Euclidean norm is $\|\hat\beta - \hat\beta_{(3)}\| = 2.684$.

The cosine of the angle between $\hat\beta - \hat\beta_{(3)}$ and $g_4$ is 0.9993, which implies that $\hat\beta - \hat\beta_{(3)}$ is almost parallel to the last eigenvector $g_4$ of $X^TX$ and that it is almost orthogonal to each of the remaining eigenvectors of $X^TX$. Hence, among the components of $D_3$, only the component $l_4[(\hat\beta - \hat\beta_{(3)})^Tg_4]^2/(p\hat\sigma^2)$ nearly makes a real contribution to the influence of observation 3 on $\hat\beta$, and its value is 0.781. The remaining component $\sum_{i=1}^{3} l_i[(\hat\beta - \hat\beta_{(3)})^Tg_i]^2/(p\hat\sigma^2)$ is 0.149. The proportion of $l_4[(\hat\beta - \hat\beta_{(3)})^Tg_4]^2/(p\hat\sigma^2)$ to $D_3$ is about 84%, which is very high. Therefore the distance $D_3$ reflects the real influence of observation 3 on $\hat\beta$ to a large extent, so that the value $D_3$ can yield the same result as the value $M_3$.
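The reported figures for the rat data are mutually consistent, as the following arithmetic on the rounded values shows:

$0.781 + 0.149 = 0.930 = D_3, \qquad \frac{0.781}{0.930} \approx 0.84.$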

4. Concluding remarks

The distance $D_r$ is defined by scaling $\hat\beta - \hat\beta_{(r)}$ using the matrix $X^TX$. In general, most of the eigenvectors of $X^TX$ are not parallel to the line $V_r$ over which $\hat\beta - \hat\beta_{(r)}$ is distributed. The distance $D_r$ therefore inevitably includes the component associated with the axes other than the axis determined by the line $V_r$. This component of $D_r$ can be a source of distortion of the influence of observation $r$ on $\hat\beta$. Hence the information about influential observations that Cook’s distance provides may not be reliable. The first two examples analyzed in the previous section show defects of the distance $D_r$ as an influence measure, while all three examples show that the $M_r$ values can be a useful influence measure.

TABLES

### Table 1

The eigenvalues of $X^TX$

| $l_1$ | $l_2$ | $l_3$ | $l_4$ | $l_5$ |
|---|---|---|---|---|
| 44676.2060 | 5965.4221 | 809.9521 | 105.4187 | 0.0012 |

### Table 2

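The eigenvectors of $X^TX$

| $g_1$ | $g_2$ | $g_3$ | $g_4$ | $g_5$ |
|---|---|---|---|---|
| −0.01699 | 0.00372 | 0.00004 | −0.01104 | 0.99979 |
| −0.12789 | −0.04278 | −0.64590 | −0.75134 | −0.01028 |
| −0.83968 | −0.50922 | −0.01812 | 0.18763 | −0.01030 |
| −0.19842 | 0.07211 | 0.75572 | −0.61985 | −0.01052 |
| −0.48881 | 0.85653 | −0.10665 | 0.12626 | −0.01010 |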

### Table 3

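Normalized vectors of $\hat\beta - \hat\beta_{(r)}$

| Number | Coordinate 1 | Coordinate 2 | Coordinate 3 | Coordinate 4 | Coordinate 5 |
|---|---|---|---|---|---|
| 1 | −0.99982 | 0.00772 | 0.01084 | 0.00706 | 0.01158 |
| 2 | 0.99976 | −0.01306 | −0.01003 | −0.01155 | −0.00904 |
| 3 | −0.99979 | 0.01002 | 0.01021 | 0.01065 | 0.01013 |
| 4 | −0.99981 | 0.00799 | 0.01083 | 0.01008 | 0.00964 |
| 5 | −0.99984 | 0.00158 | 0.01304 | 0.00142 | 0.01216 |
| 6 | 0.99982 | −0.00691 | −0.01005 | −0.01004 | −0.01052 |
| 7 | 0.99981 | −0.00095 | −0.01376 | −0.01099 | −0.00820 |
| 8 | 0.99969 | −0.01306 | −0.00889 | −0.01656 | −0.00986 |
| 9 | 0.99979 | −0.01122 | −0.01009 | −0.00975 | −0.01024 |
| 10 | −0.99952 | 0.02344 | 0.00807 | 0.01604 | 0.00885 |
| 11 | −0.99972 | 0.01230 | 0.00941 | 0.01523 | 0.00958 |
| 12 | −0.99979 | 0.01058 | 0.01074 | 0.01015 | 0.00981 |
| 13 | 0.99981 | −0.00893 | −0.01118 | −0.00874 | −0.01007 |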

### Table 4

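The values of $\cos\theta_{ri}$

| $r$ | $i = 1$ | $i = 2$ | $i = 3$ | $i = 4$ | $i = 5$ |
|---|---|---|---|---|---|
| 1 | −0.00016 | 0.00085 | −0.00112 | 0.00436 | −0.99999 |
| 2 | −0.00019 | 0.00081 | 0.00090 | 0.00291 | 1.00000 |
| 3 | 0.00007 | 0.00010 | 0.00027 | 0.00010 | −1.00000 |
| 4 | 0.00017 | −0.00060 | 0.00119 | 0.00204 | −1.00000 |
| 5 | −0.00038 | 0.00009 | −0.00152 | 0.01296 | −0.99991 |
| 6 | −0.00053 | −0.00059 | −0.00178 | −0.00284 | 0.99999 |
| 7 | 0.00087 | 0.00295 | −0.00653 | −0.00713 | 0.99995 |
| 8 | 0.00025 | −0.00084 | −0.00282 | 0.00612 | 0.99998 |
| 9 | −0.00014 | −0.00013 | 0.00120 | 0.00024 | 1.00000 |
| 10 | −0.00030 | −0.00009 | −0.00415 | −0.01389 | −0.99989 |
| 11 | −0.00019 | 0.00027 | 0.00234 | −0.00467 | −0.99999 |
| 12 | −0.00019 | −0.00051 | −0.00045 | 0.00006 | −1.00000 |
| 13 | 0.00020 | 0.00054 | 0.00048 | −0.00228 | 1.00000 |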

### Table 5

The ratios $\cos^2\theta_{8i}/\cos^2\theta_{3i}$

| $i$ | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Ratio | 12.8 | 73.6 | 109.5 | 3,955 |

### Table 6

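Two components of Cook’s distance for observations 3 and 8

| $r$ | $D_r$ | $\sum_{i=1}^{4} l_i[(\hat\beta - \hat\beta_{(r)})^Tg_i]^2/(p\hat\sigma^2)$ | $l_5[(\hat\beta - \hat\beta_{(r)})^Tg_5]^2/(p\hat\sigma^2)$ |
|---|---|---|---|
| 3 | 0.301 | 0.065 | 0.236 |
| 8 | 0.394 | 0.368 | 0.026 |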

### Table 7

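The change in $\hat\beta_k$ due to deletion of each of observations 3 and 8

| $r$ | $\hat\beta_0 - \hat\beta_{0(r)}$ | $\hat\beta_1 - \hat\beta_{1(r)}$ | $\hat\beta_2 - \hat\beta_{2(r)}$ | $\hat\beta_3 - \hat\beta_{3(r)}$ | $\hat\beta_4 - \hat\beta_{4(r)}$ |
|---|---|---|---|---|---|
| 3 | −76.18 | 0.76 | 0.78 | 0.81 | 0.77 |
| 8 | 25.16 | −0.33 | −0.22 | −0.42 | −0.25 |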

### Table 8

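The change in $\hat\beta_k$ due to deletion of observation 3 after removal of observation 8 from the sample

| $\hat\beta_0 - \hat\beta_{0(r)}$ | $\hat\beta_1 - \hat\beta_{1(r)}$ | $\hat\beta_2 - \hat\beta_{2(r)}$ | $\hat\beta_3 - \hat\beta_{3(r)}$ | $\hat\beta_4 - \hat\beta_{4(r)}$ |
|---|---|---|---|---|
| −49.17 | 0.50 | 0.50 | 0.54 | 0.50 |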

References
1. Chatterjee, S, and Hadi, AS (1988). Sensitivity Analysis in Linear Regression. New York: Wiley.
2. Cook, RD (1977). Detection of influential observation in linear regression. Technometrics. 19, 15-18.
3. Cook, RD, and Weisberg, S (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
4. Draper, NR, and Smith, H (1981). Applied Regression Analysis. New York: Wiley.
5. Kim, MG (2015). Influence measure based on probabilistic behavior of regression estimators. Computational Statistics. 30, 97-105.
6. Kim, MG (2016). Deletion diagnostics in fitting a given regression model to a new observation. Communications for Statistical Applications and Methods. 23, 231-239.
7. Miller, RG (1974). An unbalanced jackknife. Annals of Statistics. 2, 880-891.
8. Neter, J, Kutner, MH, Nachtsheim, CJ, and Wasserman, W (1996). Applied Linear Regression Models. Irwin: McGraw-Hill.
9. Seber, GAF (1977). Linear Regression Analysis. New York: Wiley.