The LSE of β computed without the ith observation is written as β̂(i). Then the change in β̂ due to the deletion of the ith observation is given by
$$\hat{\beta} - \hat{\beta}_{(i)} = \frac{(X^T X)^{-1} x_i e_i}{1 - h_{ii}}, \quad (i = 1, \dots, n),$$
whose derivation can be found in Miller (1974).
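The deletion identity above can be checked numerically. The following sketch (synthetic data and numpy, not from the paper) compares the explicitly refitted estimate β̂(i) with the closed form:

```python
import numpy as np

# Numerical check of beta_hat - beta_hat_(i) = (X^T X)^{-1} x_i e_i / (1 - h_ii).
# X and y are synthetic; this is an illustration, not the paper's data.
rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T          # hat matrix; h_ii = H[i, i]
e = y - X @ beta_hat           # ordinary residuals

i = 4                          # delete the 5th observation (0-based index)
Xi = np.delete(X, i, axis=0)
yi = np.delete(y, i)
beta_hat_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)

# Closed-form change due to deleting observation i
change = XtX_inv @ X[i] * e[i] / (1 - H[i, i])
assert np.allclose(beta_hat - beta_hat_i, change)
```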
The mean vector of β̂ – β̂(i) is zero and its covariance matrix is
$$\operatorname{cov}(\hat{\beta} - \hat{\beta}_{(i)}) = \frac{\sigma^2}{1 - h_{ii}} V_i,$$
where
$$V_i = (X^T X)^{-1} x_i x_i^T (X^T X)^{-1}.$$
An estimator of cov(β̂ – β̂(i)) is
$$\widehat{\operatorname{cov}}(\hat{\beta} - \hat{\beta}_{(i)}) = \frac{\hat{\sigma}^2}{1 - h_{ii}} V_i.$$
The rank of $V_i$ is one for a nonzero vector $x_i$. It is easily shown that $x_i^T (X^T X)^{-2} x_i$ is the only nonzero eigenvalue of $V_i$ and that its associated eigenvector is $(X^T X)^{-1} x_i$. The two vectors β̂ − β̂(i) and the eigenvector $(X^T X)^{-1} x_i$ of $V_i$ lie on the same line. For more details, refer to Kim (2015).
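The eigenstructure of V_i can be verified directly; the sketch below (synthetic X, numpy) checks the rank, the nonzero eigenvalue, and the eigenvector direction:

```python
import numpy as np

# Check that V_i = (X^T X)^{-1} x_i x_i^T (X^T X)^{-1} has rank one, that its
# only nonzero eigenvalue is x_i^T (X^T X)^{-2} x_i, and that the associated
# eigenvector is proportional to (X^T X)^{-1} x_i.  Synthetic X; a sketch.
rng = np.random.default_rng(1)
n, p = 15, 4
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)

i = 0
xi = X[i]
u = XtX_inv @ xi                       # eigenvector direction (X^T X)^{-1} x_i
Vi = np.outer(u, u)                    # V_i = u u^T

eigvals, eigvecs = np.linalg.eigh(Vi)
assert np.linalg.matrix_rank(Vi) == 1
# The largest eigenvalue equals x_i^T (X^T X)^{-2} x_i = u^T u.
assert np.isclose(eigvals[-1], xi @ XtX_inv @ XtX_inv @ xi)
# The top eigenvector is parallel to u.
v = eigvecs[:, -1]
assert np.isclose(abs(v @ u), np.linalg.norm(u))
```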
The change β̂ − β̂(i) is often used for investigating the influence of the ith observation on β̂, usually through a normalization or scaling of the form
$$(\hat{\beta} - \hat{\beta}_{(i)})^T M (\hat{\beta} - \hat{\beta}_{(i)}),$$
where M is an appropriately chosen matrix of order p. In the following two subsections we will discuss two kinds of choices of M.
3.1. A normalization of β̂ − β̂(i) using the Moore–Penrose inverse of its estimated covariance matrix
The covariance matrix of β̂ − β̂(i) is singular and hence not invertible. We follow the line of the proof of Theorem 5.1 in Schott (1997) to obtain the following Moore–Penrose inverse of $V_i$:
$$V_i^+ = \left[x_i^T (X^T X)^{-2} x_i\right]^{-2} V_i.$$
Hence the Moore–Penrose inverse of the estimated covariance matrix is computed as
$$\left[\widehat{\operatorname{cov}}(\hat{\beta} - \hat{\beta}_{(i)})\right]^+ = \left(\frac{\hat{\sigma}^2}{1 - h_{ii}}\right)^{-1} \left[x_i^T (X^T X)^{-2} x_i\right]^{-2} V_i.$$
By using this Moore–Penrose inverse, a normalized distance between β̂ and β̂(i) is obtained as
$$(\hat{\beta} - \hat{\beta}_{(i)})^T \left[\widehat{\operatorname{cov}}(\hat{\beta} - \hat{\beta}_{(i)})\right]^+ (\hat{\beta} - \hat{\beta}_{(i)}) = \frac{e_i^2}{\hat{\sigma}^2 (1 - h_{ii})}.$$
This normalization of β̂ − β̂(i) is just the square of the ith internally Studentized residual (see equation (4.6) in Chatterjee and Hadi (1988) for the internally Studentized residuals).
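This reduction can be confirmed numerically; the following sketch (synthetic data, numpy's `pinv` for the Moore–Penrose inverse) checks that the normalized distance equals the squared internally Studentized residual:

```python
import numpy as np

# Verify that the Moore-Penrose-normalized distance reduces to
# e_i^2 / (sigma_hat^2 (1 - h_ii)).  Synthetic data; a sketch.
rng = np.random.default_rng(2)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)           # usual unbiased variance estimate

i = 7
h = H[i, i]
d = XtX_inv @ X[i] * e[i] / (1 - h)    # beta_hat - beta_hat_(i), closed form
u = XtX_inv @ X[i]
Vi = np.outer(u, u)
cov_hat = sigma2_hat / (1 - h) * Vi    # estimated (singular) covariance
quad = d @ np.linalg.pinv(cov_hat) @ d # normalized distance
assert np.isclose(quad, e[i] ** 2 / (sigma2_hat * (1 - h)))
```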
3.2. Cook’s distance
We will assume hereafter that the error terms have a normal distribution with mean zero and variance σ2. Based on a confidence ellipsoid for β, Cook (1977) introduced a diagnostic measure which can be expressed as
$$D_i = \frac{1}{p}(\hat{\beta} - \hat{\beta}_{(i)})^T \left[\widehat{\operatorname{cov}}(\hat{\beta})\right]^{-1} (\hat{\beta} - \hat{\beta}_{(i)}) = \frac{1}{p \hat{\sigma}^2}(\hat{\beta} - \hat{\beta}_{(i)})^T (X^T X) (\hat{\beta} - \hat{\beta}_{(i)}) = \frac{1}{p}\,\frac{h_{ii}}{1 - h_{ii}}\,\frac{e_i^2}{\hat{\sigma}^2 (1 - h_{ii})}.$$
Cook’s distance Di is a scaled distance between β̂ and β̂(i) using the inverse of the estimated covariance matrix of β̂.
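The equality of the definitional form and the closed form of D_i can be checked for every observation at once; the sketch below (synthetic data, numpy) refits the model with each observation deleted:

```python
import numpy as np

# Compute D_i from its definition (refit without observation i) and from the
# closed form (1/p) * h_ii/(1-h_ii) * e_i^2 / (sigma_hat^2 (1 - h_ii)).
# Synthetic data; a sketch.
rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)

for i in range(n):
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta_hat_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    d = beta_hat - beta_hat_i
    Di_def = d @ XtX @ d / (p * sigma2_hat)
    h = H[i, i]
    Di_closed = (h / (1 - h)) * e[i] ** 2 / (p * sigma2_hat * (1 - h))
    assert np.isclose(Di_def, Di_closed)
```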
3.2.1. On comparing Di to the percentiles of the central F-distribution
The quantity
$$\frac{1}{\sigma^2}(\hat{\beta} - \beta)^T (X^T X) (\hat{\beta} - \beta)$$
has a chi-squared distribution with p degrees of freedom. However, the numerator part of Di
$$\frac{1}{\sigma^2}(\hat{\beta} - \hat{\beta}_{(i)})^T (X^T X) (\hat{\beta} - \hat{\beta}_{(i)})$$
does not in general have a chi-squared distribution, which will be explained in this subsection. To this end, we will use Theorem 9.10 in Schott (1997) restated in the following lemma for easy reference.
Lemma 1. Assume that a random vector x is distributed as a p-variate normal distribution Np(0, Ω) with zero mean vector and positive semidefinite covariance matrix Ω. Let A be a p × p symmetric matrix. Then the quadratic form xᵀAx has a chi-squared distribution with r degrees of freedom if and only if ΩAΩAΩ = ΩAΩ and tr(AΩ) = r.
Note that (β̂ –β̂(i))/σ has a p-variate normal distribution with zero mean vector and covariance matrix Vi/(1 – hii). By putting
$$\Omega = \frac{1}{1 - h_{ii}} V_i \quad \text{and} \quad A = X^T X$$
in Lemma 1, we have
$$\Omega A \Omega A \Omega = \frac{h_{ii}^2}{(1 - h_{ii})^3} V_i \quad \text{and} \quad \Omega A \Omega = \frac{h_{ii}}{(1 - h_{ii})^2} V_i.$$
Comparing the two expressions shows that the first condition, ΩAΩAΩ = ΩAΩ, holds only when hii = 0 or hii = 1/2. The second condition becomes
$$\operatorname{tr}(A\Omega) = \frac{h_{ii}}{1 - h_{ii}}.$$
Thus we have the following theorem.
Theorem 1. For each i, the numerator part of Di
$$\frac{1}{\sigma^2}(\hat{\beta} - \hat{\beta}_{(i)})^T (X^T X) (\hat{\beta} - \hat{\beta}_{(i)})$$
has a chi-squared distribution with one degree of freedom only when hii = 1/2.
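The two conditions of Lemma 1 with this choice of Ω and A can be inspected numerically; the sketch below (synthetic X, numpy) confirms the identities above and that the chi-squared condition fails for a generic leverage hii ≠ 1/2:

```python
import numpy as np

# Numerical check of Lemma 1's conditions with Omega = V_i/(1-h_ii), A = X^T X.
# Omega A Omega A Omega = (h/(1-h)) * Omega A Omega, so the first condition
# holds only when h_ii = 0 or 1/2; tr(A Omega) = h_ii/(1-h_ii).  A sketch.
rng = np.random.default_rng(4)
n, p = 12, 3
X = rng.normal(size=(n, p))
XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
H = X @ XtX_inv @ X.T

i = 0
h = H[i, i]
u = XtX_inv @ X[i]
Omega = np.outer(u, u) / (1 - h)       # V_i / (1 - h_ii)
A = XtX

OAO = Omega @ A @ Omega
OAOAO = OAO @ A @ Omega
assert np.allclose(OAOAO, (h / (1 - h)) * OAO)       # identity from the text
assert np.isclose(np.trace(A @ Omega), h / (1 - h))  # trace condition
# For a generic observation h_ii != 1/2, so the first condition fails:
if not np.isclose(h, 0.5):
    assert not np.allclose(OAOAO, OAO)
```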
Cook (1977) suggests comparing each Di to the percentiles of the central F-distribution F(p, n − p). However, Di does not strictly have an F-distribution (see p. 120 of Chatterjee and Hadi, 1988). Moreover, Theorem 1 shows that the numerator part of Di does not have a chi-squared distribution except in the case hii = 1/2. Hence the general use of the F-distribution as a reference distribution for Di is inappropriate.
3.2.2. On the choice of XTX as a scaling matrix
First, in the following theorem we state a distributional property of a random vector with a singular covariance matrix that is helpful for understanding the rest of this subsection.
Theorem 2. Assume that a random vector x is distributed as a p-variate normal distribution Np(0, Ω), where the rank of the covariance matrix Ω is q with 1 ≤ q < p. Then x takes values in the column space of Ω with probability one.
Proof: Let the spectral decomposition of Ω be
$$\Omega = \Gamma \Lambda \Gamma^T,$$
where Γ is an orthogonal matrix with kth column γk (k = 1, . . . , p) and Λ is the diagonal matrix diag(λ1, . . . , λq, 0, . . . , 0) with positive eigenvalues λ1, . . . , λq of Ω. The set {γ1, . . . , γp} forms an orthonormal basis for ℝp.
Since Ω is symmetric, the row space of Ω is identical to the column space of Ω. Let R(Ω) be the column space of Ω and N(Ω) be the null space of Ω. The set {γ1, . . . , γq} is an orthonormal basis for R(Ω), while the set {γq+1, . . . , γp} is an orthonormal basis for N(Ω). Since R(Ω) is the orthogonal complement of N(Ω), we have
$$x \in R(\Omega) \quad \text{if and only if} \quad \gamma_j^T x = 0 \ \text{for all } j = q+1, \dots, p,$$
which yields
$$\{x \notin R(\Omega)\} = \bigcup_{j=q+1}^{p} \{\gamma_j^T x \neq 0\}.$$
For each j = q + 1, . . . , p, the mean of $\gamma_j^T x$ is zero and its variance is $\gamma_j^T \Omega \gamma_j = 0$. Hence
$$P(\gamma_j^T x = 0) = 1.$$
Thus it follows that P(x ∈ R(Ω)) = 1, that is, x takes values in the column space of Ω with probability one.
In association with Theorem 2, it is well known that a p-variate normal random vector with a singular covariance matrix Ω does not have a probability density function with respect to the usual Lebesgue measure on ℝp. However, it has a probability density function with respect to the Lebesgue measure restricted to the column space of Ω (Khatri, 1968) and the support of this distribution is the column space of Ω.
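Theorem 2 can be illustrated by simulation; the sketch below (numpy, a hypothetical rank-one Ω in ℝ²) draws from the singular normal and checks that every draw lies in the column space of Ω:

```python
import numpy as np

# Simulate a singular bivariate normal N(0, Omega) with rank-one Omega and
# check that every draw lies (numerically) in the column space of Omega,
# as Theorem 2 asserts.  A sketch with an arbitrary direction u.
rng = np.random.default_rng(5)
u = np.array([1.0, 2.0])
Omega = np.outer(u, u)                 # rank-one covariance matrix

# Draw x = z * u with z ~ N(0, 1); then cov(x) = u u^T = Omega.
z = rng.normal(size=1000)
samples = z[:, None] * u               # each row is one draw of x

# The component orthogonal to u (i.e., outside the column space) is zero.
u_perp = np.array([-2.0, 1.0])         # orthogonal to u
assert np.allclose(samples @ u_perp, 0.0)
```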
We consider the spectral decomposition of XTX expressed as
$$X^T X = G L G^T,$$
where L = diag(l1, . . . , lp) is a p × p diagonal matrix consisting of the eigenvalues of XTX, G = (g1, . . . , gp) is a p × p orthogonal matrix, and gk is the standardized eigenvector of XTX associated with the eigenvalue lk. Then each Di can be expressed as
$$D_i = \frac{1}{p \hat{\sigma}^2} \sum_{k=1}^{p} l_k \left[(\hat{\beta} - \hat{\beta}_{(i)})^T g_k\right]^2.$$
The terms lk and (β̂ – β̂(i))Tgk play a specific role in determining the magnitude of Di for each i.
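The spectral expression for D_i can be verified against the direct quadratic form; the following sketch (synthetic data, numpy's `eigh`) checks the identity:

```python
import numpy as np

# Express D_i through the spectral decomposition X^T X = G L G^T and check
# it matches the direct quadratic form.  Synthetic data; a sketch.
rng = np.random.default_rng(6)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - p)

l, G = np.linalg.eigh(XtX)             # eigenvalues l_k, eigenvectors g_k

i = 2
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
d = beta_hat - np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)

Di_direct = d @ XtX @ d / (p * sigma2_hat)
Di_spectral = np.sum(l * (d @ G) ** 2) / (p * sigma2_hat)
assert np.isclose(Di_direct, Di_spectral)
```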
Since the rank of cov(β̂ − β̂(i)) is one, the column space of cov(β̂ − β̂(i)) is the line generated by the eigenvector (XTX)−1xi of Vi, which is a one-dimensional subspace of ℝp. Theorem 2 shows that β̂ − β̂(i) is distributed entirely along the line generated by (XTX)−1xi. This line is the support of the distribution of β̂ − β̂(i).
The set {g1, . . . , gp} is an orthonormal basis for ℝp. The line generated by the basis vector gk, denoted by Lk, forms one coordinate axis of ℝp for each k. If the coordinate axis Lk does not coincide with the line generated by (XTX)−1xi, then Lk (excluding its origin) lies outside the support of the distribution of β̂ − β̂(i). In this case the quantity (β̂ − β̂(i))Tgk, which is just the coordinate of β̂ − β̂(i) along the axis Lk, is probabilistically meaningless, and the corresponding component of Di, lk[(β̂ − β̂(i))Tgk]²/(pσ̂²), becomes a partial source of distortion of the real influence of the ith observation on β̂.
Either all p coordinate axes, or all but one of them, fail to coincide with the line generated by (XTX)−1xi, in which the random vector β̂ − β̂(i) takes its values with probability one. Hence the distance Di inevitably includes the components lk[(β̂ − β̂(i))Tgk]²/(pσ̂²) (for all k, or for p − 1 values of k) associated with coordinate axes Lk different from that line. These components distort the real influence of the ith observation on β̂ because the coordinates (β̂ − β̂(i))Tgk along such axes are probabilistically meaningless. Hence the adoption of XTX as the matrix scaling the distance between β̂ and β̂(i) is not reasonable, and in general Cook’s distance does not correctly measure the influence of observations on β̂.