The most popular regression model for the analysis of time-to-event data is the Cox proportional hazards model. While the model specifies a parametric relationship between the hazard function and the predictor variables, there is no specification regarding the form of the baseline hazard function. A critical assumption of the Cox model, however, is the proportional hazards assumption: when the predictor variables do not vary over time, the hazard ratio comparing any two observations is constant with respect to time. Therefore, to perform credible estimation and inference, one must first assess whether the proportional hazards assumption is reasonable. As with other regression techniques, it is also essential to examine whether appropriate functional forms of the predictor variables have been used, and whether there are any outlying or influential observations. This article reviews diagnostic methods for assessing goodness-of-fit for the Cox proportional hazards model. We illustrate these methods with a case-study using available R functions, and provide complete R code for a simulated example as a supplement.
The proportional hazards model proposed by Cox (1972) has been widely used in modeling censored survival data. It is both easy to implement and easy to interpret, usually making it the biostatistician’s first model to attempt when faced with time-to-event or survival data. In addition, its usage has extended to fields beyond biostatistics, such as predicting bank failures in finance (Lane
Due to its pervasive applicability, before taking the results from a fitted Cox model as valid, one should address a few questions: is the proportional hazards assumption satisfied? Are the functional forms of the variables appropriate? Are there any outliers or influential observations? To answer these questions, multiple methods have been proposed, many of which rely on different types of residuals of the model.
In this article, we review methods for assessing goodness-of-fit of the Cox proportional hazards model. The rest of this article is organized as follows. In Section 2, we present the notation and important quantities in estimating a Cox model. In Section 3, we discuss the potential problems, and review the literature on tests and graphical methods proposed to identify them. The diagnostic plots and tests for each problem are illustrated with an application regarding dental clinic visits using existing R software in Section 4. We conclude with a brief discussion and suggestions for possible remedies when nonproportionality is identified.
Let
The Cox model as given in the 1972 manuscript specifies the hazard for individual
where
which is independent of time. In the case of a single binary predictor,
The Cox model has subsequently been extended to incorporate time-dependent covariates. We will use the notation
such that the model assumes the covariates have proportionate effects on the hazard function over time (see, e.g., Fisher and Lin, 1999).
Estimation of
where
Assume there are no ties in the failure time data. Let
where Δ
where
Differentiating (
where
The
where
We obtain the maximum partial likelihood estimator by solving the partial likelihood equation
The solution
We may use its inverse, ℐ−1(
Most diagnostic procedures for the Cox model are based on its residuals. There are multiple types of residuals defined for the Cox model, and they can often serve different purposes in model diagnostics. In this section, we review methods to identify (1) violations of the proportional hazards assumption, (2) appropriate functional forms of covariates, (3) outlying observations, and (4) influential observations. We demonstrate the usage of many of these methods in Section 4, and provide implementation details in the Supplemental Materials.
Schoenfeld (1980) proposed a chi-squared goodness-of-fit test statistic for the proportional hazards regression model which utilized a residual of the form
Let
which, when there are no tied event times, is indeed
In practice, we replace
Moreau
Grambsch and Therneau (1994) generalized the approach in Schoenfeld (1982) to test the proportional hazards assumption. Assuming the true hazard function is of the time-varying form
where
with
where
The most common choices of
where
Park and Hendry (2015) showed that the decision of time transformations can have profound implications for the conclusions reached. In addition, they suggested that prior to fitting the model, practitioners should first determine the levels of censoring in their data, as in some cases an alternative model might be more appropriate than the Cox model. Exploratory graphical analysis, such as histograms, should be used to see if there are any outlying survival times. If there are few outliers, the test of Grambsch and Therneau (1994) should be done using the untransformed time. Otherwise, the rank transformation is a better choice. They showed using simulations that, with low levels of censoring, the rank and the KM transformation perform approximately equally well. When the level of censoring increases, the rank transformation tends to outperform the KM and natural log transformations.
Keele (2010) pointed out that, while the test of Therneau and Grambsch has been widely used as it is easy to conduct and interpret, application of the test requires some care due to it being sensitive to several forms of misspecification. Omitted predictors, omitted interactions and nonlinear covariate functional forms can all significantly affect the test result. The paper also emphasized the importance of correcting the functional form for continuous covariates before checking for nonproportionality (see Section 3.2).
Winnett and Sasieni (2001) discussed situations in which the approach of Grambsch and Therneau (1994) might provide misleading estimates of time-varying coefficients and presented an example using Mayo clinic lung cancer data. They also suggested using a compromise between
Despite the fact that the test of Grambsch and Therneau (1994) allows for time-dependent covariates, Grant
Xue
Another residual that assists in evaluating the proportional hazards assumption is the Cox-Snell residual. Cox and Snell (1968) provided a general definition of residuals instead of limiting the scope to only linear models. Kay (1977) used the methods in Cox and Snell (1968) to derive the residuals for the proportional hazards regression model. The Cox-Snell residual for the
where Λ̂0 is the estimated cumulative baseline hazard, which can be obtained using the method of Breslow (1974) as
It was concluded that if the model was correctly specified and no observation was censored, the residuals should approximately exhibit the properties of a random sample of size
The martingale residual, which is a slight modification of Cox-Snell residual, also assists in assessing proportionality. It was first discussed by Lagakos (1981) and later by Barlow and Prentice (1988). Further work was done by Therneau
where
The martingale residual is defined as the martingale residual process at the end of the study, i.e.,
Asymptotically,
Lin
The process
Grønnesby and Borgan (1996) concluded that when
has an approximate
Marzec and Marzec (1997a) established the asymptotic behavior of processes based on sums of weighted martingale-transformed residuals. They developed Kolmogorov-Smirnov and Cramér-von Mises types of omnibus tests using the fact that, in special cases, they appear to be transformed Brownian motions or Brownian bridges. As the derivation is complicated, please see Marzec and Marzec (1997a) for further details.
In addition to formal tests, graphical methods to assess the proportional hazards assumption for categorical predictors have been developed by Cox (1979) and Arjas (1988). Hess (1995) summarized these methods and their extensions, including (1) plotting the Cox model’s estimated survival curves
The aforementioned graphical methods all have one common limitation: they only apply to categorical predictors that have only a few levels. If a predictor has many levels or is continuous, the survival curves and cumulative hazard functions would no longer be informative. Therneau and Grambsch (2000) suggested plotting the cumulative Schoenfeld residuals ordered by survival times against survival times. If the proportional hazards assumption holds, the cumulative sum should be a random walk starting and ending at 0. These plots, however, can be difficult to read.
Martingale residuals, defined in
Henderson and Milner (1991) noticed that plots of the martingale residuals against time, although useful, can exhibit systematic patterns which are not
Grambsch (1995) proposed two aspects from which the martingale residual plot in Therneau
is fitted, and the expected count for the
where
McCullagh and Nelder (1983) recommended plotting the partial residual against
The other aspect mentioned by Grambsch (1995) comes from the penalized likelihood approach of Hastie and Tibshirani (1993). They assumed that the functional form of covariate
which enables estimation of all functional forms at the same time. To avoid overfitting, they maximized the penalized partial likelihood with penalty
A plot of martingale residuals against the linear prediction
Inspired by the deviance residuals for GLM in McCullagh and Nelder (1983), Therneau
where
Noticing that deviance residuals do not have a reference distribution and the normal approximation can sometimes be unsatisfactory (Fleming and Harrington, 1991), Nardi and Schemper (1999) proposed two new types of residuals: (i) the log-odds residual
The score vector
where
In studying the influence of one observation, a general practice is to delete that observation, fit the model again, and compare the parameter estimates with those of the model fit on the complete data. Nevertheless, the Cox model is conceptually different from linear or generalized linear models in that it involves both parametric and nonparametric estimation. Therefore, an observation could be influential in terms of more than just regression coefficients. We review measures of both in this section.
Cain and Lange (1984) presented a method for approximating the influence of individual cases on the Cox model’s parameter estimates. Let
where
Notice that
The partial derivative
Let
We call
Reid and Crépeau (1985) presented influence functions for the Cox model to identify possible influential observations and gave the same statistic (
Storer and Crowley (1985) pointed out that a good estimate of
Pettitt and Daud (1989) discussed the disadvantages of the approaches that try to approximate
where
The weighting scheme of
Let
which is essentially the matrix form of
where
Weissfeld (1990) adopted the idea of Cook (1986) to measure the change in likelihood function by computing its curvature. Originally, in Cook’s work, the change could be caused by perturbations in the score vector or the covariates. Weissfeld (1990) proposed for the Cox model three ways to perturb the data: weighting the observations in the log partial likelihood using a vector
Barlow (1997) proposed a modification of the method in Pettitt and Daud (1989). Their approach replaces ℐ(
In addition to traditional delete-one approaches, Wei and Kosorok (2000) developed case interaction influence measures for unmasking observations masked by other observations in the Cox model. They proposed the following statistic to assess the joint influence of observations
where
Zhu
The dental restoration longevity data, provided by the University of Iowa College of Dentistry’s Geriatric and Special Needs (SPEC) Clinic (see Caplan
We identified 697 unique patients who went to the SPEC Clinic to treat their molars upon their first visit and received restoration in amalgam, composite, or glass ionomer. The follow-up of their visits began on the date of restoration. Any restoration that was replaced with another intracoronal or extracoronal restoration, accessed for endodontic therapy, or extracted was deemed to have undergone an event. If the restoration results in an event, the event date would become the end of follow-up. Restorations that did not incur an event are considered censored up to the date of the patient’s second-to-last visit to any College of Dentistry’s clinic. Among the 697 patients, 228 experienced an event during the follow-up, giving a censoring rate of 67.3%.
We considered the following covariates: Gender, Age at receiving restoration (centered and scaled), Occupation (Faculty, Non-faculty) and Size (Small, Medium, Large). Analysis was performed using the
As suggested in Section 3.2, we should determine the appropriate form of covariates to include in the model before testing for proportionality. Age is the only continuous covariate whose form needs to be assessed. We use the methods of Therneau
As suggested in Section 4.1, we consider including the square of Age (Age2) in the model. We fit two models: one with only linear Age effects (Model 1) and another model with linear and quadratic Age effects (Model 2) to assess the improvement to the model when correcting the functional form. To assess the proportional hazards assumption, we used the cox.zph function from the
The parameter estimates of Model 2 are summarized in Table 3. Restorations for males tend to fail later than for females, while restorations for older patients tend to fail sooner. Compared to large restorations, medium and small restorations are less likely to fail.
As mentioned in Section 3.1.1, when the proportional hazards assumption holds, the Schoenfeld residuals will be close to zero. Therefore a plot of the Schoenfeld residuals against survival times would be informative (Schoenfeld, 1982). Figure 2 shows the plots for Model 2. For all six covariates, the smoothed pointwise confidence bands are all around 0, which again confirms that there is no obvious evidence against the proportional hazards assumption.
For the three categorical covariates (Gender, Occupation, and Size), we also utilize the graphical methods in Section 3.1.4 to check the proportional hazards assumption. For each covariate, we plot the estimated survival curves (
As mentioned in Section 3.1, the martingale residual can be used to graphically assess the proportional hazards assumption as well. We plot the cumulative sum of martingale residuals ordered by Age in Figure 4. The curve fluctuates around zero as expected.
As suggested in Section 3.3, both the martingale residual and the deviance residual are useful for identifying outlying observations, but the deviance residual is less skewed and therefore more useful. We plot both residuals against the linear predictions,
We also use the log-odds residual and the normal deviate residual discussed in Section 3.3 to look for potential outliers. Both the log-odds residual in Figure 6(a) and the normal deviate residual in Figure 6(b) identify the same set of 67 potential outliers, which is bigger than the set of outliers identified by the deviance residual. This set, however, still consists of younger individuals (51.7 vs. 55.1) whose restorations failed very soon.
We use the methods in Section 3.4 to perform influential diagnostics. We first look at influence of observations on parameter estimates and plot the
We also present the likelihood displacement approach in Figure 8. The absence of particularly large likelihood displacements further confirms our conclusion from Figure 7.
With such wide usage across a variety of disciplines, the importance of Cox regression for modeling time-to-event data cannot be overstated. As a consequence, one must consider the validity of the results from such an analysis before making any conclusions. This article presents a review of diagnostic methods for the Cox proportional hazards model, including methods for identifying violations of the proportional hazards assumption, appropriate functional forms of continuous covariates, outlying observations, and influential observations. Using a non-linear functional form of covariates can often improve model fit, while detected outlying or influential observations should be investigated further before taking any action.
Violations of the proportional hazards assumption can be addressed in several ways, the most common include the use of time-varying coefficients and stratified models. Time-varying coefficients, which are left out in this article, make a major source of nonproportionality and are often the alternatives to test against when assessing the proportional hazards assumption. More flexible models that incorporate this effect have been studied by Murphy and Sen (1991), Hastie and Tibshirani (1993), Verweij and van Houwelingen (1995), Sargent (1997), Marzec and Marzec (1997b), Cai and Sun (2003), Tian
The Cox model has also been extended to the analysis of interval-censored survival data. Such models have been studied by Finkelstein (1986), Farrington (2000), Goggins and Finkelstein (2000) and recently Heller (2010). In particular, Farrington (2000) proposed the counterparts to the Cox-Snell, martingale, deviance, and Schoenfeld residuals and illustrated their usage in model diagnostics under the interval-censored framework.
The authors are very grateful to Professor Jun Yan for his helpful discussions and comments on an earlier version of this manuscript.
Articles and their functional forms of
Article | |
---|---|
Cox (1972), Gill and Schumacher (1987), Chappell (1992) | A specified function of time |
Schoenfeld (1980), Moreau | Piecewise constant on non-overlapping time intervals with the constants and intervals predetermined |
Harrell (1986) | |
Lin (1991) | The proposed test is equivalent to g(t) = t when the maximizer of a weighted partial likelihood, β̂w, is based on a one-step Newton-Raphson algorithm staring from β̂ |
Nagelkerke | Let |
Proportionality test results for Model 1 and Model 2
Model 1 | Model 2 | |||
---|---|---|---|---|
| | | | |
Male | 0.586 | 0.444 | 0.429 | 0.513 |
Age | 4.029 | 0.045 | 3.555 | 0.059 |
Age2 | - | - | 0.638 | 0.424 |
Non-Faculty | 0.429 | 0.513 | 0.558 | 0.455 |
SizeMedium | 1.560 | 0.212 | 1.711 | 0.191 |
SizeSmall | 0.298 | 0.585 | 0.416 | 0.519 |
Global | 6.788 | 0.237 | 6.932 | 0.327 |
Cox regression results for tooth restoration failure for the Model 2
Estimate | exp(Estimate) | Standard error | | ||
---|---|---|---|---|---|
Male | −0.221 | 0.802 | 0.137 | −1.612 | 0.107 |
Age | 0.206 | 1.228 | 0.0076 | 2.709 | 0.007 |
Age2 | −0.092 | 0.912 | 0.085 | −1.075 | 0.282 |
Non-Faculty | 0.116 | 1.123 | 0.146 | 0.795 | 0.427 |
SizeMedium | −0.140 | 0.869 | 0.165 | −0.850 | 0.395 |
SizeSmall | −0.510 | 0.601 | 0.169 | −3.018 | 0.003 |