
In survival analysis of observational data, the inverse probability weighting method and the Cox proportional hazards model are widely used when estimating the causal effects of multiple-valued treatment. In this paper, the two kinds of weights have been examined in the inverse probability weighting method. We explain the reason why the stabilized weight is more appropriate when an inverse probability weighting method using the generalized propensity score is applied. We also emphasize that a marginal hazard ratio and the conditional hazard ratio should be distinguished when defining the hazard ratio as a treatment effect under the Cox proportional hazards model. A simulation study based on real data is conducted to provide concrete numerical evidence.
Randomized experiments identify the causal effects of new treatments or new drugs in medical or pharmacological studies. However, randomly assigning treatments to experimental units may not be possible due to ethical or operational difficulties. To estimate causal treatment effects from the data obtained from observational studies, the covariate distributions in treatment groups should be adjusted to be balanced, so that the treatment effects are not confounded with the effects of covariates, or pretreatment variables. It is not easy to adjust covariate distributions when the number of covariates is high; however, the propensity scores proposed by Rosenbaum and Rubin (1983, RR83) can make the required adjustment effectively. RR83 and many succeeding studies covered a two-treatment-group case. However, there may be more than two treatment groups or levels, such as multiple doses of a drug. For a multi-valued treatment, it is not simple to extend the analysis procedure used for a binary treatment. Imbens (2000) extended the propensity scores of RR83 to a multi-valued treatment called generalized propensity scores. Methods for removing the confounding effect by matching or subclassification using generalized propensity scores were dealt with in Yang
This paper deals with the issue of estimating multi-valued treatment effect from the right-censored data obtained from observational study. We focus on IPW method. Austin (2013) discussed on the binary treatment effect using propensity scores. He compared the performance of matching, IPW and other methods through simulation. IPW methods are often used instead of matching when right censored data are given. Matching methods require a large pool of data to find pairs of units with similar characteristics. In survival analyses that require long-term follow-ups, it is not easy to get adequate data. The IPW method has relatively fewer problems with data size. The IPW method has strengths when the values of treatment and covariates change over time (Robins
The paper proceeds as follows. Section 2 describes the IPW method using generalized propensity scores. It also addresses the issue of marginal and conditional HRs. In Section 3, a simulation study based on real data clarifies the topics covered in Section 2. Section 4 summarizes the important points and also mentions limitations when analyzing observational survival data for causal reasoning.
To be able to estimate causal treatment effects from observational data, we need some crucial assumptions. To describe those assumptions some definitions come first. The observed data for each unit
The mean of the difference in responses obtained when two different treatment levels are assigned to the same unit can be defined as the treatment effect. In an observational study, if the observed value of
However, because
Therefore,
For causal reasoning to be possible in observational studies lacking randomization, a key assumption is required for
A more strong assumption (
This assumption indicates that every unit has a positive chance to be in any treatment group called an ‘overlap-of-covariate-distributions assumption’. Combining this assumption with the strong version of Assumption 2, RR83 named it the ‘strong-ignorability assumption’. Including as many observable pre-treatment variables as possible in the covariate vector
We examine two kinds of weights in the IPW method to estimate the causal treatment effect. The inverse probability weights assigned to each unit are determined by the treatment level applied to the unit and the covariate vector representing the unit characteristics. For example, the inverse probability weight applied to a unit with
The inverse probability weight
Hirano and Imbens (2001) explained why we need inverse probability weights for binary treatment case, which can be extended to multi-valued treatment case as follows.
The second equality is established by Assumption 1, the third equality by Assumption 2, and the fourth equality by the definition of the generalized propensity score. The above expression indicates that if the parameter to be estimated can be expressed by the expected values of the counterfactuals, the observed data multiplied by the inverse probability weights may be considered and analyzed as counterfactuals.
However, if
Here, the weighting means that a new pseudo dataset is composed, in which the number of observations increases or decreases by the weights. However, this is different from multiplying each observation by weights as in (
If non-stabilized weights
at treatment
Now we examine the size of pseudo dataset formed by two types of weights, the stabilized and the non-stabilized weights. The non-stabilized weights 1
A slight modification of the above derivation can show that
Pseudo data sets formed by the two weights are not only different in size, but also different in proportions of treatment levels. The proportion of
So, in pseudo dataset obtained by SW, the proportion of each treatment level is maintained as the same as in the observed data. However, for NSW, the distribution of
For NSW, the proportions of all the treatment levels are the same at 1
A large literature, like Cole and Hernan (2004), first describes the non-stabilized weight, and then explains that the stabilized weight can be used to reduce the variance of the treatment effect estimated by the non-stabilized weight. However, you can reduce the variance by multiplying by an arbitrary number less than 1 or by truncating when you have a value greater than a specified value, for example, 10. Therefore, this explanation is not accurate as the rationale of the stabilized weight. We have shown that the stabilized weight is more appropriate than the non-stabilized weight in that both the size and the proportion of each treatment level remain the same while achieving the objective of covariate distribution balance.
For censored data we usually do not represent the treatment effect as an average because the complete value may not be observed. The treatment effect on survival time can be expressed by using a hazard function in the Cox proportional hazards model:
where
is the hazard ratio obtained when treatment
We cannot interpret
However, assuming a Cox model for pseudo dataset, covariates may be included as:
The covariate vector
is the conditional hazard ratio. If the Cox model is fitted to the observed data, not the pseudo dataset, then the analysis model can be expressed as
Generally
As Austin (2013) noted, the conditional causal hazard ratio (
When censored data (
First, the multinomial logit model is fitted to estimate the general propensity score
The above model imposes a constraint of
If the programming language R is used as an analytical tool, the
Fit the Cox model with only a treatment variable
to the pseudo dataset generated using estimated stabilized weights. Then marginal hazard ratio
is fitted, a conditional hazard ratio
When you obtain estimates of the Cox model regression coefficient,
Simulations are conducted to confirm the contents of Section 2. The following is what we like to confirm through a simulation. First, when estimating causal effects from censored data obtained from observational studies, we check how wrong the results can be if we simply estimate them without distinguishing between observational and causal studies. Next, we check how different the results will be depending on if the non-stabilized or stabilized weights are used when applying the IPW method. We also look at what is appropriate as the standard error of the estimate.
To ensure that the simulations are not too distant from the actual situation, the simulations have been done based on the real data. A model was fitted to the real data and then the fitted model was assumed to be the true model, and data for the simulation were generated from it. Matheson
Matheson
The values of
In order to generate a response variable corresponding to the generated treatment
The causal effect of healthy lifestyle on mortality is defined as hazard ratio. The focus shall be on the marginal hazard ratio. This is because the conditional hazard ratio can be estimated without methodology for causal studies, such as the IPW, if the covariate vector satisfies the strong ignorability as described in the previous section. However, the marginal hazard ratio (
The model (
The overall procedure of the simulation is as follows.
0. The parameters of the true model for generating simulated data are determined by fitting the multinomial model and the Cox model to the given actual data. In addition, the true marginal hazard ratios are obtained by the process described above.
1. Generate treatment
2. Generating
3. Repeat the previous Step 1 and Step 2 1,000 times to compare the performance of three estimators of treatment effects, one without using the IPW method, one with the stabilized weights, and one with non-stabilized weights.
To compare the point estimation performance, three measures, bias, relative bias and mean squared error, are computed. Relative bias is calculated as the absolute value of the bias divided by the true marginal hazard ratio, then expressed as a percentage. Actual coverage probability and length of confidence interval are also obtained to compare the interval estimation performance.
The true marginal hazard ratios obtained in Step 0, exp(
Without the application of the IPW method, the marginal hazard ratio cannot be properly estimated. It can be expected that biased estimates will be obtained when model (
We now apply the IPW method, and report the performance of the two estimators obtained when SW and NSW are used. The performance of the two estimators are very similar in bias and mean square error. In the 1,000 simulations, we have found that 1,000 pairs of estimates at each of four levels were very similar. The pseudo dataset generated depends on the weight used (For example, if you use SW, the proportion of each treatment level is maintained as the same as in the observed data, but if you use NSW, the proportions are changed to be all equal). However, the covariate distribution balance in treatment levels is achieved no matter what weight is used (Figure 1). Consequently, there is little difference in the two point estimates of treatment effect by SW and NSW.
We now investigate the interval estimates of the treatment effects when using the IPW method. First, we compare the size of pseudo dataset. The average size of pseudo dataset is expected to be five times the sample size of 12,522, when non-stabilized weights are used. As expected, the average is 62610 with standard deviation of 7.2. When stabilized weights are used, the average is 12522 with standard deviation 0.5. Non-stabilized weights increase the size of pseudo dataset and therefore the standard error of the estimator is underestimated. An underestimated standard error would shorten the confidence interval and would not guarantee a nominal confidence level of 95%. The first row of Table 2 shows that the actual coverage probability dropped below 50%. From the second row, you can see that the problem is solved using the robust standard error. A data size problem does not occur if stabilized weights are used. The third row of Table 2 shows that, if you use stabilized weights, there is no problem with the actual coverage probability even if you do not use the robust standard error. However, the fourth rows of Table 2 and Table 3 show that the robust standard error slightly overestimates the standard error of the estimate, thereby increasing the actual coverage probability above the nominal confidence level and increasing the length. At least in this data there is no need to use the robust standard error if you use stabilized weights. However, when you use non-stabilized weights, you must use the robust standard error.
In survival analysis of observational data, the inverse probability weighting (IPW) method and the Cox proportional hazards model are widely used when estimating the effects of multiple-valued treatment. In this paper, the two kinds of weights have been examined in the IPW method, and the cautionary points have been emphasized when defining the treatment effect as a hazard ratio.
When the IPW method is applied in the survival analysis, the explanation of the valid reasons for the weights is confusing because different literature has a different perspective. In this paper we explain the IPW method from the aspect of composing a pseudo dataset where the confounding effect of covariates is eliminated. The two sets of pseudo data formed by the stabilized and the non-stabilized weights differ in terms of the proportion of the treatment level and the size, there is no difference in that the covariate distribution becomes balanced in treatment levels. It is recommended to use stabilized weights when using IPW method in survival analysis since there seems to be no reason to use non-stabilized weights.
It is important to distinguish between causal and associational hazard ratios when using the Cox model for survival data; however, it is also important to distinguish between marginal and conditional hazard ratios. Austin (2013) distinguished the two hazard ratios, but in some cases, it goes without clearly distinguishing them, as in Sugihara (2010). We emphasized the difference between the two hazard ratios again. The marginal hazard ratio cannot be estimated by fitting a Cox model with only a treatment variable or by fitting a model with covariates to the observed data. Causal inference methods, such as the IPW method, should be applied.
The above facts were confirmed through simulations based on real data. The standard error of the estimator must be estimated in order to perform hypothesis tests on the treatment effect or to obtain confidence intervals. You have to use the robust standard error when non-stabilized weights are applied. It was noted that the robust standard error may overestimate the true standard error when the stabilized weights are applied, but this cannot be generalized due to the limitations of the simulation.
All methods of causal reasoning from the observational study have limitations of relying on the assumption of strong ignorability. Sensitivity analysis for the assumption is sometimes performed because it is not possible to test from observed data whether the assumption holds or not, but sufficient research has not been done yet on the sensitivity analysis in survival analysis.
Point estimation performance: MSE, bias, relative bias in %
MSE | Bias (Relative Bias in %) | |||||||
---|---|---|---|---|---|---|---|---|
trt 1 | trt 2 | trt 3 | trt 4 | trt 1 | trt 2 | trt 3 | trt 4 | |
No IPW | 0.0875 | 0.0215 | 0.0207 | 0.0142 | −0.2560 (14.0) | −0.0927 (6.3) | −0.1041 (7.9) | −0.0726 (6.1) |
IPW: | 0.0303 | 0.0164 | 0.0127 | 0.0122 | 0.0044 (0.2) | 0.0049 (0.3) | 0.0037 (0.3) | 0.0050 (0.4) |
IPW: sw( | 0.0304 | 0.0165 | 0.0128 | 0.0112 | 0.0051 (0.3) | 0.0050 (0.3) | 0.0037 (0.3) | 0.0050 (0.4) |
Interval estimation performance: actual coverage probability of the 95% confidence interval
trt 1 | trt 2 | trt 3 | trt 4 | ||
---|---|---|---|---|---|
IPW: | likelihood | 0.420 | 0.456 | 0.454 | 0.473 |
robust | 0.978 | 0.974 | 0.972 | 0.973 | |
IPW: | likelihood | 0.957 | 0.946 | 0.946 | 0.942 |
sw( | robust | 0.977 | 0.975 | 0.972 | 0.974 |
Interval estimation performance: length of the 95% confidence interval (mean ± standard error)
trt 1 | trt 2 | trt 3 | trt 4 | ||
---|---|---|---|---|---|
IPW: | likelihood | 0.1796 ± 0.0006 | 0.1500 ± 0.0005 | 0.1371 ± 0.0005 | 0.1262 ± 0.0004 |
robust | 0.8149 ± 0.0029 | 0.5671 ± 0.0020 | 0.5021 ± 0.0018 | 0.4728 ± 0.0017 | |
IPW: | likelihood | 0.6821 ± 0.0025 | 0.4833 ± 0.0018 | 0.4300 ± 0.0016 | 0.4055 ± 0.0015 |
sw( | robust | 0.8164 ± 0.0029 | 0.5678 ± 0.0020 | 0.5026 ± 0.0018 | 0.4732 ± 0.0017 |