TEXT SIZE

CrossRef (0)
Fused sliced inverse regression in survival analysis

Jae Keun Yoo

aDepartment of Statistics, Ewha Womans University, Korea
Correspondence to: 1Department of Statistics, Ewha Womans University, 52, Ewhayeodae-gil, Seodaemun-gu, Seoul 03760, Korea. E-mail: peter.yoo@ewha.ac.kr
Received July 21, 2017; Revised September 4, 2017; Accepted September 14, 2017.
Abstract

Sufficient dimension reduction (SDR) replaces original p-dimensional predictors to a lower-dimensional linearly transformed predictor. The sliced inverse regression (SIR) has the longest and most popular history of SDR methodologies. The critical weakness of SIR is its known sensitive to the numbers of slices. Recently, a fused sliced inverse regression is developed to overcome this deficit, which combines SIR kernel matrices constructed from various choices of the number of slices. In this paper, the fused sliced inverse regression and SIR are compared to show that the former has a practical advantage in survival regression over the latter. Numerical studies confirm this and real data example is presented.

Keywords : bivariate slicing, fused sliced inverse regression, sufficient dimension reduction, survival analysis
1. Introduction

Sufficient dimension reduction (SDR) in regression of Y ∈ ℝ1|X ∈ ℝp = (X1, …, Xp)T tries to replace the original p-dimensional predictors X with a lower-dimensional linear projection predictor without loss of information about the conditional distribution of Y|X. That is, SDR seeks to find M ∈ ℝp×q:

$Y⊥X∣MTX,$

where ⫫ stands for independence, and qp.

Statement (1.1) indicates that the conditional distributions of Y|X and Y|MTX are the same; therefore, the dimension of X is replaced by MTX without loss of information about Y|X. A subspace spanned by the columns of M satisfying (1.1) is called a dimension reduction subspace. If the intersection of all dimension reduction subspaces is a dimension reduction subspace, it is called the central subspace . The central subspace is minimal and unique with a restoration that forms the main stream of SDR literature. Hereafter, the true dimension and orthonormal basis matrix of will be denoted as d and η ∈ ℝp×d, respectively. The lower-dimensional linear projection predictor ηTX is called sufficient predictors.

One of the most popular SDR methods should be sliced inverse regression (SIR) (Li, 1991). Implementation of SIR requires a categorization of a response variable Y, called slicing. The selection of the appropriate number of slices are known to be often critical in the application results; in addition, there are no optimal or recommended choices that can be used as a thumb rule. Cook and Zhang (2014) propose a simple solution to overcome this problem by combining sample kernel matrices of SIR constructed from various numbers of slices. Cook and Zhang (2014) call the combining approach fused sliced inverse regression (FSIR). According to Cook and Zhang (2014), the FSIR results in robust basis estimates of to the different numbers of slices, varying 3 to 15.

This paper studies FSIR in survival regression (which is out of date) and compares it with SIR under consideration of various slicing schemes. Slicing with a two dimensional response is more complex than that with a dimensional response since the survival regression takes the observed survival time and the censoring indicator as response variables. Therefore, FSIR is expected to have potential advantages in robustness to the basis estimation of over SIR.

The organization of the paper is as follows. In Section 2, we discuss a fused sliced inverse regression along with applicability to survival regression. Some issues that can arise in slicing under survival regression are also discussed in the same section. Section 3 is devoted to numerical studies and presentation of a real data application. We summarize our work in Section 4. We will define the following notations, which will be used frequently throughout the rest of the paper. A subspace (B) stands for a subspace spanned by the columns of B ∈ ℝp×q. We also define that ∑ = cov(X).

2. Fused sliced inverse regression in survival regression

### 2.1. Fused sliced inverse regression

Before explaining SIR (Li, 1991), the predictor X is normalized to Z = ∑1/2(XE(X)). Letting be the central subspace for a regression of Y|Z, then the relationship that = ∑1/2 holds. Define ηz be p×d orthonormal basis matrix for . Consider the linearity condition: (C1) $E(Z∣ηzTZ)$ is linear in $ηzTZ$. According to Li (1991), a proper subspace of can be constructed under linearity condition:

$S(E(Z∣Y))⊆SY∣Z⇔S(Σ-1E(X∣Y))⊆SY∣X.$

For the exhaustive estimation of , it is typically assumed that (∑1E(X|Y)) = . An estimation of through E(X|Y) is called sliced inverse regression.

In population, the construction of E(Z|Y) should be done without assuming any specific distributional assumption of Y|Z. If Y is categorical with h levels, E(Z|Y = s) is the mean of Z within each category of Y. Following this idea, if Y is continuous or many-valued, Y is transformed to a categorized response with h levels. Then compute E(Z| = s) for s = 1, …, h. This categorization of Y is called slicing, which is done so that each category has an equal number of observations. The SIR constructs the kernel matrix of either MSIR = cov(E(Z|Y)) or MSIR = cov(E(Z|)). The sample algorithm for SIR is summarized as:

• Construct by partitioning the range of Y into h non-overlapping intervals to have equal numbers of the observations as much as possible. Denote ns as the number of observations for = s.

• Compute (Z| = s) = (1/ns)∑=si, where i = Σ̂1(Xi) and ns stands for the sample size in the s the slices.

• Construct SIR as: $M^SIR=cov^ {E (Z∣Y˜)}=∑s=1hnsnE^ (Z∣Y˜=s)E^ (Z∣Y˜=s)T.$

• Perform the spectral decomposition on SIR: $M^SIR:M^SIR=∑i=1pλ^jγ^jγ^jT$, where λ̂1 ≥ · · · ≥ λ̂p ≥ 0 are the ordered-eigenvalues of SIR and γ̂i, i = 1, …, p, is the corresponding eigenvector to λ̂i.

• Determine the structural dimension d. Let be an estimate of d.

• A set of the eigenvectors corresponding to the first largest eigenvalues is the estimate of ηz: $η^z=(γ^1,…,γ^d^).$

• Back-transform to obtain the sample basis estimate for η so that η̂ = Σ̂1/2η̂z.

In the implementation of SIR in practice, the construction of SIR is the main part, which depends heavily on the selection of h. It is therefore expected that the estimation results by SIR may be critically different from the numbers of slices.

### 2.2. Fused sliced inverse regression

To overcome the deficit that SIR is sensitive to the number of slices, Cook and Zhang (2014) propose a fused approach to combine the MSIR obtained from the various number of slices. First, define that

$MFSIR(h)=(MSIR(2),…,MSIR(h)),$

where $MSIR(h)$ stands for the kernel matrix of SIR with h slices.

Since $MFSIR(h)$ is a non-decreasing sequence of $MSIR(k)$, it is straightforward to hold that

$S(MSIR(k))⊆S(MFSIR(h)), k=2,…,h.$

The assumption of $S(MSIR(k))=SY∣Z$ then induces that

$S(MFSIR(h))=SY∣Z and Σ-12S(MFSIR(h))=SY∣X.$

Therefore, $MFSIR(h)$ is a kernel matrix, whose column spans . The estimation of through $MFSIR(h)$ will be called FSIR.

The sample version $M^FSIR(h)$ is constructed by replacing $MSIR(k)$ with $M^SIR(k)$:

$M^FSIR(h)=(M^SIR(2),M^SIR(3),…,M^SIR(h)).$

The inference of through $M^FSIR(h)$ is the same as that through SIR. First, spectral-decompose $M^FSIR(h)$ such that

$M^FSIR(h)=∑i=1pα^iφ^iφ^iT,$

where α̂1 ≥ ·· · ≥ α̂p ≥ 0 are the ordered-eigenvalues of $M^FSIR(h)$ and φ̂i, i = 1, …, p, is its eigenvector. Then, the set of the eigenvectors corresponding to the first largest eigenvalues is the estimate of ηz. Then, η is estimated by Σ̂1/2η̂z.

It should be noted that any choice of h in $M^FSIR(h)$ would give the same asymptotic results, but non asymptotic behavior can be affected by choices. According to Cook and Zhang (2014), the FSIR is quite robust to h, compared to SIR.

### 2.3. Application to survival regression

Survival regression is a study of the conditional distribution of survival time T, given a set of predictors X. The direct regression of T|X is not normally plausible since the true survival time T cannot be fully observed due to censoring. Instead, the data (Yi, δi,Xi), i = 1, …, n, are assumed as n i.i.d. realizations of (T,C,X), where Y = Tδ + C(1 − δ), δ = 0, 1 is an indicator variable with δ = I(C > T), C is a censoring time that is considered right-censoring. Accordingly, SDR in the survival regression seeks to estimate the central subspace :

$T⊥X∣ηTX.$

Since T is not fully unobservable, the direct applications of SIR and FISR on Y|X are not possible to recover . It is therefore necessary to establish a plausible regression up to recover . For this, consider a regression of (T,C)|X. By construction of (T,C)|X, we have . According to Cook (2003), the central subspace of the bivariate regression of (Y, δ)|X is informative to , because . It should be noted that the estimation of is plausible, because (Y, δ,X) are collected for survival analysis. Cook (2003) connects the two regressions of T|X and (Y, δ)|X under the following condition: (C2) CX|(ηTX, T). Then, condition C2 forces statement (2.1) equivalent to (T,C) ⫫ X|ηTX, that is, = . This then also directly implies that = . According to Cook (2003), the equality is expected to hold, because proper containment requires carefully balanced conditions. SIR and FSIR are then directly applicable with bivariate slicing of Y and δ to recover .

### 2.4. Some discussion on bivariate slicing

The bivariate slicing for Y and δ is more systematic than that for Y alone, which is done as follows. First, slice Y into h slices. Each slice is then divided into two groups based on the values of δ. So, total numbers of slices are h × 2. This slice scheme is quite troublesome in some cases because the slice cannot be divided into two groups if δ has only one value of either 0 or 1 in some slices. Chen et al. (1999) showed that this may lead biases in the process of the dimension reduction. It then becomes possible that the order of the slices can be changed. That is, slice Y within each group of δ. This slicing scheme subsequently then results in more unbalanced numbers for observations within each of the final slices, which is against the usual policy of the slice scheme that the SIR pursues. It is also not clear which of the two has more advantageous in practice. It is therefore expected that the SIR application to survival regression is more sensitive to the numbers of slices than to usual regressions to slice the response alone. This discussion implies that the FSIR has potential advantages over the SIR in survival regression.

3. Numerical studies and real data application

### 3.1. Numerical studies

We consider two types of survival regressions, which are accelerated failure time (AFT) and Cox-proportional Hazards (CPH) models. For both models, the predictors of X = (X1, …, X10)T were independently sampled from N(0, 1). For $η0=(1/55)(1,2,3,4,5,0,0,0,0,0)$, one-dimensional linear combination ηTX of X was constructed, so that η has the unit length.

With this predictor setting, the following AFT model, shown in Section 5 of Datta et al. (2007), was generated:

$T∣X~iidexp(ηTX+ɛ),$

where a random error ɛ, which is independent of X, was randomly sampled from N(0, 1). The survival regression of T|X then follows a log-normal distribution. A censoring variable C was also randomly generated from a log-normal distribution $exp{N(c02,2)}$ so that CX.

For CPH models, first, we followed an example in Section 4.2 of Yoo and Lee (2011). Its hazard and baseline hazard rates are λ = exp(ηTX) and λ0(t) = 1, respectively, where t stands for the value of survival time. The second CPH model is constructed by λ = exp(ηTX) and λ0(t) = ωtω1 for a hazard function and baseline hazard rates, respectively. In the second model, the hazard rate increases with ω > 1, and decreases with 0 < ω < 1. If ω = 1, the second CPH model is equivalent to the first one. For the CPH models, a censoring time C was randomly generated from Uniform(0, c0), which is independent of X. For the second CPH model, two values of 0.5 and 2 were considered for ω.

In the AFT and CPH models, the right-censoring scheme was considered, so the observed survival time Y is equal to min(T,C). The censoring status δ is also equal to 1, if C > T, and 0, otherwise. The sufficient predictor ηTX is supposed to be estimated better with smaller censoring percentages since censoring variables are generated independently of X. Therefore, as discussed in Section 2.3, the censoring percentages affect the asymptotic behaviors of SIR and FISR in the basis estimation. For this, c0 was selected to control the average censoring percentages. We considered 30%, 50%, 70% and 80% of censoring. For the second CPH model with ω = 0.5, the case of 80% censoring was ruled out, because numerical errors occurred too often.

To measure how well η is estimated, the trace correlation r (Hooper, 1959) between η and η̂ was considered:

$r=trace (η^(η^Tη^)-1η^Tη (ηTη)-1ηT).$

To covert the correlation (larger, better) to the distance (smaller to better), we define the trace distance of rD = 1 − r. With n = 100 and 1,000 iterations of each model, the averages of rD are computed and reported in Figures 13. In the figure, “Fk.i” and “Sk.i” stand for FSIR and SIR, respectively. The script i in “Fk.i” and “Sk.i” indicates the numbers of slices for bivariate slicing of Y and δ. For example, if i = 6, the number of slices of Y is equal to 3, because δ is categorical with two levels. We considered i = 6, 8, 10 and i = 4, 6, 8, 10 for SIR and FSIR, respectively. For “Fk.i” and “Sk.i”, k varies from 1 to 3 and 1 to 2, respectively. If k = 1, Y is sliced within each category of the censoring status δ. For k = 2, Y is sliced first, and then each slice is re-categorized based on the values of δ. If k = 3, the two cases of k = 1 and 2 are fused.

According to Figures 13, for the AFT and CPH models, the accuracy in the estimation of η deteriorates as the average censoring percentages increase. Comparing the sensitiveness to the number of slices between FSIR and SIR, SIR becomes more sensitive to the number of slices than FISR with larger average censoring percentages. FSIR remains quite robust regardless of the censoring percentages. For both SIR and FSIR, the slicing order of Y and δ seem not to matter in the estimation of η. The numerical studies meet our expectations and confirm the potential advantages of FSIR over SIR in survival regression.

### 3.2. Primary biliary cirrhosis data

For illustration purposes, we use primary biliary cirrhosis (PBC) data in Tibshirani (1997) and Yoo and Lee (2011). The data is collected at the Mayo Clinic between 1974 and 1986 and consists of the following 19 variables with 276 observations:

• Y = the number of days between registration and the earlier of death or censoring

• δ = 1, if Y is time to death; 0 otherwise

• X1 Treatment code: 1 = D-Penicillamine, 2 = placebo

• X2 Age in years

• X3 Gender: 0 = male, 1 = female

• X4 Presence of ascites: absent = 0 or present = 1

• X5 Presence of Hepatomegaly: absent = 0 or present = 1

• X6 Presence of spiders: 0 = no or 1 = yes

• X7 Presence of edema: absent and no diuretic therapy = 0, present but no diuretic therapy or edema resolved by diuretics = 0.5 or present despite diuretic therapy = 1

• X8 Serum bilirubin, in mg/dl

• X9 Serum cholesterol, in mg/dl

• X10 Albumin, in g/dl

• X11 Urine copper, in μg/day

• X12 Alkaline phosphatase, in U/liter

• X13 SGOT, in U/ml

• X14 Triglycerides, in mg/dl

• X15 Platelet count; coded value is number of platelets per cubic ml of blood divided by 1,000

• X16 Prothrombin time, in seconds

• X17 Histologic state of disease, graded 1, 2, 3, or 4

Tibshirani (1997) fitted the data to the CPH model with the 17 predictors, and Yoo and Lee (2011) applied SIR to the PBC data for model-free variable selection.

In the data, the censoring percentage is about 60%, so it is expected that SIR applications with 4, 6, 8, 10 slices yield similar basis estimates. The same slicing schemes as the numerical studies are used for FSIR and SIR. After conducting the various application of FSIR and SIR, the trace distances between $η^Fk.i(m)$ and $η^Sk,i(m)$ are computed given d = m for m = 1, 2, 3. Since d is not be determined in this example, the basis are estimated via various choices of d. The maximum value for d is three, because Sk.4 can yield only a three-dimensional basis. Table 1 reports the average and maximum distances for each of FSIR and SIR instead of comparing all trace distances. In Table 1, (·, ·) next to the distances stands for the cases to give the maximum distances.

Table 1 also indicates that FSIR is clearly more robust to slicing schemes for any choices of d than the SIR. Considering the averages, the FSIR and the SIR seem to have no notable differences for the slicing schemes. However, investigating the maximum distance, FSIR produce more robust estimates than SIR. Especially, with d = 2, the difference between FSIR and SIR in the maximum is 0.16, which seems non-trivial.

4. Conclusion

The goal of the paper is to study FSIR (Cook and Zhang, 2014) in survival regression. Bivariate slicing schemes can affect the estimation of the central subspace by a SIR (Li, 1991) since survival regression considers the observed survival time and censoring status as responses. This issue can be relieved by fusing SIR kernel matrices from the various bivariate slicing schemes. To reach the goal, the FSIR applies to survival regression by considering various slicing schemes, and is compared with SIR.

Numerical studies confirm that the FSIR has potential advantages in the basis estimation in survival regression over SIR. The real data application also shows that the FSIR results in more robust estimates than SIR.

FSIR cannot have a direct application to correlated or clustered survival data since the SIR is based on iid observations. A study along with this side is in progress.

Acknowledgments

The authors are grateful to two reviewers and Associate Editor for insightful comments to improve the paper. For the author Jae Keun Yoo, this work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korean Ministry of Education (NRF-2014R1A2A1A11049389 and 2009-0093827).

Figures
Fig. 1. Averages of the trace distances for the accelerated failure time model in Section 3.1.
Fig. 2. Averages of the trace distances for the Cox-proportional Hazards model with baseline hazards equal to 1 in Section 3.1.
Fig. 3. Averages of the trace distances for the Cox-proportional Hazards model with baseline hazards equal to ωtω1 in Section 3.1.
TABLES

### Table 1

The average and maximum trace distances for each of FSIR and SIR applications to PBC data in Section 3.2

Methodd = 1d = 2d = 3
AveragesFSIR0.0030.0130.025
SIR0.0180.0800.093

MaximumFSIR0.029 (F1.10 & F2.6)0.079 (F1.6 & F1.10)0.101 (F1.10 & F2.6)
SIR0.086 (S1.10 & S2.4)0.263 (S1.4 & S1.10)0.176 (S1.8 & S2.4)

FSIR = fused sliced inverse regression; SIR = sliced inverse regression.

References
1. Chen, CH, Li, KC, and Wang, JL (1999). Dimension reduction for censored regression data. The Annals of Statistics. 27, 1-23.
2. Cook, RD (2003). Dimension reduction and graphical exploration in regression including survival analysis. Statistics in Medicine. 22, 1399-1413.
3. Cook, RD, and Zhang, X (2014). Fused estimators of the central subspace in sufficient dimension reduction. Journal of the American Statistical Association. 109, 815-827.
4. Datta, S, Le-Rademacher, J, and Datta, S (2007). Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO. Biometrics. 63, 259-271.
5. Hooper, JW (1959). Simultaneous equations and canonical correlation theory. Econometrica. 27, 245-257.
6. Li, KC (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association. 86, 316-327.
7. Tibshirani, R (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine. 16, 385-395.
8. Yoo, JK, and Lee, K (2011). Model-free predictor tests in survival regression through sufficient dimension reduction. Lifetime Data Analysis. 17, 433-444.