TEXT SIZE

• •   CrossRef (0) More on directional regression  Kyongwon Kim1,a, Jae Keun Yoo2,a

aDepartment of Statistics, Ewha Womans University, Korea
Correspondence to: 1 Department of Statistics, Ewha Womans University, 52, Ewhayeodae-gil, Seodaemun-gu, Seoul 03760, Korea. E-mail: kimk@ewha.ac.kr
2 Departmentof Statistics, Ewha Womans University, 52, Ewhayeodae-gil, Seodaemun-gu, Seoul 03760, Korea. E-mail: peter.yoo@ewha.ac.kr
Received May 5, 2021; Revised May 24, 2021; Accepted May 24, 2021.
Abstract
Directional regression (DR; Li and Wang, 2007) is well-known as an exhaustive sucient dimension reduction method, and performs well in complex regression models to have linear and nonlinear trends. However, the extension of DR is not well-done upto date, so we will extend DR to accommodate multivariate regression and large p-small n regression. We propose three versions of DR for multivariate regression and discuss how DR is applicable for the latter regression case. Numerical studies confirm that DR is robust to the number of clusters and the choice of hierarchical-clustering or pooled DR.
Keywords : central subspace, fused sliced inverse regression, multivariate regression, pooled approach, sufficient dimension reduction
1. Introduction

As the so-called big data era is coming, high-dimensional data analysis is inevitable, and the meaningful reduction of data is essential to data analysis.

In regression of Y ∈ ℝr|X ∈ ℝp, sufficient dimension reduction (SDR) turns out to be effective, and it pursues to reduce the original p-dimensional predictors X to its lower-dimensional predictor ηTX without loss of information on the conditional distribution of Y ∈ ℝr|X ∈ ℝp, where r ≥ 1, p ≥ 2 and η ∈ ℝp×d with dp. The following conditional independence is equivalent to it,

$Y⫫X∣ηTX,$

where stands for statistical independence.

Such η to satisfy (1.1) can exist multiply, and then we have to choose the minimal one among them. The subspace spanned by the columns of the minimal η is called the central subspace . Throughout the rest of the paper, η and d will stand for an orthonormal basis matrix and the structural dimension of . The d-dimensional linearly transformed predictor ηTX is called sufficient predictors. For further insights about SDR, readers are recommended to read Yoo (2016ab).

For many SDR methodologies, sliced inverse regression (SIR; Li, 1991), sliced average variance estimation (SAVE; Cook and Weisberg, 1991) and directional regression (DR; Li and Wang, 2007) are popular among the others. The SIR considers the first inverse moments E(X|Y) and the SAVE does the second inverse moments cov(X|Y). So, SIR works well, when the regression has linear trend, while SAVE is better with non-linear trend. The DR, however, combines the first and second moments together, so it works best among the three with complex regression to have periodical and non-linear trends in regression.

Both SIR and SAVE has been extensively extended to cover many types of regression such as multivariate regression and large p-small n regression. So far, DR has not had much attention to this matter. The purpose of this paper is to extend the capability of the directional regression to these cases. For this, three versions of direction regression for multivariate regression will be proposed and we discuss how DR can be extended to the latter regression.

The organization of the paper is as follows. Direction regression is reviewed in Section 2. Section 3 is devoted to extending DR to multivariate regression, and its extension to large p-small n regression is discussed in Section 4. Numerical studies are presented in Section 5. We summarize our work in Section 6.

2. Directional regression

DR (Li and Wang, 2007) generalizes the ideas of inverse regression by considering conditional moments of X − X̃, instead of the conditional moment of X, where X̃ denotes an independent copy of X. Here, without loss of generality, we use a standardized version of predictor variables and denote this as Z. The fundamental idea of DR is to estimate a central subspace based on the empirical direction which can be written as {ZiZj|i j, i, j = 1, … , n}. Here, each empirical direction, ZiZj, contributes to describe the patterns of predictor variables and presents the distribution of the direction of Z. Instead of using {ZiZj|i j, i, j = 1, … , n} directly, DR uses rank 1 matrix

${(Zi-Zj) (Zi-Zj)T∣i≠j,i,j=1,…,n},$

to represent each empirical direction. Contour regression (CR; Li et al. , 2005) utilizes the empirical direction in (2.1) by regressing it within a tiny contour of Y, that is, {|YiYj| < ε| ij, i, j = 1, … , n}. Then, the subspaces from this regression form an orthogonal complement to the central subspace. However, this method includes massive computation as we need to calculate $(n2)$ empirical contour directions. DR adopt a more computationally efficient way by using

$E [(Z-Z˜) (Z-Z˜)T∣Y,Y˜],$

which can be explained as regressing Z directly to the paired response (Y, ), where and are independent copies of Y. Li and Wang (2007) present the orthogonal complement of a column space of (2.2) is a subspace of a central subspace under mild assumptions. DR can be formulated as a generalized eigendecomposition problems by using

$MDR=(2Ip-E [(Z-Z˜) (Z-Z˜)T∣Y,Y˜])2,$

as a candidate matrix. Here, similar to CR, the candidate matrix of DR involves Y and and the sample version of MDR can be represented as U or V statistic as,

${1n∑j=1n∑i=1n(2Ip-E [(Z-Z˜) (Z-Z˜)T∣Yi,Yj])2(V statistic),1(n2)∑i≠jn(2Ip-E [(Z-Z˜) (Z-Z˜)T∣Yi,Yj])2(U statistic),$

where both of them require O(n2) operation. However, the candidate matrix in (2.3) can be reformulated without (, ) and this is the explicit advantage of DR over CR. The kernel matrix of DR can be rephrased as

$MDR=2E[E [ZZT∣Y]]2+2E[E [Z∣Y]E [ZT∣Y]]2+2E [E [ZT∣Y]E [Z∣Y]E [Z∣Y]E [ZT∣Y]]-2Ip.$

This alternative version of the candidate matrix implies DR is a combination of SIR and SAVE as MDR hinges on E[Z|Y] and E[ZZT|Y]. Therefore, DR is a more comprehensive method than SIR and SAVE because (2.4) also indicates DR contains the subspaces of SIR and SAVE as its subspace. Furthermore, DR enjoys the same computational complexity as SIR and SAVE, O(n). Finally, under mild assumptions and if there is a nonzero basis with

$BT (2Ip-E [(Z-Z˜) (Z-Z˜)T∣Y,Y˜])2B>0,$

then, DR can exhaustively estimate a central subspace and this can be written as

$span (MDR)=SY∣Z.$

The following is the algorithm to calculate the sample version of the candidate matrix of DR.

• Step 1. Standardize (X1, … , Xn) to = (1, … , n).

• Step 2. Let H1, … , Hs be a partition of the range of Y. Compute,

$Λ^1h=En[Z^I(Y∈Hh)]/En[I(Y∈Hh)],Λ^2h=En[Z^Z^TI(Y∈Hh)]/En[I(Y∈Hh)].$

• Step 3. Let h = En[I(YJh)]. Compute,

$Γ^1=∑h=1sp^hΛ^2h2, Γ^2=(∑h=1sp^hΛ^1hΛ^1hT)2, Γ^3=(∑h=1sΛ^1hTΛ^1h)(∑h=1sΛ^1hΛ^1hT).$

• Step 4. Finally, compute,

$M^DR=2Γ^1+2Γ^2+2Γ^3-2Ip.$

3. Directional regression on multivariate regression

### 3.1. Hierarchical-slicing directional regression

Directional regression (DR) requires the slicing procedure for implementing it in practice. Therefore, even with multi-dimensional response regression, the theoretical foundation of DR is not changed. Rather, the slicing scheme must be changed in its implementation to afford the multivariate responses.

One simple solution to this is hierarchical-slicing. For example, consider four-dimensional responses Y = (y1, y2, y3, y4)T. First, slice one of the four responses first into h1 slices. Then, slice one of the other three responses within each slice. In this way, slice all response variables. It is simple, but there is cost. Even with doing 2 slices per responses, which is the minimal slice size, the total number of slices for the four dimensional responses is 24 = 16. That is, the number of slices increase exponentially. This implies that this hierarchical-slicing is effective to lower-dimensional responses, for example bivariate ones. In this paper, this slicing scheme will be considered only for r = 2, and this will be called hierarchical-slicing DR.

### 3.2. Hierarchical-clustering directional regression

In sliced-based approach, it should be noted that the slicing is required for easy computation of the inverse moments. Therefore, the slicing is done under the philosophy that similar values are in slices.

A good option to follow this philosophy for multivariate responses is clustering. The first attempt to cluster the responses for inverse regression is Stedoji and Cook (2004), in which K-means clustering algorithm is adopted. Recently, Yoo et al. (2020) suggests that Ward-hierarchical clustering would be better than the former clustering method in many cases. So, by accepting the suggestion by Yoo et al. (2020), the responses are replaced with their clusters resulted from Ward-clustering algorithm, and then follow usual DR implementation. This has possible advantages over hierarchical slicing. First, the numbers of groups do not increase exponentially. Second, various numbers of slices can be considered, for example, h = 2, 3, …, so small sample size per slice can be avoided. Third, all asymptotic results based on usual slicing scheme still hold. So, the clustering method is recommended for larger values of r than two. This approach will be called hierarchical-clustering DR.

### 3.3. Pooled directional regression

Although double-slicing or clustering methods are effective and efficient for multivariate responses, it is still problematic that some slices have small sample sizes.

One way to overcome this problem is to pool DR’s application results from each coordinate regression. This approach initially proposed in Yoo et al. (2010) by noting the following relationship between the central subspaces of Y|X and of the coordinate regression Yk|X, k = 1, … , r

$⊕k=1uSYk∣X⊆SY∣X,$

where is the central subspace of Yk|X and ⊕ denotes the direct sum among subspaces ( ).

It directly indicates that pooling the central subspace of Yk|X capture useful information on .

One deficit of this pooling approach is that the exhaustiveness of DR for the central subspace of Yk|X does not guarantee that of . However, according to Yoo et al. (2010), it would not be problematic in practice.

Let $MDR(k)$ be the population kernels of DR for Yk|X. Define $MpDR=1/r∑k=1rMDR(k)$ . Its sample algorithm is as follows.

• Step 1. Compute $M^DR(k)$ for Yk|X, k = 1, … , r, from the usual DR application.

• Step 2. Construct $M^pDR=1/r∑k=1rM^DR(k)$

• Step 3. Spectral-decompose pDR such that $M^pDR=∑i=1pλ^iγ^iγ^iT$, where λ̂1λ̂2 ≥ ··· ≥ λ̂p ≥ 0.

• Step 4. Let Γ̂d = (γ̂1, … , γ̂d) be the eigenvectors corresponding to the first d largest eigenvalues of pDR. Defining that η̂ =∑̂−1/2Γ̂d, is the estimate of .

This approach to estimate is called pooled DR.

4. Direction regression in large p–small n data

The so-called large p-small n regression is quite popular these days, and this is maybe why the dimension reduction is necessary. Since DR requires the inversion of the covariance of X, the direct application of DR to such data is not plausible.

In such case, employing seeded dimension reduction approach to DR would be a possible and realistic route. All quantities in the population DR kernel matrices are computed as X-scale, not Z-scale, and consider it as a seed matrix. Then, the seed matrix is successively projected to ∑, which results a basis of .

However, this is not recommended to do, because the dimension of the seed matrix is too high. Actually, the seed matrix is a p × p matrix, so it would be not good for seed matrix. Also, the DR kernel matrix involves ∑, so the estimation accuracy is not expected to be good. One way to avoid this is to follow Um et al. (2018)’s approach. First, replace the dimension of X with its u-dimensional principal components. The choice of u must be carefully chosen, considering both the eigenvalues of ∑, so that u becomes relatively smaller than n. Then, with u-dimensional principal components, DR is applied in usual way for the additional reduction under the corporation with responses.

The extension of DR to large p-small n regression is not considered in numerical studies in the following section, because it is heavily affected by initial reduction of principal components, and then all the other steps remain the same as Li and Wang (2007).

Lastly, one can develop regularized DR to embracing L-1 or L-2 penalty, and this is definitely a new load and leave it as an open work.

5. Numerical studies and data analysis

### 5.1. Numerical studies

For all numerical studies, the sample sizes were 100 with p = 6 and 200 with p = 20, and each simulation model was iterated 1000 times. We consider hierarchical-slicing, hierarchical-clustering and pooled DR along with the compatible versions of SIR (Setodji and Cook, 2004) and SAVE (Yoo et al., 2010). All dimension reduction methods under consideration is sliced-based, and the same numbers of slices were considered for 4 and 6 slices for p = 6 and 6 and 10 slices for p = 20.

For all simulation models, is spanned by the two columns of η1 = (1, 1, 1, 0, 0, … , 0)T and η2 = (1, 0, 0, 0, 1, 3, 0, … , 0)T. So, the true structural dimension is equal to two.

To measure how the dimension reduction methods under consideration for multivariate regression estimate well, we computed the trace correlation distance rD between η on η̂i, i = 1, … , 1000,

$rD=1-12∑j=12ψj,$

where ψ is the eigenvalues of η(ηTη)−1ηTη̂( η̂Tη̂)−1η̂T. Actually, $12∑j=12ψj$ is called trace correlation coefficient (Hooper, 1959), but it is subtracted from one to make smaller values of rD indicate better estimation.

All predictors Xi and random errors ɛi were independently generated from N(0, 1). And, we consider the following six coordinate regressions:

• $y1=0.4 (η1TX)2+3 sin (η2TX4)+0.2ɛ1$,

• $y2=3 sin (η1TX4)+3 sin (η2TX4)+0.2ɛ2$,

• $y3=0.4 (η1TX)2+abs (η2TX)12+0.2ɛ3$,

• $y4=3 sin (η2TX4)+0.2 (1+(η1TX)2)ɛ4$,

• $y5=η1TX+(η2TX4)2+0.2ɛ5$,

• $y6=η1TX+(1+η2TX)+0.2ɛ6$.

To compare the estimation performances among hierarchical-slicing, hierarchical-clustering and pooled DR, the following four bivariate regressions were constructed: M2-1: Y = (y1, y2)T; M2-2: Y = (y1, y4)T; M2-3: Y = (y2, y3)T; M2-4: Y = (y3, y4)T.

Next, to investigate how well the hierarchical-slicing, hierarchical-clustering and pooled DR and their compatible versions of SIR and SAVE estimate , the following four dimensional-response regressions were considered : M4-1: Y = (y1, y2, y3, y4)T; M4-2: Y = (y1, y2, y5, y6)T; M4-3: Y = (y3, y4, y5, y6)T.

Numerical studies for M2-1–M2-4 and M4-1–M4-3 are summarized by boxplots of rDs in

According to Figure 1, three versions of DR for multivariate regression show quite similar performances, although bivariate slicing would be recommended.

In case of four-dimensional regression, the following common aspects are observed. The DR and SIR are robust to hierarchical clustering or pooled approach. The SAVE is very sensitive to the choice of the approach, and the pooled approach is much better than hierarchical-clustering one. Also, with larger p, SAVE worsens dramatically. As expected, SAVE is sensitive to the number of clusters, and so is SIR, especially in M4-1. The multivariate SIR shows the best estimation results for most simulation models. From various numerical studies, it is confirmed that multivariate DR does not outperform the other SDR methods, but they still produce robust estimation results to the number of clusters and the dimension.

6. Discussion

Sufficient dimension reduction methodologies for multivariate regression and large p-small n regression have not been developed in many ways, even though high-dimensional response regression is getting popular in big data era. An exhaustive estimation method of directional regression (DR; Li and Wang, 2007) is attempted to fill this gap. According to Li and Wang (2007), DR turns out to be an effective dimension reduction tool, when the regression has complex structure to have linear and non-linear trends.

In this paper, we discuss how DR is extended to embracing multivariate regression and large p-small n regression. For multivariate regression, hierarchical clustering replaces usual slicing scheme used in univariate regression and combine all information resulted by DR from each coordinate regression.

Extension of DR to large p-small n regression would be good, as long as one adopts seeded dimension reduction approach (Cook et al. , 2007). However, since DR has to compute the covariance of X within the slice, the seed matrix constructed from DR would be not useful in practice. So, following Um et al. (2018), it is suggested that the predictors are initially reduced by principal component analysis, and then the principal components are re-reduced through DR.

Numerical studies show that multivariate versions of DR do not outperform the same versions of SIR and SAVE. This would be disappointing, but it is confirmed that DR is robust to the number of clusters and the choice of hierarchical-clustering or pooled DR.

The regularized version of DR is left as an open work.

Figures Fig. 1. two-dimensional response regression models; hDR.#, hierarchical-clustering DR with # clusters; hDR.##, hierarchical-slicing DR with (#, #) slices; pDR.#, pooled DR with # slices. Fig. 2. M4-1: Y = (y1, y2, y3, y4)T. Fig. 3. M4-2: Y = (y1, y2, y5, y6)T. Fig. 4. M4-3: Y = (y3, y4, y5, y6)T.
References
1. Cook RD and Weisberg S (1991). Comment: Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 328-332.
2. Cook RD, Li B, and Chiaromonte F (2007). Dimension reduction in regression without matrix inversion. Biometrika, 94, 569-584.
3. Hooper J (1959). Simultaneous equations and canonical correlation theory. Econometrika, 27, 245-256.
4. Li B and Wang S (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102, 997-1008.
5. Li B, Zha H, and Chiaromonte F (2005). Contour regression: a general approach to dimension reduction. The Annals of Statistics, 33, 1580-1616.
6. Li KC (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316-327.
7. Setodji CM and Cook RD (2004). K-means inverse regression. Technometrics, 46, 421-429.
8. Um HY, Won S, An H, and Yoo JK (2018). Case study: application of fused sliced average variance estimation to near-infrared spectroscopy of biscuit dough data. The Korean Journal of Applied Statistics, 31, 835-842.
9. Yoo JK (2016a). Tutorial: Dimension reduction in regression with a notion of sufficiency. Communications for Statistical Applications and Methods, 23, 93-103.
10. Yoo JK (2016b). Tutorial: Methodologies for sufficient dimension reduction in regression. Communications for Statistical Applications and Methods, 23, 95-117.
11. Yoo C, Yoo Y, Um HY, and Yoo JK (2020). On hierarchical clustering in sufficient dimension reduction. Communications for Statistical Applications and Methods, 27, 431-443.
12. Yoo JK, Lee K, and Woo S (2010). On the extension of sliced average variance estimation to multivariate regression. Statistical Methods and Applications, 19, 529-540.