With recent advances in computing technology, it has become possible to perform calculations and modeling on vast amounts of data that were previously intractable. In high-dimensional data modeling, the so-called curse of dimensionality is frequently encountered and is one of the main issues in such data analysis.
In regression of
where ⫫ stands for statistical independence.
For further usage, for
When the dimension of
Here, our interest lies in KIR, which is one of the most widely used SDR methods in multivariate regression. The key step in KIR is to do
The organization of the paper is as follows. Sliced inverse regression and hierarchical inverse regression are reviewed in Section 2. Section 3 is devoted to proposing pooled sliced inverse regression for multivariate regression and two fused approaches for multivariate regression. In Section 4, numerical studies and real data examples are presented. We summarize our work in Section 5.
Understanding sliced inverse regression (SIR) (
Letting
Step 1. Slice
Step 2. Standardize the predictors
Step 3. Calculate the sample means of
Step 4. Do the spectral decomposition of
Step 5. Let Γ̂
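The five steps above can be sketched in code. The following is a minimal illustrative implementation for a univariate response; the equal-size slicing rule and slice-proportion weights are common defaults and are assumptions here, not necessarily the paper's exact choices.

```python
import numpy as np

def sir(X, y, n_slices=5, d=1):
    """Minimal sketch of the five-step SIR algorithm (univariate response)."""
    n, p = X.shape
    # Step 1: slice the sorted response into n_slices groups of near-equal size
    slices = np.array_split(np.argsort(y), n_slices)
    # Step 2: standardize the predictors, Z = (X - mean) Sigma^{-1/2}
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ S_inv_sqrt
    # Step 3: sample means of Z within each slice, weighted by slice proportion
    M = np.zeros((p, p))
    for idx in slices:
        m_h = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m_h, m_h)
    # Step 4: spectral decomposition of the kernel matrix M
    w, v = np.linalg.eigh(M)
    # Step 5: leading d eigenvectors, mapped back to the original X scale
    return S_inv_sqrt @ v[:, ::-1][:, :d]

# toy check: y depends on X only through its first coordinate
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)
beta = sir(X, y, n_slices=10, d=1)
```

In this toy example the returned column should align closely with (1, 0, 0, 0, 0)^T, the single direction spanning the central subspace.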
In the SIR algorithm, directly extending the slicing scheme to multivariate responses quickly runs into the curse of dimensionality. For example, with a five-dimensional response, even two slices per coordinate already require 32 (= 2^{5}) slices. If the data contain 50 observations, some slices must then have only one observation, which leads to unreliable dimension reduction results. Note that grouping the observations by similarity of their responses is the essence of the slicing scheme. When
Instead, hierarchical clustering algorithms offer nestedness and reproducibility, and
For multivariate regression, once the responses are clustered, the clustering replaces Step 1 in the SIR algorithm, and the other steps follow in the same fashion. We call this approach
Although clustering methods are effective and efficient alternatives to the usual slicing scheme for multivariate responses, it is inevitable for some clusters to have small sample sizes.
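The replacement of the slicing step by hierarchical clustering can be sketched as follows. The Ward linkage and the cluster count used here are illustrative assumptions, not necessarily the paper's choices; the resulting labels then play the role of slice labels in the remaining SIR steps.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_slices(Y, n_clusters=8):
    """Group a multivariate response Y by hierarchical clustering
    (Ward linkage is an illustrative choice); the labels replace
    the slice labels of Step 1 in the SIR algorithm."""
    tree = linkage(Y, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 5))            # five-dimensional response
labels = cluster_slices(Y, n_clusters=8)
# labels[i] in {1, ..., 8} marks the "slice" of observation i
```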
To overcome this issue, the following relationship between the central subspaces of
where is the central subspace of
This relation was first observed and utilized by
Following this pooling idea, we newly introduce the following
Step 1. Construct
Step 2. Compute
Step 3. Do the spectral decomposition of
Step 4. Let Γ̂
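The pooling idea behind the steps above can be sketched by computing one univariate SIR kernel per response coordinate and summing them before the spectral step. The equal weighting across coordinates is an assumption of this sketch.

```python
import numpy as np

def sir_kernel(Z, y, n_slices):
    """SIR kernel matrix for one response coordinate, given the
    standardized predictors Z (illustrative helper)."""
    n, p = Z.shape
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m_h = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m_h, m_h)
    return M

def pooled_sir(X, Y, n_slices=5, d=1):
    """Pooled SIR sketch: sum the coordinate-wise SIR kernels,
    then take the leading eigenvectors of the pooled kernel."""
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ S_inv_sqrt
    M = sum(sir_kernel(Z, Y[:, j], n_slices) for j in range(Y.shape[1]))
    w, v = np.linalg.eigh(M)
    return S_inv_sqrt @ v[:, ::-1][:, :d]

# toy check: both response coordinates depend on X through its first coordinate
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=300),
                     np.exp(X[:, 0]) + 0.1 * rng.normal(size=300)])
B = pooled_sir(X, Y, n_slices=5, d=1)
```

Because each coordinate-wise kernel is built from a univariate slicing, the pooled kernel avoids the exponential growth in the number of slices that a joint slicing of the multivariate response would require.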
Let
The case of
Based on this, we newly define
In (
Therefore,
The sample version
Let
Accordingly like (
and the following relation holds for the non-decreasing sequences of
By assuming that
The sample version
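The fused construction can be sketched by accumulating the pooled kernel over a grid of slice counts, so the estimate no longer hinges on a single slicing choice. The slice grid and the equal weights across slice counts are illustrative assumptions of this sketch.

```python
import numpy as np

def sir_kernel(Z, y, n_slices):
    # SIR kernel for one response coordinate and one slicing choice
    n, p = Z.shape
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m_h = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m_h, m_h)
    return M

def fused_pooled_sir(X, Y, slice_grid=(3, 5, 7, 9), d=1):
    """Fused pooled SIR sketch: sum the pooled kernels over a grid of
    slice counts, then take the leading eigenvectors of the fused kernel."""
    n, p = X.shape
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ S_inv_sqrt
    M = np.zeros((p, p))
    for h in slice_grid:                 # fuse over slice counts
        for j in range(Y.shape[1]):      # pool over response coordinates
            M += sir_kernel(Z, Y[:, j], h)
    w, v = np.linalg.eigh(M)
    return S_inv_sqrt @ v[:, ::-1][:, :d]

# toy check: a linear and a quadratic coordinate, both driven by X[:, 0]
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
Y = np.column_stack([X[:, 0], X[:, 0] ** 2]) + 0.1 * rng.normal(size=(300, 2))
B = fused_pooled_sir(X, Y, d=1)
```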
For multivariate regression, one can use KIR, FpSIR, or FHIR. Among the three, FpSIR is recommended as the default, because it provides quite good estimation performance in the various numerical studies presented in the next section. KIR and FHIR require a clustering step, so they cannot be implemented for some data. Moreover, it is known that outliers often distort clustering results, in which case KIR and FHIR may produce poor estimation of . Therefore, one should fit FpSIR first and inspect the results; if the dimension reduction results are not satisfactory, they should be compared with those of FHIR and KIR.
For all numerical studies, the sample size was 100, and each simulation model was iterated 1,000 times. To measure how well the three methods KIR, FHIR, and FpSIR estimate, the absolute value |
The numerical studies are summarized by side-by-side boxplots of |
We considered the following two models, which were investigated in
where
All coordinate regressions in Model 1 have the common linear conditional mean of
Model 2 was designed to compare the estimation performances of the three methods for non-linear conditional means. Since the central subspace of Model 2 is spanned by the columns of (1, 0, 0, . . ., 0)^{T} and (0, 1, 0, . . ., 0)^{T}, which correspond to
Numerical studies for Models 1 and 2 are summarized in
For Model 2,
These numerical studies confirm that the two proposed fused methods, especially FpSIR, outperform the existing KIR in the estimation of , so we can expect potential advantages of FHIR and FpSIR over KIR for the dimension reduction of predictors in multivariate regression.
For illustration purposes, we considered a multivariate regression analyzed in
For this regression, KIR, FHIR and FpSIR were applied with 3, 6 and 9 clusters or slices. The estimated first and second sufficient predictors are reported in
For multivariate regression, there are few sufficient dimension reduction methods, although multi-dimensional responses have become increasingly common in the so-called big data era. Existing inverse regression methods are still persuasive in multivariate regression, but they are sensitive to the number of clusters or slices. A fused approach recently developed by
Numerical studies confirm that both proposed fused methods are robust to the choice of the number of clusters or slices and improve the estimation of the central subspace over the existing
The asymptotic theory of the sample kernel matrices, needed for dimension determination in fused hierarchical inverse regression and fused pooled sliced inverse regression, remains to be derived. Since the two proposed fused methods combine similar kernel matrices, the individual kernel matrices are not independent, so a central limit theorem for dependent data must be applied in the theoretical development. This line of research is in progress.
Dimension tests by the application of KIR, FHIR and FpSIR with 3, 6 and 9 clusters or slices: KIR#, KIR with # clusters; FHIR#, FHIR with # clusters; FpSIR#, FpSIR with # slices
| | KIR3 | KIR6 | KIR9 | FHIR3 | FHIR6 | FHIR9 | FpSIR3 | FpSIR6 | FpSIR9 |
|---|---|---|---|---|---|---|---|---|---|
| H_{0} : | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| H_{0} : | 0.006 | 0.001 | 0.005 | 0.0012 | | | | | |
| H_{0} : | NA | 0.972 | 0.770 | NA | 0.189 | 0.673 | 0.741 | | |