A selective review of nonlinear sufficient dimension reduction
Communications for Statistical Applications and Methods 2024;31:247-262
Published online March 31, 2024
© 2024 Korean Statistical Society.

Sehun Janga, Jun Song1,a

aDepartment of Statistics, Korea University, Korea
Correspondence to: 1 Department of Statistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea. E-mail: junsong@korea.ac.kr
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1C1C1003647, No. RS-2023-00219212, and No. 2022M3J6A1063595) and a Korea University Grant (K2312771).
Received January 21, 2024; Revised January 30, 2024; Accepted January 30, 2024.
 Abstract
In this paper, we explore nonlinear sufficient dimension reduction (SDR) methods, with a primary focus on establishing a foundational framework that integrates various nonlinear SDR methods. We illustrate the generalized sliced inverse regression (GSIR) and the generalized sliced average variance estimation (GSAVE), which fit within this framework. Further, we delve into nonlinear extensions of inverse moments through the kernel trick, specifically examining the kernel sliced inverse regression (KSIR) and kernel canonical correlation analysis (KCCA), and explore their relationships within the established framework. We also briefly explain nonlinear SDR for functional data. In addition, we present practical aspects such as algorithmic implementations. The paper concludes with remarks on the dimensionality problem of the target function class.
Keywords : nonlinear sufficient dimension reduction, central class, GSIR, GSAVE, KSIR, KCCA
1. Introduction

In the rapidly expanding field of data science, we frequently encounter high-dimensional datasets, which present challenges in extracting meaningful information due to the curse of dimensionality. Central to addressing these challenges is sufficient dimension reduction (SDR), a collection of methodologies for supervised dimension reduction without significant loss of regression information. More precisely, classical linear SDR seeks a lower-dimensional linear transformation of a predictor random variable X that preserves the regression information. For example, let X and Y be random elements in ℝ^p and ℝ. The goal of linear SDR is to find β ∈ ℝ^{p×d}, d < p, such that

$Y \perp\!\!\!\perp X \mid \beta^{\top}X. \quad (1.1)$

It implies that the relationship between Y and X is fully described by the d-dimensional random variable β^⊤X. Since d < p, if β is available, we do not need to use all the variables in X; instead, we can replace X with β^⊤X, which means that dimension reduction is possible. The subspace spanned by the columns of β satisfying (1.1) is called a sufficient dimension reduction subspace, and the minimal sufficient dimension reduction subspace is called the central dimension reduction subspace, or central subspace, denoted by 𝒮_{Y|X}. The existence of the central subspace is discussed in Cook (1998).

The representative methods for estimating the central subspace are the sliced inverse regression (SIR) introduced in Li (1991) and the sliced average variance estimation (SAVE) introduced in Cook and Weisberg (1991). Both methods are based on inverse moments, E(X|Y) or var(X|Y). The term “inverse” comes from the fact that we normally deal with E(Y|X) in the regression setting. A more comprehensive review of sufficient dimension reduction is given in Li (2018).

With increasing data complexity, linear indices are often not enough to capture the core characteristics of the relationship between the response and the predictor. Hence, there is a need to extend linear SDR to nonlinear SDR, which leads to the following setup.

$Y \perp\!\!\!\perp X \mid h_1(X), \ldots, h_d(X), \quad (1.2)$

where h_i is a function from the space in which X resides to ℝ for i = 1, . . . , d. In this setting, an SDR subspace is not clearly defined, since a subspace is a concept for linear mappings. Regardless, (1.2) implies that we only need d predictors, h_1(X), . . . , h_d(X), to describe the relation between Y and X ∈ ℝ^p. Wu (2008) and Yeh et al. (2008) extend the SIR to the kernel sliced inverse regression (KSIR), which uses a kernel function to represent h_1, . . . , h_d. Specifically, they use a kernel function κ that generates a reproducing kernel Hilbert space (RKHS), ℋ, by which (1.2) can be reformulated as

$Y \perp\!\!\!\perp X \mid \langle h_1, \kappa(\cdot, X)\rangle, \ldots, \langle h_d, \kappa(\cdot, X)\rangle.$

Hence, we can conclude that the KSIR conducts linear SDR on ℋ. In another direction, Fukumizu et al. (2007) propose the kernel canonical correlation analysis (KCCA). The goal is to find functions attaining sup_{f∈ℋ_X, g∈ℋ_Y} cor(f(X), g(Y)), where ℋ_X and ℋ_Y are RKHSs generated by κ_X and κ_Y, respectively. Like KSIR, the KCCA is a linear problem on ℋ_X × ℋ_Y in that it seeks to maximize a correlation.

Lee et al. (2013) provide a general theory of nonlinear sufficient dimension reduction by introducing the central σ-field, defined through

YXG,

where 𝒢 is a sub σ-field of σ(X). Under mild conditions, there exists a unique minimal such sub σ-field, called the central σ-field. However, a σ-field is an algebraic structure, which is difficult to estimate directly. To enable estimation from data, Lee et al. (2013) consider the set of all functions that are measurable with respect to the central σ-field and square-integrable, and call it the central class, denoted by 𝔖_{Y|X}. To recover the central class from data, the generalized sliced inverse regression (GSIR) and the generalized sliced average variance estimation (GSAVE) were developed. In addition, Li and Song (2017) adapt the GSIR and the GSAVE to functional data, develop f-GSIR and f-GSAVE, and provide asymptotic theories for nonlinear SDR methods. More recently, Song et al. (2023) proposed a nonlinear extension of the principal fitted component, which is based on the likelihood of the predictors.

In this paper, we incorporate various nonlinear SDR methods into the framework of Lee et al. (2013). We show that the target function spaces of KSIR and KCCA are included in the central class and delineate the relationship between KCCA and GSIR. The resulting functions of KCCA and GSIR both lie in the ranges of certain bounded operators.

The organization of the paper is as follows. In Section 2, we introduce the kernel trick used in RKHS-based methods. In Section 3, we deal with the framework of nonlinear SDR and theoretical results for GSIR and GSAVE. Then, in Section 4, we review the KSIR for the case where the response variable is scalar. In Section 5, the KCCA for nonlinear SDR with a multivariate response is considered. Next, Section 6 deals with nonlinear SDR for functional data: the f-GSIR. Implementations of GSIR, GSAVE, KSIR, and KCCA are provided in Section 7. Finally, concluding remarks regarding applications and dimensionality are given in Section 8.

2. Preliminary: Kernel trick

Linear methods can be extended to nonlinear methods through the use of the so-called kernel trick. The key idea of this technique is to map the original data into a higher-dimensional space composed of nonlinear functions of the original data, and then to apply linear methods in this feature space. The most commonly used feature space is the reproducing kernel Hilbert space (RKHS), which facilitates computation and theoretical justification through the representer theorem. Aronszajn (1950) originally developed the concept of the RKHS, which has been widely applied in the probability space context. Schölkopf and Smola (2002) provide a concise introduction to the main ideas of statistical learning theory, support vector machines, and kernel feature spaces, and Berlinet and Thomas-Agnan (2011) provide a rigorous framework and related theories of the RKHS within the context of probability spaces.

The RKHS can be defined in various ways using equivalent statements. One characterization is that an RKHS is a Hilbert space with a reproducing kernel, that is, a function that allows the evaluation of any function in the space at any point through the inner product. For example, suppose that we want to construct an RKHS over a set S in a metric space and that κ(·, ·) is a positive definite kernel on S × S. Then the RKHS over S with reproducing kernel κ can be represented as the completion of the following linear span of kernel functions

$\mathcal{H} = \Big\{\sum_{i=1}^{\infty} a_i \kappa(\cdot, s_i) : \sum_{i=1}^{\infty} a_i^2 \kappa(s_i, s_i) < \infty,\ a_i \in \mathbb{R},\ s_i \in S\Big\},$

where the inner product is,

$\Big\langle \sum_{i=1}^{\infty} a_i \kappa(\cdot, s_i),\ \sum_{j=1}^{\infty} b_j \kappa(\cdot, t_j)\Big\rangle = \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} a_i b_j\, \kappa(s_i, t_j).$

Then the term reproducing comes from the property that, for any f(·) = Σ_{i=1}^∞ a_i κ(·, s_i), we have

$f(s) = \langle f, \kappa(\cdot, s)\rangle.$

Now, suppose that we have a random sample {X_1, . . . , X_n} in ℝ^p. Then we map the original data to the RKHS, obtaining {κ(·, X_1), . . . , κ(·, X_n)} in

$\mathcal{H}_n = \Big\{\sum_{i=1}^{n} a_i \kappa(\cdot, X_i) : a_i \in \mathbb{R}\Big\}.$

By treating κ(·, X_i) as a random element in ℋ_n, a linear method applied to κ(·, X_i) yields a solution of the form Σ_{i=1}^n a_i κ(·, X_i), which is a nonlinear function of the original X_i.
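To make the idea concrete, the following is a minimal Python sketch of the kernel trick, assuming a Gaussian kernel, a toy one-dimensional regression, and a ridge-type linear solve; all names and tuning values are illustrative and are not part of the methods reviewed later. A linear method applied to the mapped data κ(·, X_i) yields a fit of the form Σ_i a_i κ(·, X_i), which is nonlinear as a function of x.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||X_i - Z_j||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

# toy data: y depends on x nonlinearly
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(100)

# "feature map" x -> kappa(., x): a linear (ridge-type) solve on the mapped data
# gives f(x) = sum_i a_i kappa(x, X_i), a nonlinear function of x
K = gaussian_kernel(X, X)
eps = 1e-2                                      # small ridge term keeps K + eps*I invertible
a = np.linalg.solve(K + eps * np.eye(len(y)), y)  # coefficients a_i

x_new = np.array([[0.5]])
f_new = gaussian_kernel(x_new, X) @ a           # evaluate the fitted function at a new point
```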

3. Nonlinear sufficient dimension reduction

In this section, we introduce the framework for nonlinear SDR developed by Lee et al. (2013) and Li and Song (2017). Let (Ω, ℱ, P) be a probability space and let (Ω_X, ℱ_X), (Ω_Y, ℱ_Y), and (Ω_XY, ℱ_XY) be measurable spaces, where Ω_XY = Ω_X × Ω_Y and ℱ_XY = ℱ_X × ℱ_Y. Let X, Y, and (X, Y) be random elements taking values in Ω_X, Ω_Y, and Ω_XY, with distributions P_X, P_Y, and P_XY, respectively. It is assumed that these distributions are dominated by σ-finite measures. Define the σ-fields generated by the random elements as σ(X) = X^{-1}(ℱ_X), σ(Y) = Y^{-1}(ℱ_Y), and σ(X, Y) = (X, Y)^{-1}(ℱ_XY). Let P_{X|Y}(·|·) : ℱ_X × Ω_Y → ℝ be the conditional distribution of X given Y.

Definition 1

A sub σ-field 𝒢 of σ(X) is an SDR σ-field for Y versus X if it satisfies

$Y \perp\!\!\!\perp X \mid \mathcal{G}. \quad (3.1)$

There are many SDR σ-fields for Y versus X. Obviously, σ(X) satisfies (3.1), which means that no reduction is achieved. For this reason, it is natural to seek the minimal SDR σ-field. In Lee et al. (2013), it is shown that if the family of probability measures {P_{X|Y}(·|y) : y ∈ Ω_Y} is dominated by a σ-finite measure, then there is a unique sub σ-field 𝒢_* of σ(X) such that:

  • 𝒢_* satisfies (3.1);

  • if 𝒢 is a sub σ-field of σ(X) that satisfies (3.1), then 𝒢_* ⊆ 𝒢.

Hence, it is natural to define the minimal SDR σ-field to be 𝒢_*. We call 𝒢_* the central σ-field for Y versus X and denote it by 𝒢_{Y|X}.

For further theoretical developments, a mild condition of square integrability is assumed. Let L2(P_XY), L2(P_X), and L2(P_Y) be the spaces of functions defined on Ω_XY, Ω_X, and Ω_Y that are square-integrable with respect to P_XY, P_X, and P_Y, respectively. In the context of SDR, a difference up to constants is irrelevant; therefore, there is no problem in assuming that all functions in these spaces have zero mean. From now on, L2(P_X), L2(P_Y), and L2(P_XY) denote the corresponding centered spaces. Since a σ-field is a rather abstract object, it is difficult to estimate directly. Instead, we work with a class of functions measurable with respect to a target σ-field. Given a sub σ-field 𝒢 of σ(X, Y), let 𝔐_𝒢 be the class of 𝒢-measurable functions in L2(P_XY). For simplicity, we abbreviate 𝔐_{σ(X)} to 𝔐_X. It is easily checked that 𝔐_𝒢 is a subspace of L2(P_XY). Now, we introduce the target class of functions in nonlinear SDR.

Definition 2

Let 𝒢 be an SDR σ-field and 𝒢_{Y|X} be the central σ-field. Then 𝔐_𝒢 is called an SDR class, and 𝔐_{𝒢_{Y|X}} is called the central class, which is denoted by 𝔖_{Y|X}.

The central class is the generalization of the central subspace in the classical linear SDR. In the linear SDR, we model

YXβX,

where X and Y are square-integrable random elements in ℝ^p and ℝ, and β is a matrix in ℝ^{p×d}. For β satisfying the condition (3.2), the corresponding SDR σ-field is σ(β^⊤X), and the central σ-field is σ(β_0^⊤X), where the columns of β_0 span ∩{span(β) : β satisfies the condition (3.2)}. With this setup, the central class can be represented as the set of square-integrable functions that are measurable with respect to σ(f_β : β ∈ 𝒮_{Y|X}), where f_β : ℝ^p → ℝ is the map x ↦ β^⊤x and 𝒮_{Y|X} is the central subspace of linear SDR.

As in linear SDR, we can define the concepts of unbiasedness and exhaustiveness for the nonlinear case.

Definition 3

A class of functions in L2(P_X) is unbiased for 𝒢_{Y|X} if its members are 𝒢_{Y|X}-measurable, and exhaustive for 𝒢_{Y|X} if its members generate 𝒢_{Y|X}.

To describe which classes of functions are unbiased, we introduce the operation ⊖. For two subspaces 𝒜 and ℬ of a generic Hilbert space ℋ, let 𝒜 ⊖ ℬ denote the subspace 𝒜 ∩ ℬ^⊥. Now, we specify an unbiased class of functions, which is called the regression class.

Proposition 1.

(Theorem 2 of Lee et al. (2013)) If the family {P_{X|Y}(·|y) : y ∈ Ω_Y} is dominated by a σ-finite measure, then

$L_2(P_X) \ominus [L_2(P_X) \ominus L_2(P_Y)] \subseteq \mathfrak{S}_{Y|X}.$

If a function f ∈ L2(P_X) ⊖ L2(P_Y), then f(X) is square-integrable and cov(f(X), g(Y)) = 0 for any g ∈ L2(P_Y), which implies that E(f(X)|Y) = 0. Such a function resembles the residual in a regression problem. Hence, we call L2(P_X) ⊖ [L2(P_X) ⊖ L2(P_Y)] the regression class and denote it by ℜ_{Y|X}.

Naturally, the next question is under what condition the regression class is exhaustive for the central class 𝔖_{Y|X}. To this end, we introduce the notion of complete classes of functions in L2(P_X).

Definition 4

Let 𝒢 be a sub σ-field of σ(X). The class 𝔐_𝒢 is said to be complete if, for any g ∈ 𝔐_𝒢, E(g(X)|Y) = 0 a.s. P implies that g(X) = 0 a.s. P.

Completeness is satisfied in many settings that are common in SDR; for the details, see Lee et al. (2013). In classical statistics, a complete sufficient statistic is a minimal sufficient statistic. A similar logic can be applied to the central class.

Proposition 2.

(Theorem 3 of Lee et al. (2013)) Suppose {P_{X|Y}(·|y) : y ∈ Ω_Y} is dominated by a σ-finite measure and 𝒢 is a sub σ-field of σ(X). If 𝔐_𝒢 is a complete and sufficient dimension reduction class, then

$\mathfrak{M}_{\mathcal{G}} = \mathfrak{R}_{Y|X} = \mathfrak{S}_{Y|X}.$

Thus, we can conclude that, with completeness, the regression class is identical to the central class; even without completeness, the regression class is guaranteed to be included in the central class. Moreover, it turns out that the regression class can be characterized as the range of a bounded linear operator. Through this representation, we can apply a simple spectral decomposition method for estimating the regression class: the generalized sliced inverse regression.

3.1. Generalized SIR - GSIR

Before dealing with GSIR, let us introduce a necessary concept: equivalence modulo constants. For two classes of functions A and B, we say that A ⊆ B modulo constants if, for each f ∈ A, there is c ∈ ℝ such that f + c ∈ B, and that A is a dense subset of B modulo constants if (i) A ⊆ B modulo constants and (ii) for each f ∈ B, there are a sequence {f_n} ⊆ A and a sequence of constants {c_n} ⊆ ℝ such that f_n + c_n → f in the topology of B. Let ℋ_X and ℋ_Y be Hilbert spaces of functions on Ω_X and Ω_Y satisfying the following conditions:

  • ℋ_X and ℋ_Y are dense subsets of L2(P_X) and L2(P_Y) modulo constants, respectively.

  • There are constants C_1 > 0 and C_2 > 0 such that var(f(X)) ≤ C_1‖f‖²_{ℋ_X} and var(g(Y)) ≤ C_2‖g‖²_{ℋ_Y} for any f ∈ ℋ_X and g ∈ ℋ_Y.

In fact, condition 2 is related to the boundedness of certain bilinear forms; hence, we assign symbols to the class of bounded operators. For two generic Hilbert spaces ℋ_1 and ℋ_2, let ℬ(ℋ_1, ℋ_2) denote the class of all bounded linear operators from ℋ_1 to ℋ_2, and let ℬ(ℋ_1) abbreviate ℬ(ℋ_1, ℋ_1). We denote the range of a linear operator A by ran(A), the kernel of A by ker(A), and the closure of ran(A) by ran¯(A). Since the bilinear form c : ℋ_X × ℋ_X → ℝ defined by (f, g) ↦ cov(f(X), g(X)) is bounded, there is a unique operator M_XX ∈ ℬ(ℋ_X) such that ⟨f, M_XX g⟩_{ℋ_X} = cov(f(X), g(X)). Similarly, there is a unique operator M_YY ∈ ℬ(ℋ_Y) such that ⟨f, M_YY g⟩_{ℋ_Y} = cov(f(Y), g(Y)). From now on, let 𝒢_X and 𝒢_Y denote ran¯(M_XX) and ran¯(M_YY), respectively. In considering dependency, a difference up to constants is negligible, and if f ∈ ker(M_XX), then var(f(X)) = 0, which means that f is a constant function a.s. Hence, we restrict the spaces under consideration to 𝒢_X and 𝒢_Y. In fact, it is easily shown that 𝒢_X = L2(P_X) and 𝒢_Y = L2(P_Y). Specifically, f ⊥ L2(P_X) implies that cov(f(X), g(X)) = 0 for any g ∈ L2(P_X), which is equivalent to var(f(X)) = 0; moreover, var(f(X)) = 0 is equivalent to f ∈ ker(M_XX) = 𝒢_X^⊥. Thus, it follows that 𝒢_X = L2(P_X), and similarly 𝒢_Y = L2(P_Y). Since L2(P_X) = 𝒢_X ⊆ ℋ_X and L2(P_Y) = 𝒢_Y ⊆ ℋ_Y, we can use the inner products of ℋ_X and ℋ_Y for all functions in L2(P_X) and L2(P_Y). On the other hand, there is a unique cross-covariance operator M_YX : ℋ_X → ℋ_Y such that cov(f(Y), g(X)) = ⟨f, M_YX g⟩_{ℋ_Y} for any f ∈ ℋ_Y and g ∈ ℋ_X, and it can be represented as M_YX = M_YY^{1/2} R_YX M_XX^{1/2}, where R_YX is the correlation operator, which satisfies ‖R_YX‖_op ≤ 1 and ran(R_YX) ⊆ ran¯(M_YY) (Baker, 1973). By Li (2018), we have ran(M_YX) ⊆ ran¯(M_YY) = L2(P_Y). Hence, we can define domain-restricted covariance operators.

Definition 5

We define the restricted covariance operators Σ_XX : L2(P_X) → L2(P_X), Σ_YY : L2(P_Y) → L2(P_Y), and Σ_YX : L2(P_X) → L2(P_Y) to be identical to M_XX, M_YY, and M_YX on the restricted spaces:

$\langle f_1, \Sigma_{XX} f_2\rangle_{\mathcal{H}_X} = \langle f_1, M_{XX} f_2\rangle_{\mathcal{H}_X}, \quad \langle g_1, \Sigma_{YY} g_2\rangle_{\mathcal{H}_Y} = \langle g_1, M_{YY} g_2\rangle_{\mathcal{H}_Y}, \quad \langle g_1, \Sigma_{YX} f_1\rangle_{\mathcal{H}_Y} = \langle g_1, M_{YX} f_1\rangle_{\mathcal{H}_Y}.$

By the definitions of Σ_XX and Σ_YY, it is easily shown that ker(Σ_XX) = {0} and ker(Σ_YY) = {0}, so both operators are injective and their inverses are well defined on their ranges. Using these inverse operators, we define the conditional expectation operator.

Definition 6

We call the operator $\Sigma_{YY}^{-1}\Sigma_{YX} : L_2(P_X) \to L_2(P_Y)$ the conditional expectation operator and denote it by $E_{X|Y}$.

The reason why this operator is called the conditional expectation operator is as follows.

  • For any f ∈ L2(P_X), (E_{X|Y} f)(Y) = E(f(X)|Y).

  • For any g ∈ L2(P_Y), (E^*_{X|Y} g)(X) = E(g(Y)|X).

That is because, for any g ∈ L2(P_Y),

$\mathrm{cov}\big((E_{X|Y}f)(Y),\, g(Y)\big) = \langle E_{X|Y}f, \Sigma_{YY}\, g\rangle_{\mathcal{H}_Y} = \langle \Sigma_{YX} f, g\rangle_{\mathcal{H}_Y} = \mathrm{cov}\big(f(X), g(Y)\big),$

which means that (E_{X|Y}f)(Y) = E(f(X)|Y) + constant. Since E_{X|Y}f ∈ L2(P_Y), the constant should be equal to −E(E(f(X)|Y)) = −E(f(X)) = 0. Thus, it follows that (E_{X|Y}f)(Y) = E(f(X)|Y). With this fact, we can easily deduce that (E^*_{X|Y}g)(X) = E(g(Y)|X).

Proposition 3.

(Corollary 1 of Lee et al. (2013)) For any f, g ∈ L2(P_X),

$\langle g, E_{X|Y}^{*}E_{X|Y} f\rangle_{L_2(P_X)} = \mathrm{cov}\big[E(g(X)\mid Y),\, E(f(X)\mid Y)\big]. \quad (3.4)$

Furthermore, ‖E^*_{X|Y}E_{X|Y}‖_op ≤ 1; thus E^*_{X|Y}E_{X|Y} ∈ ℬ(L2(P_X)).

Proof

Equation (3.4) can be proved easily:

$\langle g, E_{X|Y}^{*}E_{X|Y} f\rangle_{L_2(P_X)} = \langle E_{X|Y}g, E_{X|Y}f\rangle_{L_2(P_Y)} = \mathrm{cov}\big[E(g(X)\mid Y),\, E(f(X)\mid Y)\big].$

By the Cauchy–Schwarz inequality, the right-hand side of (3.4) is less than or equal to var(E(g(X)|Y))^{1/2} var(E(f(X)|Y))^{1/2}, which in turn is less than or equal to var(g(X))^{1/2} var(f(X))^{1/2} = ‖g‖_{L2(P_X)} ‖f‖_{L2(P_X)}. Hence, it follows that |⟨g, E^*_{X|Y}E_{X|Y} f⟩_{L2(P_X)}| ≤ ‖g‖_{L2(P_X)} ‖f‖_{L2(P_X)}, which means that ‖E^*_{X|Y}E_{X|Y}‖_op ≤ 1.

In linear SDR, cov[E(X|Y)] plays a crucial role. Likewise, the operator E^*_{X|Y}E_{X|Y} has a key relationship with the central class 𝔖_{Y|X}.

Proposition 4.

(Theorem 5 of Lee et al. (2013)) If 𝔖_{Y|X} is complete, then

$\overline{\mathrm{ran}}(E_{X|Y}^{*}E_{X|Y}) = \mathfrak{S}_{Y|X}.$
Proof

A function f belongs to L2(P_X) ⊖ L2(P_Y) if and only if f ∈ L2(P_X) and E(f(X)|Y) = 0, which is equivalent to f ∈ ker(E_{X|Y}). It follows that ker(E_{X|Y}) = ℜ_{Y|X}^⊥. Since ker(E^*_{X|Y}E_{X|Y}) = ker(E_{X|Y}), we have ran¯(E^*_{X|Y}E_{X|Y}) = ℜ_{Y|X}. Considering the completeness of 𝔖_{Y|X}, ran¯(E^*_{X|Y}E_{X|Y}) = 𝔖_{Y|X}.

3.2. Generalized SAVE - GSAVE

Even without completeness, the GSIR is guaranteed to be unbiased, but it may not be exhaustive. Lee et al. (2013) proposed the GSAVE, which can recover functions beyond the regression class. In the GSAVE, the working space is not restricted to the centered space; thus, we work on the noncentered spaces L2(P_X) and L2(P_Y). Define the noncentered conditional mean operator E_{X|Y} : L2(P_X) → L2(P_Y) by, for any f ∈ L2(P_X) and g ∈ L2(P_Y),

$\langle g, E_{X|Y} f\rangle_{L_2(P_Y)} = E\big(g(Y)\, f(X)\big).$

From this definition, it is easily shown that (E_{X|Y}f)(Y) = E(f(X)|Y). Additionally, we introduce a new type of conditional variance operator.

Definition 7

For each y ∈ ΩY, the bilinear form

$L_2(P_X) \times L_2(P_X) \to \mathbb{R}, \quad (f, g) \mapsto \big[E_{X|Y}(fg) - (E_{X|Y}f)(E_{X|Y}g)\big](y)$

uniquely defines an operator V_{X|Y}(y) ∈ ℬ(L2(P_X)) via the Riesz representation. We call the random operator

$V_{X|Y} : \Omega_Y \to \mathcal{B}(L_2(P_X)), \quad y \mapsto V_{X|Y}(y)$

the heteroscedastic conditional variance operator given Y.

In summary, there uniquely exists an operator V_{X|Y}(Y) such that cov(f(X), g(X)|Y) = ⟨f, V_{X|Y}(Y)g⟩_{L2(P_X)}. Let the operator S be E[(V − V_{X|Y})²] : L2(P_X) → L2(P_X), where V : L2(P_X) → L2(P_X) is the (unconditional) covariance operator defined by

$\langle f, Vg\rangle_{L_2(P_X)} = \mathrm{cov}\big(f(X), g(X)\big).$

We now explain why the GSAVE works.

Proposition 5.

(Theorem 7 of Lee et al. (2013)) Suppose var(f(X)|Y) is nonrandom for any f ∈ 𝔖_{Y|X}^⊥. Then ran¯(S) ⊆ 𝔖_{Y|X}.

The estimator derived from ran¯(S) is called the generalized SAVE, or GSAVE. A notable result is that GSAVE can cover functions outside the regression class.

Proposition 6.

(Theorem 8 of Lee et al. (2013)) Suppose var(f(X)|Y) is nonrandom for any f ∈ 𝔖_{Y|X}^⊥; then ℜ_{Y|X} ⊆ ran¯(S).

Combining the above propositions, the following relation holds:

$\mathfrak{R}_{Y|X} \subseteq \overline{\mathrm{ran}}(S) \subseteq \mathfrak{S}_{Y|X}.$
4. Nonlinear SDR with a scalar response

In this section, we illustrate the extension of sliced inverse regression obtained with the kernel trick introduced in Section 2. The kernel trick is applied to the predictor space, and then the relation between Y and the mapped predictor κ(·, X) is used.

4.1. Kernel sliced inverse regression

Let Ω_Y ⊆ ℝ and Ω_X ⊆ ℝ^p. The kernel sliced inverse regression (KSIR) applies the reproducing kernel Hilbert space (RKHS) to the SDR method. Let ℋ_X be an RKHS generated by a positive definite kernel κ_X. The KSIR assumes the population model:

$Y \perp\!\!\!\perp X \mid h_1(X), \ldots, h_d(X),$

where {h_1, …, h_d} ⊂ ℋ_X. Thus, the goal of KSIR is to estimate {h_1, …, h_d}, which are called the e.d.r. directions.

Definition 8

If Y ⫫ X | h_1(X), …, h_d(X), then {h_1, …, h_d} are called the feature e.d.r. directions, and the subspace span(h_1, …, h_d) ⊆ ℋ_X is called the feature e.d.r. subspace.

Note that, by the reproducing property of the kernel function, h_i(X) = ⟨h_i, κ_X(·, X)⟩_{ℋ_X}, so the model can be represented as:

$Y \perp\!\!\!\perp X \mid \langle h_1, \kappa_X(\cdot, X)\rangle_{\mathcal{H}_X}, \ldots, \langle h_d, \kappa_X(\cdot, X)\rangle_{\mathcal{H}_X},$

which implies that KSIR is an extension of linear SDR in which the inner product in ℝ^p is replaced by that of ℋ_X. For further discussion, we assume that E(κ_X(X, X)) < ∞, which ensures the following:

  • There is a unique element E(κ_X(·, X)) ∈ ℋ_X such that, for any f ∈ ℋ_X, E(f(X)) = ⟨f, E(κ_X(·, X))⟩_{ℋ_X}.

  • There is a unique bounded linear operator Σ_XX ∈ ℬ(ℋ_X) such that cov(f(X), g(X)) = ⟨f, Σ_XX g⟩_{ℋ_X} for any f, g ∈ ℋ_X.

For the success of KSIR, Yeh et al. (2008) proposed a sufficient condition, which is an extension of the linear conditional mean assumption in linear SDR.

Definition 9

For {h_1, …, h_d} ⊂ ℋ_X, the linear design condition is said to hold if, for any f ∈ ℋ_X,

$E\big(f(X) \mid h_1(X), \ldots, h_d(X)\big) = c_0 + c_1 h_1(X) + \cdots + c_d h_d(X),$

for some constants c_0, c_1, …, c_d.

Under this condition, a form of unbiasedness (different from the one defined in Section 3) can be established at the population level.

Proposition 7.

(Theorem 1 of Yeh et al. (2008)) Assume the existence of a feature e.d.r. subspace span(h_1, …, h_d) and that the linear design condition holds. Then the central inverse regression vector falls into the subspace span(Σ_XX h_1, …, Σ_XX h_d); that is, E(κ_X(·, X)|Y) − E(κ_X(·, X)) ∈ span(Σ_XX h_1, …, Σ_XX h_d).

In real applications, it is difficult to deal with E(κ_X(·, X)|Y) directly. Instead, a discretized version of Y is adopted, E(κ_X(·, X)|Y ∈ J_j), where J_1, …, J_h is a partition of Ω_Y with P(Y ∈ J_j) > 0 for all j = 1, …, h. It can be shown that E(κ_X(·, X)|Ỹ) − E(κ_X(·, X)) ∈ span(Σ_XX h_1, …, Σ_XX h_d), where Ỹ = Σ_{j=1}^h j I(Y ∈ J_j). It follows that the closure of the range of the covariance operator of E(κ_X(·, X)|Ỹ) is included in span(Σ_XX h_1, …, Σ_XX h_d). We can now construct the covariance operator of E(κ_X(·, X)|Ỹ) as

$\Sigma_{E(\kappa_X(\cdot,X)\mid\tilde{Y})} = \sum_{j=1}^{h} P(Y \in J_j)\,\big[E(\kappa_X(\cdot,X)\mid Y\in J_j) - E(\kappa_X(\cdot,X))\big] \otimes \big[E(\kappa_X(\cdot,X)\mid Y\in J_j) - E(\kappa_X(\cdot,X))\big].$

Therefore, it is necessary to solve the eigenvalue problem whose kth eigenfunction is obtained by

$\max_{a\in\mathcal{H}_X}\ \langle a, \Sigma_{E(\kappa_X(\cdot,X)\mid\tilde{Y})}\, a\rangle_{\mathcal{H}_X}$

subject to ⟨a, Σ_XX a⟩_{ℋ_X} = 1 and ⟨a, Σ_XX a_j⟩_{ℋ_X} = 0 for j = 1, …, k − 1. Let us place the KSIR within the framework proposed by Lee et al. (2013). Let span(h_1, …, h_d) be the minimal feature e.d.r. subspace. Then the central σ-field is σ(h_1(X), …, h_d(X)), and the feature e.d.r. subspace, which contains only linear combinations of h_1, …, h_d, is a subset of the central class 𝔖_{Y|X}. Moreover, for any measurable function g, the function g(h_1(X), …, h_d(X)) also belongs to the central class, so the central class is in general strictly larger than the feature e.d.r. subspace. Therefore, it can be concluded that the KSIR is unbiased at the population level, but it cannot be exhaustive for 𝔖_{Y|X}.

5. Nonlinear SDR in Euclidean space

When the response variable is not a real random variable, slicing the response becomes inapplicable. Moreover, as the data structure grows in complexity, there is a need for generalized theories applicable to Hilbert space random elements. GSIR and GSAVE, introduced in Section 3, represent nonlinear SDR methods within a generic Hilbert space. In this section, we introduce another type of nonlinear method, drawing on the concept of canonical correlation analysis. We will briefly review Kernel Canonical Correlation Analysis (KCCA), as detailed in Fukumizu et al. (2007).

Let ℋ_X and ℋ_Y be the reproducing kernel Hilbert spaces generated by positive definite kernels κ_X and κ_Y. For the existence of the mean elements and covariance operators, it is assumed that E(κ_X(X, X)) < ∞ and E(κ_Y(Y, Y)) < ∞. Then we have E(f²(X)) = E(⟨f, κ_X(·, X)⟩²_{ℋ_X}) ≤ ‖f‖²_{ℋ_X} E(κ_X(X, X)) < ∞ for any f ∈ ℋ_X, and similarly E(g²(Y)) < ∞ for any g ∈ ℋ_Y. More explicitly, there exist unique covariance operators M_XX, M_YY, and M_XY such that cov(f(X), g(X)) = ⟨f, M_XX g⟩_{ℋ_X}, cov(f(Y), g(Y)) = ⟨f, M_YY g⟩_{ℋ_Y}, and cov(f(X), g(Y)) = ⟨f, M_XY g⟩_{ℋ_X}. To connect with the zero-mean, square-integrable spaces L2(P_X) and L2(P_Y), ℋ_X and ℋ_Y are redefined as ran¯(M_XX) and ran¯(M_YY). Hence, it follows that ℋ_X ⊆ L2(P_X) and ℋ_Y ⊆ L2(P_Y), and the operators Σ_XX, Σ_YY, and Σ_XY on the centered spaces correspond to M_XX, M_YY, and M_XY on the original spaces. With this setup, KCCA seeks to solve the following problem:

$\max_{f \in \mathcal{H}_X,\, g \in \mathcal{H}_Y,\, f \neq 0,\, g \neq 0}\ \frac{\mathrm{cov}(f(X), g(Y))}{\mathrm{var}(f(X))^{1/2}\,\mathrm{var}(g(Y))^{1/2}}. \quad (5.1)$

At the population level, (5.1) is reformulated in RKHS form as

$\max_{f \in \mathcal{H}_X,\, g \in \mathcal{H}_Y,\, f \neq 0,\, g \neq 0}\ \frac{\langle f, \Sigma_{XY}\, g\rangle_{\mathcal{H}_X}}{\langle f, \Sigma_{XX} f\rangle_{\mathcal{H}_X}^{1/2}\,\langle g, \Sigma_{YY}\, g\rangle_{\mathcal{H}_Y}^{1/2}}, \quad (5.2)$

or equivalently,

$\sup_{f \in \mathcal{H}_X,\, g \in \mathcal{H}_Y}\ \langle f, \Sigma_{XY}\, g\rangle_{\mathcal{H}_X} \quad \text{subject to} \quad \langle f, \Sigma_{XX} f\rangle_{\mathcal{H}_X} = 1,\ \langle g, \Sigma_{YY}\, g\rangle_{\mathcal{H}_Y} = 1. \quad (5.3)$

As with classical CCA, the solution of (5.3) is given by the eigenfunctions corresponding to the largest eigenvalue of the following generalized eigenvalue problem:

$\Sigma_{YX} f = \lambda_1 \Sigma_{YY}\, g, \quad \Sigma_{XY}\, g = \lambda_1 \Sigma_{XX} f. \quad (5.4)$

As mentioned, Σ_XY can be represented as Σ_XY = Σ_XX^{1/2} V_XY Σ_YY^{1/2}, where V_XY is the correlation operator, whose norm is less than or equal to 1. Let φ, ψ be the unit functions of V_XY corresponding to the largest singular value; that is,

$\langle \varphi, V_{XY}\psi\rangle = \max_{\|f\|_{\mathcal{H}_X}=1,\ \|g\|_{\mathcal{H}_Y}=1}\ \langle f, V_{XY}\, g\rangle_{\mathcal{H}_X}.$

From (5.4), it is easy to see that the solution of the KCCA is given by the inverse images

$f = \Sigma_{XX}^{-1/2}\varphi, \quad g = \Sigma_{YY}^{-1/2}\psi.$

At the sample level, we briefly introduce the estimation procedure. First, let φ_n, ψ_n be the unit functions corresponding to the largest singular value of the operator

$V_{n,XY} = (\Sigma_{n,XX} + \epsilon_n I)^{-1/2}\,\Sigma_{n,XY}\,(\Sigma_{n,YY} + \epsilon_n I)^{-1/2}.$

The empirical estimators fn and gn of KCCA are

$f_n = (\Sigma_{n,XX} + \epsilon_n I)^{-1/2}\varphi_n, \quad g_n = (\Sigma_{n,YY} + \epsilon_n I)^{-1/2}\psi_n.$

In practice, when the inverse of an operator is needed, ε_n I is added to make the operator nonsingular. Moreover, this term acts as a regularizer so that the resulting function is smooth. Fukumizu et al. (2007) show the convergence of f_n and g_n under a condition on the decay rate of ε_n. The KCCA can be interpreted in terms of the conditional expectation operator E_{X|Y} = Σ_YY^{-1}Σ_YX. In fact, the solution f, g of KCCA satisfies the following relationship:

$\sup_{\|u\|=\|v\|=1}\langle u, V_{XY}\, v\rangle_{\mathcal{H}_X} = \langle \varphi, V_{XY}\psi\rangle_{\mathcal{H}_X} = \langle \Sigma_{XX}^{1/2} f, V_{XY}\Sigma_{YY}^{1/2} g\rangle_{\mathcal{H}_X} = \langle f, \Sigma_{XX}^{-1}\Sigma_{XY}\, g\rangle_{L_2(P_X)} = \langle f, E_{X|Y}^{*}\, g\rangle_{L_2(P_X)}.$

It follows that f, g is the solution of the following problem:

$\max\ \langle E_{X|Y} u, v\rangle_{L_2(P_Y)} \quad \text{subject to} \quad \langle u, \Sigma_{XX} u\rangle_{\mathcal{H}_X} = \langle v, \Sigma_{YY} v\rangle_{\mathcal{H}_Y} = 1.$

Since, at the maximum, v is proportional to E_{X|Y}u, and since ⟨u, Σ_XX u⟩_{ℋ_X} = ⟨u, u⟩_{L2(P_X)} and ⟨v, Σ_YY v⟩_{ℋ_Y} = ⟨v, v⟩_{L2(P_Y)}, the above problem is reformulated as

$\max_{u}\ \langle u, E_{X|Y}^{*}E_{X|Y}\, u\rangle_{L_2(P_X)} \quad \text{subject to} \quad \langle u, u\rangle_{L_2(P_X)} = 1.$

To achieve the maximum, u must be in ran(E^*_{X|Y}E_{X|Y}). Consequently, f ∈ ran¯(E^*_{X|Y}E_{X|Y}) ⊆ 𝔖_{Y|X}.

6. Nonlinear SDR for functional data

This section covers nonlinear SDR in the case where both Y and X are random functions. We briefly introduce the functional GSIR (f-GSIR) developed by Li and Song (2017). Let T_X and T_Y be subsets of ℝ^{k_1} and ℝ^{k_2}. Let Ω_X be a Hilbert space of functions from T_X to ℝ^p and Ω_Y be a Hilbert space of functions from T_Y to ℝ^q; note that Ω_X and Ω_Y are both Hilbert spaces of functions. Since a similar procedure applies to the Y case, we describe only the X case. An f ∈ Ω_X can be represented as a vector of functions, f = (f_1, …, f_p), where each f_i is a function from T_X to ℝ. For simplicity, we assume that each f_i belongs to ℋ_0, a Hilbert space of functions from T_X to ℝ, for all i = 1, …, p. Let ⟨·, ·⟩_0 be the inner product of ℋ_0. We define ℋ as the p-fold Cartesian product of ℋ_0, that is, ℋ = ℋ_0 × ··· × ℋ_0. For f, g ∈ ℋ, the inner product of ℋ is defined as ⟨f, g⟩ = Σ_{j=1}^p ⟨f_j, g_j⟩_0. Note that ℋ_0 can be constructed as an RKHS with a kernel function κ_{T_X} : T_X × T_X → ℝ; hence, we call Ω_X the first-level RKHS. Naturally, we call ℋ_X the second-level RKHS. Li and Song (2017) constrain the kernel function κ_X to be connected with the inner product of ℋ.

Definition 10

We say that a positive definite kernel κ : ℋ × ℋ → ℝ is induced by ⟨·, ·⟩ if there is a function ρ : ℝ³ → ℝ_+ such that, for any φ, ψ ∈ ℋ,

$\kappa(\varphi, \psi) = \rho\big(\langle\varphi,\varphi\rangle, \langle\varphi,\psi\rangle, \langle\psi,\psi\rangle\big).$
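As an illustration of Definition 10, the following is a minimal Python sketch of a kernel induced by an inner product, assuming a Gaussian-type choice ρ(a, b, c) = exp(−γ(a − 2b + c)) and assuming the first-level inner products ⟨f_i, f_j⟩ have already been computed (for instance, from coefficients in an orthonormal basis); the names `induced_gram` and `coef` are illustrative and not taken from Li and Song (2017).

```python
import numpy as np

def induced_gram(inner, gamma=1.0):
    """Second-level Gram matrix K[i, j] = kappa(f_i, f_j) for a kernel induced
    by the first-level inner product, using rho(a, b, c) = exp(-gamma*(a - 2b + c)),
    i.e. kappa(f, g) = exp(-gamma * ||f - g||^2).

    inner[i, j] = <f_i, f_j>, the n x n matrix of first-level inner products.
    """
    sq_dist = np.diag(inner)[:, None] - 2 * inner + np.diag(inner)[None, :]
    return np.exp(-gamma * sq_dist)

# example: functional observations expanded in an orthonormal basis, so that
# <f_i, f_j> is the Euclidean inner product of the coefficient vectors
coef = np.random.default_rng(1).standard_normal((20, 5))  # 20 curves, 5 basis coefficients
inner = coef @ coef.T
KX = induced_gram(inner, gamma=0.5)
```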

We assume that κ_X is induced by ⟨·, ·⟩ and let ℋ_X denote the RKHS generated by κ_X. In the same manner, κ_Y and ℋ_Y can be constructed. We additionally assume that

  • ℋ_X and ℋ_Y are dense subsets of L2(P_X) and L2(P_Y) modulo constants, respectively.

  • There are constants C_1 > 0 and C_2 > 0 such that var(f(X)) ≤ C_1‖f‖²_{ℋ_X} and var(g(Y)) ≤ C_2‖g‖²_{ℋ_Y} for any f ∈ ℋ_X and g ∈ ℋ_Y.

Then there exists a unique element E(κ_X(·, X)) ∈ ℋ_X such that E(f(X)) = ⟨f, E(κ_X(·, X))⟩_{ℋ_X} for any f ∈ ℋ_X, and a unique element E(κ_Y(·, Y)) ∈ ℋ_Y such that E(g(Y)) = ⟨g, E(κ_Y(·, Y))⟩_{ℋ_Y} for any g ∈ ℋ_Y. Furthermore, there exist unique covariance operators M_XX, M_YY, and M_XY such that cov(f(X), g(X)) = ⟨f, M_XX g⟩_{ℋ_X}, cov(f(Y), g(Y)) = ⟨f, M_YY g⟩_{ℋ_Y}, and cov(f(X), g(Y)) = ⟨f, M_XY g⟩_{ℋ_X}. Let ℋ_X^0 = span¯{κ_X(·, x) − E(κ_X(·, X)) : x ∈ Ω_X} and ℋ_Y^0 = span¯{κ_Y(·, y) − E(κ_Y(·, Y)) : y ∈ Ω_Y}. Then it can be shown that ran¯(M_XX) = ℋ_X^0 and ran¯(M_YY) = ℋ_Y^0. We denote the Moore–Penrose inverse of an operator A by A^†. We call M_XX^† M_XY the regression operator and denote it by R_YX. The next proposition is the theoretical basis of the functional GSIR (f-GSIR).

Proposition 8.

(Theorem 1 of Li and Song (2017)) Under some mild conditions, we have ran¯(R^*_YX) ⊆ cl(M_XX 𝔖_{Y|X}), where cl(·) denotes the closure. Furthermore, if 𝔖_{Y|X} is complete, then ran¯(R^*_YX) = cl(M_XX 𝔖_{Y|X}).

Since ran¯(R^*_YX) = ran¯(R^*_YX A R_YX) for any invertible operator A : ℋ_Y^0 → ℋ_Y^0, Li and Song (2017) recommend choosing A to be M_YY so that the inversion of M_YY is avoided and R^*_YX A R_YX reduces to an expression involving M_XY M_YX. We call the following entire procedure f-GSIR:

Let f_1, …, f_d be the solutions to the following sequential eigenvalue problem: for each k = 1, …, d,

$\max_{f\in\mathcal{H}_X^{0}}\ \langle f, M_{XX}^{\dagger} M_{XY} M_{YX} M_{XX}^{\dagger} f\rangle_{\mathcal{H}_X}$

subject to ⟨f, f⟩_{ℋ_X} = 1 and ⟨f, f_i⟩_{ℋ_X} = 0 for i = 1, …, k − 1. Corollary 1 of Li and Song (2017) shows that f_1(X), …, f_d(X) generate a subspace of 𝔖_{Y|X}; if 𝔖_{Y|X} is complete, then they generate 𝔖_{Y|X} itself. Note that f-GSIR is an extension of GSIR in that it does not restrict the space to L2(P_X) or ℋ_X^0, and it applies the RKHS construction to Ω_X to overcome the problem of non-densely observed functional data.

7. Implementation

7.1. Coordinate representation

Let ℋ_i be a finite-dimensional Hilbert space, B_i = {b_1^i, …, b_{m_i}^i} be a basis of ℋ_i, and G_{B_i} be the Gram matrix with (G_{B_i})_{jk} = ⟨b_j^i, b_k^i⟩_i, for i = 1, 2, 3. We first describe the coordinate representation of an element. If f = Σ_j f_j b_j^i, then the coordinate representation of f is [f]_{B_i} = (f_1, …, f_{m_i})^⊤. For A : ℋ_1 → ℋ_2, we define the coordinate representation of A as _{B_2}[A]_{B_1} = [[A b_1^1]_{B_2}, …, [A b_{m_1}^1]_{B_2}]. It follows that [A f]_{B_2} = _{B_2}[A]_{B_1}[f]_{B_1}. Let C be an operator from ℋ_2 to ℋ_3. It can be shown that the coordinate representation of CA is _{B_3}[CA]_{B_1} = _{B_3}[C]_{B_2}\,_{B_2}[A]_{B_1}.

Furthermore, we can calculate the inner product as ⟨f, g⟩_1 = Σ_{ij} f_i g_j ⟨b_i^1, b_j^1⟩_1 = [f]_{B_1}^⊤ G_{B_1} [g]_{B_1}. The tensor product can also be expressed in coordinates. If f ∈ ℋ_1 and g ∈ ℋ_2, then,

$[(g \otimes f)\, b_i^1]_{B_2} = \langle f, b_i^1\rangle_1 [g]_{B_2} = [g]_{B_2}[f]_{B_1}^{\top} G_{B_1} e_i,$

where e_i is the ith standard basis vector of ℝ^{m_1}. From this, we have

$ {}_{B_2}[g \otimes f]_{B_1} = \big[[(g \otimes f)\, b_1^1]_{B_2}, \ldots, [(g \otimes f)\, b_{m_1}^1]_{B_2}\big] = [g]_{B_2}[f]_{B_1}^{\top} G_{B_1}(e_1, \ldots, e_{m_1}) = [g]_{B_2}[f]_{B_1}^{\top} G_{B_1}.$
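As a quick numerical check of the tensor-product formula above, the following Python sketch verifies that the matrix [g]_{B_2}[f]_{B_1}^⊤ G_{B_1} acts on coordinate vectors exactly as g ⊗ f acts on functions, using synthetic coordinate vectors and an arbitrary positive definite Gram matrix; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2 = 4, 3

# Gram matrix of a (non-orthonormal) basis B1: any positive definite matrix works here
A = rng.standard_normal((m1, m1))
G1 = A @ A.T

f = rng.standard_normal(m1)   # [f]_{B1}
h = rng.standard_normal(m1)   # [h]_{B1}
g = rng.standard_normal(m2)   # [g]_{B2}

# (g (x) f) h = <f, h> g, and in coordinates <f, h> = [f]^T G1 [h]
lhs = (f @ G1 @ h) * g        # coordinates of (g (x) f) h computed directly
T = np.outer(g, f) @ G1       # claimed coordinate representation of g (x) f
rhs = T @ h
assert np.allclose(lhs, rhs)  # the two agree
```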

7.2. RKHS by coordinate representation

In this subsection, we describe the implementation of RKHS-based methods using the coordinate representation. Let X_1, …, X_n be an i.i.d. sample of X. Let κ_X be a positive definite kernel and let ℋ_X be span{κ_X(·, X_i) : i = 1, …, n}. We estimate the covariance operators at the sample level by replacing the expectation E with the empirical expectation E_n:

$\hat{\Sigma}_{XX} = E_n\big[\kappa_X(\cdot, X) - E_n \kappa_X(\cdot, X)\big] \otimes \big[\kappa_X(\cdot, X) - E_n \kappa_X(\cdot, X)\big].$

Hence, the subspace ran¯(Σ̂_XX) is spanned by

$B_X = \{\kappa_X(\cdot, X_i) - E_n\kappa_X(\cdot, X) : i = 1, \ldots, n\} = \{b_1^{(X)}, \ldots, b_n^{(X)}\}.$

First of all, the Gram matrix G_{B_X} with (G_{B_X})_{ij} = ⟨b_i^{(X)}, b_j^{(X)}⟩_{ℋ_X} is considered.

$\langle b_i^{(X)}, b_j^{(X)}\rangle_{\mathcal{H}_X} = \Big\langle \kappa_X(\cdot, X_i) - n^{-1}\sum_{k=1}^{n}\kappa_X(\cdot, X_k),\ \kappa_X(\cdot, X_j) - n^{-1}\sum_{\ell=1}^{n}\kappa_X(\cdot, X_\ell)\Big\rangle_{\mathcal{H}_X} = \kappa_X(X_i, X_j) - n^{-1}\sum_{\ell=1}^{n}\kappa_X(X_i, X_\ell) - n^{-1}\sum_{k=1}^{n}\kappa_X(X_j, X_k) + n^{-2}\sum_{k=1}^{n}\sum_{\ell=1}^{n}\kappa_X(X_k, X_\ell).$

Let K_X denote the n × n matrix whose (i, j)th entry is κ_X(X_i, X_j), and let Q denote the projection matrix I_n − n^{-1}1_n 1_n^⊤. Then it is easily shown that G_{B_X} = Q K_X Q. We are now ready to represent Σ̂_XX in coordinate form.

$[\hat{\Sigma}_{XX} b_i^{(X)}]_{B_X} = n^{-1}\Big[\Big(\sum_{k=1}^{n} b_k^{(X)} \otimes b_k^{(X)}\Big) b_i^{(X)}\Big]_{B_X} = n^{-1}\Big(\sum_{k=1}^{n}[b_k^{(X)}]_{B_X}[b_k^{(X)}]_{B_X}^{\top} G_{B_X}\Big)[b_i^{(X)}]_{B_X} = n^{-1}\Big(\sum_{k=1}^{n} e_k e_k^{\top}\Big) Q K_X Q\, e_i = n^{-1} Q K_X Q\, e_i.$

Hence, it follows that

$ {}_{B_X}[\hat{\Sigma}_{XX}]_{B_X} = \big[[\hat{\Sigma}_{XX} b_1^{(X)}]_{B_X}, \ldots, [\hat{\Sigma}_{XX} b_n^{(X)}]_{B_X}\big] = n^{-1} Q K_X Q (e_1, \ldots, e_n) = n^{-1} Q K_X Q = n^{-1} G_{B_X}.$

A similar development can be carried out for Y. That is, ℋ_Y is the space spanned by

$B_Y = \{\kappa_Y(\cdot, Y_i) - E_n\kappa_Y(\cdot, Y) : i = 1, \ldots, n\} = \{b_i^{(Y)} : i = 1, \ldots, n\}.$

The coordinate representation of Σ̂_YY is

$ {}_{B_Y}[\hat{\Sigma}_{YY}]_{B_Y} = n^{-1} Q K_Y Q = n^{-1} G_{B_Y}.$

In a similar way, we derive the coordinate representations of Σ̂_XY and Σ̂_YX as

$ {}_{B_X}[\hat{\Sigma}_{XY}]_{B_Y} = n^{-1} G_{B_Y}, \quad {}_{B_Y}[\hat{\Sigma}_{YX}]_{B_X} = n^{-1} G_{B_X}.$

From now on, when the domain and range of an operator are clear from the context, we abbreviate _{B_2}[A]_{B_1} to [A].
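As a small illustration of this subsection, the following Python sketch computes the centered Gram matrices G_{B_X} = Q K_X Q and G_{B_Y} = Q K_Y Q from data, assuming Gaussian kernels for both X and Y; the kernel choice, bandwidth, and function names are illustrative and not prescribed by the paper.

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    """K[i, j] = kappa(X_i, X_j) for a Gaussian kernel."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq)

def centered_gram(K):
    """G_B = Q K Q with Q = I - (1/n) 1 1^T: the Gram matrix of the centered
    basis {kappa(., X_i) - E_n kappa(., X)}."""
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)

GX = centered_gram(gaussian_gram(X))
GY = centered_gram(gaussian_gram(Y[:, None]))
# in these coordinates, [Sigma_XX] = GX / n and [Sigma_YY] = GY / n
```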

7.3. GSIR

In the context of GSIR, the inner product in L2(P_X) is used, which should also be expressed in coordinates. Since ⟨f, g⟩_{L2(P_X)} = ⟨f, Σ_XX g⟩_{ℋ_X}, its coordinate representation is [f]^⊤ G_{B_X}[Σ_XX][g] = n^{-1}[f]^⊤ G_{B_X}^2 [g]. The conditional expectation operator E_{X|Y} is Σ_YY^{-1}Σ_YX. The problem is that, although Σ_YY is invertible analytically, its coordinate representation may be singular. To avoid this singularity issue, ([A] + ε_n I_n)^{-1} is often used in place of [A]^{-1}. Hence, we let the coordinate form of E_{X|Y} be

$[\hat{E}_{X|Y}] = [\hat{\Sigma}_{YY}]^{-1}[\hat{\Sigma}_{YX}] = (G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_X}.$

Consequently, it follows that

$\langle f, \hat{E}_{X|Y}^{*}\hat{E}_{X|Y} f\rangle_{L_2(P_X)} = \langle \hat{E}_{X|Y} f, \hat{E}_{X|Y} f\rangle_{L_2(P_Y)} = n^{-1}[f]^{\top}[\hat{E}_{X|Y}]^{\top} G_{B_Y}^{2}[\hat{E}_{X|Y}][f] = n^{-1}[f]^{\top} G_{B_X}(G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_Y}^{2}(G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_X}[f].$

The eigenvalue problem that GSIR solves is as follows:

$\max_{v \in \mathbb{R}^n}\ n^{-1} v^{\top} G_{B_X}(G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_Y}^{2}(G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_X} v, \quad \text{subject to } \langle v, v\rangle_{L_2(P_X)} = n^{-1} v^{\top} G_{B_X}^{2} v = 1.$

Transforming v to u = n^{−1/2}G_{B_X} v, the above problem becomes

$\max_{u \in \mathbb{R}^n}\ u^{\top}(G_{B_X} + \epsilon_n I_n)^{-1} G_{B_X}(G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_Y}^{2}(G_{B_Y} + \epsilon_n I_n)^{-1} G_{B_X}(G_{B_X} + \epsilon_n I_n)^{-1} u,$

subject to ‖u‖ = 1. Let the first d eigenvectors of this problem be {u_1, …, u_d}; then the ith solution of the original problem is v_i = n^{1/2}(G_{B_X} + ε_n I_n)^{-1} u_i for i = 1, …, d. From this result, the ith sufficient predictor is ĥ^{(i)}(x) = Σ_{j=1}^n v_{ij} b_j^{(X)}(x) for i = 1, …, d.
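Putting the pieces together, here is a minimal Python sketch of the GSIR eigenvalue problem in coordinate form, assuming the centered Gram matrices from Section 7.2 and a fixed regularization constant ε_n; the function name, the choice of ε_n, and the evaluation note at the end are illustrative.

```python
import numpy as np

def gsir(GX, GY, d, eps=1e-3):
    """GSIR in coordinate form (a sketch following Section 7.3).

    GX, GY : centered Gram matrices Q K Q for the predictor and response kernels.
    Returns the coefficient vectors v_1, ..., v_d of the sufficient predictors
    h_i(.) = sum_j v_ij [kappa_X(., X_j) - E_n kappa_X(., X)].
    """
    n = GX.shape[0]
    RX = np.linalg.inv(GX + eps * np.eye(n))    # (G_BX + eps I)^{-1}
    RY = np.linalg.inv(GY + eps * np.eye(n))    # (G_BY + eps I)^{-1}
    M = RX @ GX @ RY @ GY @ GY @ RY @ GX @ RX   # target matrix of the eigenproblem
    M = (M + M.T) / 2                           # symmetrize for numerical stability
    vals, vecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    U = vecs[:, ::-1][:, :d]                    # top-d eigenvectors u_1, ..., u_d
    return np.sqrt(n) * RX @ U                  # v_i = n^{1/2} (G_BX + eps I)^{-1} u_i

# the sufficient predictors at the training points, up to an additive constant,
# are the columns of GX @ gsir(GX, GY, d) since b_j evaluated at the sample is
# recovered from the (centered) Gram matrix
```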

7.4. KSIR

In the implementation of KSIR, Σ_{E(κ_X(·,X)|Ỹ)} is used, and it has an explicit analytical form:

$\Sigma_{E(\kappa_X(\cdot,X)\mid\tilde{Y})} = \sum_{j=1}^{h} P(Y\in J_j)\, \Sigma_{E(\kappa_X(\cdot,X)\mid Y\in J_j)} = \sum_{j=1}^{h} P(Y\in J_j)\big[E\big(\kappa_X(\cdot,X) - E(\kappa_X(\cdot,X)) \mid Y\in J_j\big)\big] \otimes \big[E\big(\kappa_X(\cdot,X) - E(\kappa_X(\cdot,X)) \mid Y\in J_j\big)\big].$

To implement the KSIR, P(Y ∈ J_j), E(κ_X(·, X)), and E(κ_X(·, X) − E(κ_X(·, X)) | Y ∈ J_j) should be estimated; their estimates are as follows:

  • $\hat{P}(Y \in J_j) = E_n\big(I(Y \in J_j)\big) = n^{-1}\sum_{i=1}^{n} I(Y_i \in J_j) = n^{-1} n_j = \pi_j$

  • $\hat{E}(\kappa_X(\cdot,X)) = E_n\big(\kappa_X(\cdot,X)\big) = n^{-1}\sum_{i=1}^{n}\kappa_X(\cdot, X_i).$

  • $\hat{E}\big(\kappa_X(\cdot,X) - E(\kappa_X(\cdot,X)) \mid Y \in J_j\big) = n_j^{-1}\sum_{i: Y_i \in J_j}\big(\kappa_X(\cdot,X_i) - \hat{E}(\kappa_X(\cdot,X))\big)$

Thus, Σ̂_{E(κ_X(·,X)|Y∈J_j)} = n_j^{-2}(Σ_{k:Y_k∈J_j} b_k^{(X)}) ⊗ (Σ_{ℓ:Y_ℓ∈J_j} b_ℓ^{(X)}) = n_j^{-2} Σ_{k:Y_k∈J_j} Σ_{ℓ:Y_ℓ∈J_j} b_k^{(X)} ⊗ b_ℓ^{(X)}. Hence, its coordinate representation is

$[\hat{\Sigma}_{E(\kappa_X(\cdot,X)\mid Y\in J_j)}] = n_j^{-2}\sum_{k:Y_k\in J_j}\sum_{\ell:Y_\ell\in J_j} e_k e_\ell^{\top} G_{B_X} = n_j^{-2}\Big(\sum_{k:Y_k\in J_j} e_k\Big)\Big(\sum_{k:Y_k\in J_j} e_k\Big)^{\top} G_{B_X},$

from which it follows that

$[\hat{\Sigma}_{E(\kappa_X(\cdot,X)\mid\tilde{Y})}] = \Big(\sum_{j=1}^{h}\pi_j n_j^{-2}\Big(\sum_{k:Y_k\in J_j} e_k\Big)\Big(\sum_{k:Y_k\in J_j} e_k\Big)^{\top}\Big) G_{B_X}.$

Let A denote $\sum_{j=1}^{h}\pi_j n_j^{-2}\big(\sum_{k:Y_k\in J_j} e_k\big)\big(\sum_{k:Y_k\in J_j} e_k\big)^{\top}$. Note that A is symmetric.

The eigenvalue problem of the KSIR is max_{v∈ℋ_X} ⟨v, Σ_{E(κ_X(·,X)|Ỹ)} v⟩_{ℋ_X} subject to ⟨v, Σ_XX v⟩_{ℋ_X} = 1. The coordinate form of this problem is

$\max_{[v]\in\mathbb{R}^n}\ [v]^{\top} G_{B_X} A\, G_{B_X}[v], \quad \text{subject to } n^{-1}[v]^{\top} G_{B_X}^{2}[v] = 1.$

Let u = n^{−1/2}G_{B_X}[v]. Then, the above problem is transformed into the following problem:

$\max_{u\in\mathbb{R}^n}\ u^{\top}(G_{B_X} + \epsilon_n I_n)^{-1} G_{B_X} A\, G_{B_X}(G_{B_X} + \epsilon_n I_n)^{-1} u, \quad \text{subject to } \|u\| = 1.$

Let the first d eigenvectors of this problem be {u_1, …, u_d}, from which we can derive that the ith solution of the original problem is v_i = n^{1/2}(G_{B_X} + ε_n I_n)^{-1} u_i. Therefore, the ith sufficient predictor is ĥ_i(x) = Σ_{j=1}^n v_{ij} b_j^{(X)}(x).
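The following Python sketch mirrors the KSIR steps above, assuming a quantile-based slicing of the scalar response, a fixed regularization constant, and the centered Gram matrix from Section 7.2; the slicing scheme and all names are illustrative choices.

```python
import numpy as np

def ksir(GX, y, d, n_slices=5, eps=1e-3):
    """KSIR in coordinate form (a sketch following Section 7.4).

    GX : centered Gram matrix Q K_X Q of the predictor kernel.
    y  : scalar responses, sliced here into n_slices quantile intervals.
    Returns the coefficient vectors v_1, ..., v_d of the sufficient predictors.
    """
    n = GX.shape[0]
    # slice the response by its sample quantiles
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)

    # A = sum_j pi_j n_j^{-2} (sum_{k in slice j} e_k)(sum_{k in slice j} e_k)^T
    A = np.zeros((n, n))
    for j in range(n_slices):
        idx = (labels == j)
        nj = idx.sum()
        if nj == 0:
            continue
        ej = idx.astype(float)                  # indicator vector of slice j
        A += (nj / n) * np.outer(ej, ej) / nj**2

    R = np.linalg.inv(GX + eps * np.eye(n))     # (G_BX + eps I)^{-1}
    M = R @ GX @ A @ GX @ R                     # matrix of the transformed eigenproblem
    M = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(M)
    U = vecs[:, ::-1][:, :d]                    # top-d eigenvectors
    return np.sqrt(n) * R @ U                   # v_i = n^{1/2} (G_BX + eps I)^{-1} u_i
```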

7.5. KCCA

(5.3) says that the KCCA solves the following generalized singular value problem at the population level:

$\sup_{f\in\mathcal{H}_X,\, g\in\mathcal{H}_Y}\ \langle f, \Sigma_{XY}\, g\rangle_{\mathcal{H}_X} \quad \text{subject to} \quad \langle f, \Sigma_{XX} f\rangle_{\mathcal{H}_X} = 1,\ \langle g, \Sigma_{YY}\, g\rangle_{\mathcal{H}_Y} = 1. \quad (7.1)$

If we represent this problem in coordinate form, we have

$\sup_{f,\, g\in\mathbb{R}^n}\ f^{\top} G_{B_X}[\Sigma_{XY}]\, g \quad \text{subject to} \quad f^{\top} G_{B_X}[\Sigma_{XX}] f = g^{\top} G_{B_Y}[\Sigma_{YY}]\, g = 1.$

By expressing the covariance operators in terms of Gram matrices, we have

$\sup_{f,\, g\in\mathbb{R}^n}\ n^{-1} f^{\top} G_{B_X} G_{B_Y}\, g \quad \text{subject to} \quad n^{-1} f^{\top} G_{B_X}^{2} f = n^{-1} g^{\top} G_{B_Y}^{2}\, g = 1.$

Let φ = n^{−1/2}G_{B_X} f and ψ = n^{−1/2}G_{B_Y} g. Then, the above problem is reformulated as

$\sup_{\varphi,\,\psi\in\mathbb{R}^n}\ \varphi^{\top}(G_{B_X} + \epsilon_n I_n)^{-1} G_{B_X} G_{B_Y}(G_{B_Y} + \epsilon_n I_n)^{-1}\psi \quad \text{subject to} \quad \|\varphi\| = \|\psi\| = 1. \quad (7.2)$

Let φ, ψ ∈ ℝ^n be the singular vectors corresponding to the first singular value of (7.2). Then the solutions of (7.1), f, g ∈ ℝ^n, are n^{1/2}(G_{B_X} + ε_n I_n)^{-1}φ and n^{1/2}(G_{B_Y} + ε_n I_n)^{-1}ψ, respectively. This means that the KCCA estimates are represented by f̂(x) = Σ_{j=1}^n f_j b_j^{(X)}(x) and ĝ(y) = Σ_{j=1}^n g_j b_j^{(Y)}(y).
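Finally, here is a minimal Python sketch of the sample-level KCCA in coordinate form, assuming the centered Gram matrices from Section 7.2 and a fixed ridge term ε_n; only the first pair of canonical functions is returned, and the names are illustrative.

```python
import numpy as np

def kcca(GX, GY, eps=1e-3):
    """KCCA in coordinate form (a sketch following Section 7.5).

    GX, GY : centered Gram matrices of the predictor and response kernels.
    Returns the coefficient vectors f, g of the first pair of canonical
    functions f_hat(x) = sum_j f_j b_j^X(x) and g_hat(y) = sum_j g_j b_j^Y(y).
    """
    n = GX.shape[0]
    RX = np.linalg.inv(GX + eps * np.eye(n))
    RY = np.linalg.inv(GY + eps * np.eye(n))
    M = RX @ GX @ GY @ RY                   # matrix of the singular value problem (7.2)
    U, s, Vt = np.linalg.svd(M)
    phi, psi = U[:, 0], Vt[0, :]            # leading left/right singular vectors
    f = np.sqrt(n) * RX @ phi               # f = n^{1/2} (G_BX + eps I)^{-1} phi
    g = np.sqrt(n) * RY @ psi               # g = n^{1/2} (G_BY + eps I)^{-1} psi
    return f, g
```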

8. Concluding remark

We have endeavored to integrate a range of nonlinear SDR methods within the framework established by Lee et al. (2013) and Li and Song (2017). For practical implementation of nonlinear SDR methods, we recommend using KCCA, GSIR, or GSAVE in general, because KSIR requires the response variable to be scalar and the other methods are known to perform better than KSIR. KCCA and GSIR show similar performance, while GSAVE provides slightly different information: KCCA and GSIR rely on E(f(X)|Y), whereas GSAVE extracts information from var(f(X)|Y). Thus, the performance depends on the structure of the data; see Lee et al. (2013) for a numerical comparison of the methods. However, it is not easy to determine which method is superior based solely on the dataset. Therefore, we suggest trying different methods and choosing the one that works best under an appropriate criterion.

One aspect we have not addressed is the dimensionality of the reduced feature space. The target of nonlinear SDR is the central class 𝔖_{Y|X}, whereas dimensionality is defined for linear subspaces as a geometric structure, leading to confusion in determining an appropriate dimension. For instance, the space spanned by {f_1, f_2}, where f_1(x) = x and f_2(x) = x³, is a two-dimensional space, but the two functions generate the same σ-field: σ(f_1(X)) = σ(f_2(X)) = σ(f_1(X), f_2(X)). This highlights the subtle differences encountered when dealing with nonlinear methods, making the selection of dimensionality a distinctly different problem from that in linear methods. In practice, we assume a finite rank for the target operator defined on the RKHS and extract the dimension from it. For example, Li and Song (2017) proposed a BIC-based criterion for selecting the dimension d. Nevertheless, more theoretical analysis is necessary to understand the dimensionality of the central class.

References
  1. Aronszajn N (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337-404.
  2. Baker CR (1973). Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186, 273-289.
  3. Berlinet A and Thomas-Agnan C (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics, Boston, MA, Kluwer Academic Publishers.
  4. Cook RD (1998). Regression Graphics, New York, Wiley.
  5. Cook RD and Weisberg S (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86, 328-332.
  6. Fukumizu K, Bach FR, and Gretton A (2007). Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8, 361-383.
  7. Lee K-Y, Li B, and Chiaromonte F (2013). A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. The Annals of Statistics, 41, 221-249.
  8. Li B (2018). Sufficient Dimension Reduction: Methods and Applications with R, Boca Raton, FL, CRC Press.
  9. Li B and Song J (2017). Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics, 45, 1059-1095.
  10. Li K-C (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316-327.
  11. Schölkopf B and Smola AJ (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge, MA, MIT Press.
  12. Song J, Kim K, and Yoo JK (2023). On a nonlinear extension of the principal fitted component model. Computational Statistics and Data Analysis, 182, 107707.
  13. Wu H-M (2008). Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17, 590-610.
  14. Yeh Y-R, Huang S-Y, and Lee Y-J (2008). Nonlinear dimension reduction with kernel sliced inverse regression. IEEE Transactions on Knowledge and Data Engineering, 21, 1590-1603.