TEXT SIZE

CrossRef (0)
Concave penalized linear discriminant analysis on high dimensions

Sunghoon Kwona, Hyebin Kima, Dongha Kimb, Sangin Lee1,c

aDepartment of Applied Statistics, Konkuk University, Korea;
bDepartment of Statistics, Sungshin Women’s University, Korea;
cDepartment of Information and Statistics, Chungnam National University, Korea
Correspondence to: 1 Department of Information and Statistics, Chungnam National University, 99 Daehak-ro, Yuseonggu, Daejeon 34134, Korea. E-mail: sanginlee44@gmail.com
This paper was supported by Konkuk University in 2021.
Received December 10, 2023; Revised March 20, 2024; Accepted April 9, 2024.
Abstract
The sparse linear discriminant analysis can be incorporated into the penalized linear regression framework, but most studies have been limited to specific convex penalties, including the least absolute selection and shrinkage operator and its variants. Within this framework, concave penalties can serve as natural counterparts of the convex penalties. Implementing the concave penalized direction vector of discrimination appears to be straightforward, but developing its theoretical properties remains challenging. In this paper, we explore a class of concave penalties that covers the smoothly clipped absolute deviation and minimax concave penalties as examples. We prove that employing concave penalties guarantees an oracle property uniformly within this penalty class, even for high-dimensional samples. Here, the oracle property implies that an ideal direction vector of discrimination can be exactly recovered through concave penalized least squares estimation. Numerical studies confirm that the theoretical results hold with finite samples.
Keywords : concave penalties, linear discriminant analysis, direction vector, oracle property, high dimension
1. Introduction

The Linear Discriminant Analysis (LDA) stands out as a popular technique for classification tasks. However, its applicability falters when the sample is high-dimensional; that is, the number of input variables p is larger than the sample size n, which is a situation that has become increasingly prevalent in practical scenarios. In these cases, the classical LDA faces a critical limitation in estimating the population covariance matrix due to the singularity of the pooled sample covariance matrix. For example, Bickel and Levina (2004) pointed out that the performance of the classical LDA can be as poor as a random guess for a high-dimensional sample.

Many authors have studied modifications of the classical LDA concerning high-dimensional sample, with early proposals based on the independence rule and variable selection. Bickel and Levina (2004) proposed to use the independence rule or the diagonal LDA paradigm for estimating the population covariance matrix by ignoring the correlation structure of the input variables. Tibshirani et al. (2002) and Fan and Fan (2008) introduced the nearest shrunken centroid estimation and the features annealed independent rule, in which the variable selection is done by soft and hard thresholding rules, respectively. However, these methods do not achieve Bayes classification error for the classification task unless the true common covariance matrix is diagonal.

Recently, the sparse LDA approaches have been proposed as an alternative to the independence rule. Cai and Liu (2011) proposed the linear programming discriminant rule that finds a sparse estimate of the population covariance matrix by using the Dantzig selector (James et al., 2009), and Fan et al. (2012) proposed the regularized optimal affine discriminant method, which is based on the β1-Fisher’s discriminant analysis. Similar β1-penalized linear discriminant analysis was studied in Trendafilov and Jolliffe (2007) and Witten and Tibshirani (2011). Clemmensen et al. (2011) proposed the sparse optimal scoring method which solves the optimal scoring formulation (Hastie et al., 1994) with the β1-penalty. The Direct Sparse Discriminant Analysis (DSDA) studied in Mai et al. (2012) is another popular sparse LDA, which is computationally efficient and easier to understand, because it reformulates the high-dimensional LDA into a penalized linear regression framework. In general, the sparse LDA approaches have theoretical advantages over other methods of independence rules since they allow non-diagonal population covariance matrices, enabling the resulting classifier to achieve Bayes classification error. However, most previous works have been limited to specific convex penalties such as the Least Absolute Selection and Shrinkage Operator (LASSO) and its variants.

In this study, we focus on the DSDA applied with a class of concave penalties, including the smoothly clipped absolute deviation (Fan and Li, 2001), minimax concave (Zhang, 2010), and truncate d-β1 (Shen et al., 2013) penalties as examples. Within this framework, implementing the concave penalized direction vector of discrimination appears to be straightforward, but developing the theoretical properties of the estimated direction vector still remains challenging. For this purpose, we prove that the concave penalized direction vector satisfies an oracle property uniformly in the class, which is the main contribution of the paper. Here, the oracle property implies that an ideal LDA direction vector of discrimination can be exactly recovered through concave penalized least square estimation. In addition, we provide various numerical studies to confirm whether the theoretical results hold with finite samples.

The rest of the paper is organized as follows. Section 2 introduces the sparse LDA. Section 3 introduces the concave penalized LDA and presents the related theoretical results. Section 4 provides numerical studies to confirm the theoretical results, and concluding remarks are given in Section 5. Technical details and proofs are provided in Appendix.

2. Sparse linear discriminant analysis

2.1. Fisher’s linear discriminant analysis

Fisher’s Linear Discriminant Analysis (LDA) (Fisher, 1936) is an efficient technique for discriminating a binary class label C ∈ {0, 1} given an input vector X ∈ βp. The LDA assumes that X is a random vector with multivariate normal distribution given a binary random variable C such that

Xβ£C=c~N(μc,Σc), ββ ββ ββc{0,1},

independently, where μc and ∑c are the mean vector and covariance matrix of the normal distribution, respectively. Let φ(·; μc, ∑c), c ∈ {0, 1} be the density function of N(μc, ∑c) and πc = P(C = c). The ratio between two conditional density values of C|X = x provides a natural rule for discriminating the class label as C = 1 given X = x as follows.

π1φ(x;μ1,Σ1)π0φ(x;μ0,Σ0)1, ββ ββ ββxβp.

In addition, when the covariance matrices in (2.1) are assumed to be common, ∑0 = ∑1 = ∑, the discrimination rule in (2.2) can be simplified into a linear form,

ψ(x)={x-(μ0+μ1)/2}TβBS+log (π1/π0)0, ββ ββ ββxβp,

where βBS = ∑−1(μ1μ0) is Bayes direction vector.

Let (xi, ci), in be n random samples of (X, C) then the LDA estimate ψΜ of ψ in (2.3) can be obtained by using simple moment type estimates:

ψ^LDA(x)={x-(μ^0+μ^1)/2}Tβ^LDA+log (π^1/π^0)0, ββ ββ ββxβp,

where βΜLDA = ∑Μ−1(μΜ1μΜ0) is the LDA direction vector,

μ^c=ci=cxi/nc,Σ^c=ci=c(xi-μ^c)(xi-μ^c)T/(nc-1),Σ^=c{0,1}(nc-1)Σ^c/(n-2),π^c=nc/n,

and nc is the number of samples with ci = c.

2.2. Sparse linear discriminant analysis

In many real fields, application of the LDA raises two challenging problems because of sparsity of the model and high-dimensionality of the samples. The sparsity of the model implies that Bayes direction vector satisfies βjBS=0 for some jp but, in general, the LDA estimate in (2.4) produces β^jLDA0 for all jp. The high-dimensionality of the samples implies p > n, where the pooled sample covariance matrix ∑Μ in (2.5) becomes singular. In both cases, the discriminant analysis may produce unexpected bad results in prediction as well as interpretation. There are many literatures that figure out these problems and the penalized approach via the least squares estimation can be a nice solution (Mai et al., 2012).

Let yi = (−1)1+cin/nci, in, ci ∈ {0, 1} then the LDA direction vector in (2.4) can be interpreted as a Least Squares Estimator (LSE) as follows:

β^LSE=kβ^LDA

for some positive constant k unless μΜ0μΜ1 (Hastie et al., 2009), where

(α^LSE,β^LSE)=arg minα,βi=1n(yi-α-xiTβ)2/2n.

Hence the LDA rule in (2.4) can be cast into the LSE rule:

ψ^LSE(x)={x-(μ^0+μ^1)/2}Tβ^LSE+klog (n1/n0)0, ββ ββ ββxβp.

Given the sparsity and high-dimensionality, it is natural to employ the penalized LSE:

(α^λ,β^λ)=arg minα,β{i=1n(yi-α-xiTβ)2/2n+j=1pJλ(β£βjβ£)},

where Jλ is a penalty equipped with a tuning parameter λ. In this case, the LSE rule in (2.8) can be replaced with the penalized LSE rule again:

ψ^λ(x)={x-(μ^1+μ^0)/2}Tβ^λ+kλlog (n1/n0)0,

where kλ = βΜλT∑Μ βΜλ/(μΜ1μΜ0)TβΜλ is given by Mai et al. (2012) that is optimal to the discrimination whenever ( μΜ1μΜ0)TβΜλ > 0.

3. Concave penalized linear discriminant analysis

3.1. Concave penalized estimation

Let Jλ be a penalty function with tuning parameter λ > 0 and Jλ be the first derivative of Jλ. We assume that following conditions hold.

• (J1) Jλ(t) is concave and non-decreasing over t ∈ [0, ∞) and Jλ(0) = 0.

• (J2) Jλ(t) is non-increasing and continuous over t ∈ (0, ∞) and limt0+Jλ(t)=λ.

• (J3) Jλ(t)λ-t/a over t ∈ (0, ) and Jλ(t)=0 over t ∈ (, ∞) for some a > 0.

The class of penalties that satisfy (J1), (J2), and (J3) has been studied as a representative concave penalty class (Kim and Kwon, 2012; Zhang and Zhang, 2012), including the Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li, 2001),

Jλ(t)=min {λ,(aλ-t)+/(a-1)}, ββ ββ ββa>2,

minimax concave (MCP) (Zhang, 2010),

Jλ(t)=(λ-t/a)+, ββ ββ ββa>1,

and truncated-β1 penalties (TLP) (Shen et al., 2013)

Jλ(t)=λmin {t,a}, ββ ββ ββa>1,

as examples, where x+ = max{0, x}.

Let Z = (Z1, . . . , Zp) be the centered design matrix in (2.7) then the LSE direction vector in (2.7) can be simplified into

β^LSE=arg minβ{βZββ22/2n-(μ^1-μ^0)Tβ}.

Now, considering the true signal set S={jβ£βjBS0}, one of the best estimators for the sparse Bayes direction vector is the oracle LSE,

β^OR=arg minβj=0,jSc{βZββ22/2n-(μ^1-μ^0)Tβ},

which is unavailable in practice without the knowledge of . Hence, the main objective of the concave penalized LDA is to recover βΜOR through the penalized LSE:

β^λ=arg minβLλ(β),

where

Lλ(β)=βZββ22/2n-(μ^1-μ^0)Tβ+j=1pJλ(|βj|).

3.2. Optimality conditions

In this subsection, we introduce three lemmas that provide non-asymptotic sufficient conditions for a given estimator to be a local or unique local minimizer of Lλ. Let Ξλ be the set of all local minimizers of Lλ and θΜ = μΜ1μΜ0.

Lemma 1

If βΜ ∈ βpsatisfies minβΜj≠0 |βΜj| > aλ and maxβ^j=0β£ZjTZβ^/n-θ^jβ£λ then βΜ ∈ Ξ.

The conditions in Lemma 1 are simply the sub-gradient optimality conditions for the penalty class that satisfy penalty conditions (J1), (J2), and (J3), under which a given estimator becomes a local minimizer (Kim et al., 2008; Kwon et al., 2021) of Lλ. We say that βΜ ∈ Ξλ satisfies the uniqueness condition (Kim and Kwon, 2012) with ρ > 0 if

maxβ^j=0|ZjTZβ^/n-θ^j|<λmin{1,aρ}λ(1+aρ)<ρminβ^j0|β^j|.

The following lemma states that the uniqueness condition forms a sufficient condition for a local minimizer to be unique. Let ρmin(A) be the minimum eigenvalue of a square matrix A.

Lemma 2

If βΜ ∈ Ξλ satisfies the uniqueness condition with ρ = ρmin(ZTZ/n) > 0 then βΜ is a unique local minimizer of Lλ, that is, Ξλ = {βΜ }.

Lemma 2 requires ZTZ/n is non-singular which is impossible to hold for high-dimensional samples with p > n. However, we can extend the result to the cases by assuming the true model is sparse enough to estimate. For any subset , let be a sub-design matrix of Z constructed by the columns Zj, . We say that Z satisfies the Sparse Riesz condition (Zhang, 2010) with ρ > 0 and rank r > 0 if

minβ£Dβ£rρmin(ZDTZD/n)>ρ.

Let Ξλκ={β^Ξλ:ββ^β0κ}Ξλ be a restricted set of local minimizers of Lλ for some κp, where ||a||0 denotes the number of nonzero elements of a vector a. The following lemma extends the result of Lemma 2 to high-dimensional cases.

Lemma 3

If Z satisfies the Sparse Riesz condition with ρ > 0 and rank r ≥ 2κ and βΜ ∈ Ξλ satisfies the uniqueness condition with ρ then Ξλκ={β^}.

Remark 1

(a) The uniqueness and Sparse Riesz conditions in (3.3) and (3.4) are modified versions for the penalized LDA from those for the penalized linear regression (Kim and Kwon, 2012; Zhang, 2010), respectively. Lemma 1,2, and 3 are direct applications of Theorem 2 and 3 in Kim and Kwon (2012). (b) In Lemma 3, κp represents the maximum number of candidate input variables that can be included in the final model, expecting that the corresponding local minimizer is unique in Ξλκ, which is impossible for Ξλ in general. Hence, it is reasonable to assume that 2κrn since κ should be assumed or expected to be significantly less than the sample size.

3.3. Oracle property

In this subsection, we present the main results of the paper: The concave penalized LSE in (3.2) is unique, and hence, the global minimizer of Lλ, and the same as the oracle LSE in (3.1) asymptotically, in the sense that Ξλκ={β^λ} and βΜλ = βΜOR with probability tending to 1.

3.3.1. Notations

We introduce some notations. For any matrix A, let ||A|| = maxij |Aij|, where Aij is the (i, j) entry of A, and let be a sub-matrix of A constructed by the entry Aij, . For any vector a, let ||a|| = maxi |ai| and ||a||1 = ∑i |ai|, where ai is the ith element of a, and let be a sub-vector of a constructed by the elements ai, . Let be the cardinality of a set . For any positive sequences xn and yn, we write xn β« yn if xn/yn → ∞ and xn β yn if xn/yna for some constant a as n→ ∞.

Recall the definition of the oracle LSE βΜOR obtained under the knowledge of the true signal set S={jβ£βjBS0}. It is easy to see that

β^SOR=Ω^SS-1θ^S ββ ββ ββand ββ ββ βββ^NOR=0,

where ΩΜ = ZTZ/n, θΜ = μΜ1μΜ0, and N={jβ£βjBS=0} is the true noisy set. Hence, we can construct a population counterpart βOR of βΜOR by defining

βSOR=ΩSS-1θS ββ ββ ββand ββ ββ βββNOR=0,

where Ω = Cov(X) and θ = μ1μ0. Note that βOR is exclusively considered for the development of theoretical properties since βOR = BS for some constant k ≠ 0. For further details, refer to Proposition 3 in Mai et al. (2012).

3.3.2. Oracle property

The main objective of the concave penalized LSE is to recover βΜOR by using βΜλ in (3.2), which is often referred to the strong oracle property (Fan and Li, 2001; Kim et al., 2016). From the lemmas in previous subsection, it is sufficient to prove that βΜOR satisfies the outlined uniqueness and Sparse Riesz conditions with probability tending to 1 of which the proofs are provided in Appendix. We assume the following regularity conditions:

(C1) There exist positive constants, bi, i ≤ 4, such that

ρmin(Ω)>b1, ββ ββ βββΩNSΩSS-1β<b2, ββ ββ βββΩSS-1β<b3, ββ ββ ββand ββ ββ βββθSβ<b4,

for any n.

Remark 2

The conditions in (C1) specify technical requirements for the oracle property that are slightly weaker than those applied with the Least Absolute Selection and Shrinkage Operator (LASSO) in Mai et al. (2012). For the LASSO, we need b1 = 1 that plays as the strong irrepresentable condition in Zhao and Yu (2006) for linear regression.

Theorem 1

Assume that (C1) holds. The oracle LSE is unique local, and hence, the global minimizer of Lλwith probability tending to one, in the sense that

limnP(Ξλκ={β^OR})=1,

for any sequence κ = κ(n) that satisfies 2qκn, provided that

λ=o(mS), ββ ββ ββlog p=o(min {n/κ3,nλ2/q2}) ββ ββ ββand ββ ββ ββmin {n/κ2,nλ2/q2},

as n→ ∞, whereandmS=minjSβ£βjORβ£.

Remark 3

If n/κ3 β 2/q2 → ∞then conditions in Theorem 1 can be simplified as

mSβ«λβ«qlog p/n ββ ββ ββand ββ ββ ββlog p=o(nλ2/q2)

as n→ ∞. Therefore, we can observe the following theoretical properties:

• The penalized LDA allows for exponentially many input variables, polynomially many signal variables, and diminishing regression coefficients, satisfying that p = o(exp(2/q2)), q = o(n1/3), and at a slower rate than qlog p/n.

• Compared with the ordinary penalized linear regression (Zhang, 2010), the requirements are stronger since p = o(exp(2)), q = o(n), and 0 at a slower rate than log p/n.

• The stronger requirements mainly comes from the random design matrix Z in (2.7) and the random Sparse Riesz condition in (3.4). We can check similar results in many literatures, for example, p = o(n1/2) for the autoregressive model (Na and Kwon, 2018) and q = o(n1/5) for general maximum likelihood estimation (Fan and Peng, 2004; Kwon and Kim, 2012).

4. Numerical studies

In this section, we report some results of simulation studies and real data analysis.

4.1. Simulation studies

We generated n random samples of C from Ber(π) with π = 1/2, and then generated n random samples of X|C = c, c ∈ {0, 1} from N(μc, ∑). We set μ0 = 0 and μ1 = ∑βBS, where ∑jk = r|jk| and βjBS=(2/j)(-1)jI(jq), j, kp. We consider three concave penalties, TLP with a = 0.001, MCP with a = 1.001, and SCAD with a = 2.001, as examples in the penalty class, and set n ∈ {500, 1000, 2000}, p ∈ {300, 3000}, and q ∈ {10, 20} with r ∈ {0.25, 0.5} for each simulation. We repeated the simulation 200 times by using R package ncpen (Kim et al., 2020).

We first investigated whether the oracle property can hold with finite samples and Table 1 shows the estimated probabilities of achieving the oracle property. For each simulation, we first found the interval [λmax, λmin] that satisfies ||βΜλmax||0 = 0 and ||βΜλmin||0 = 50. Then we checked whether there exists a λ that satisfies βΜλ = βΜOR by investigating 200 values of λ decreasing with log-scale in the interval. From the table, we observed the followings:

• The ratios of the TLP and MCP approaches nearly 1 for some cases, while the SCAD has lower ratios, with the largest ratio being 0.79. In cases where both p and q are large, it is rare for the oracle property to hold, but we believe that this occurs due to limitations in the simulation settings. In general, we can conclude that the ratio increases as n increases and p, q, and r decreases, regardless of the simulation settings and penalties, which confirms the result of Theorem 1.

• In addition, we checked the sign consistency, sign(βΜλ) = sign(βΜOR) for some λ, which is a slightly less stringent condition than the oracle property. The results exhibit similar pattern to that of the oracle property but the ratios are slightly larger. We found that the LASSO also achieves the sign consistency as Mai et al. (2012) proved. In general, the sign consistency of the LASSO does not hold even for the ordinary linear regression since it requires the strong irrepresentale condition on the design matrix. See Section 3 and example 3 in Zhao and Yu (2006) for some details.

Second, we compared the finite sample performance of the concave penalized estimators, using the LASSO as a benchmark method. The primary objective of the simulation is to check whether the oracle property can be realized through typical tuning parameter selection criteria. Note that there are two natural tuning parameter selection criteria from the characteristic of the framework: Minimizing the regression error in (2.7) and the classification error in (2.9). We used n training samples for estimating the penalized direction vector and n/2 independent validation samples for selecting an optimal tuning parameter λopt by minimizing the two criteria. We calculated three measures of selection accuracy: The numbers of true and false positive selection (TPS and FPS) and the ratio of the correct model selection (CMS):TPS=ΣjI(β^jλopt0,βjBS0),FPS=ΣjI(β^jλopt0,βjBS=0), and CMS is the ratio of the cases when TPS = q and FPS = pq. In addition, we calculated the miss-classification error rate (ERR) by using 2n independent test samples.

Tables 2 and 3 summarizes the results. We first discuss the cases when λopt was chosen based on the regression errors of the validation samples:

• TPS increases to q as n increases, regardless of the penalties and simulation settings. This suggests that the signal variables tend to be selected as the sample size increases. As p or r increases while keeping n fixed, TPS decreases. This implies that higher model complexity and stronger correlation among the input variables have a negative effect on selection of signal variables. Among the concave penalties, the SCAD produces the largest TPS but the difference does not seem significantly large.

• For the concave penalties, FPS decreases as n increases for almost all cases. This implies that the noisy variables tend to be excluded as the sample size increases, which is not true for the LASSO. FPS of the SCAD is smaller than that of the LASSO but it is not at a negligible level even when n = 2000. However, FPS for the TLP and MCP are near 0 when n = 2000, implying that they excluded nearly all noisy variables.

• For the concave penalties, CMS increases as n increases regardless of the penalties and simulation settings, achieving the largest score 0.80, 0.77, and 0.34 for the TLP, MCP, and SCAD, respectively. For the LASSO, CMS is consistently 0, indicating that the regression errors does not work for the LASSO to select correct models.

• One of the reasons for the poor FPS and CMS of the SCAD seems to be because its shape or concavity on the interval [0, ]. Since the SCAD penalty is the same as the LASSO on [0, λ] and a > 2 is limited below, the maximum concavity on [0, ] is smaller than that of the MCP with a > 1, which results in selecting more variables with higher FPS and smaller CMS than those of the MCP. We refer to Zhang (2010) for some detailed discussion on the maximum concavity of the concave penalties.

• ERR decreases to that of the Bayes, the best performance, as n increases but increases as p or r increases with fixed n, regardless of the penalties and simulation settings.

For the cases of classification errors, we observed the followings:

• TPS shows similar patterns as n increases, resembling the case of regression errors. This occurs regardless of the penalties and simulation settings.

• The LASSO and SCAD select significantly less noisy variables, producing lower FPS and slightly higher CMS. However, the TLP and MCP show opposite results.

To sum up, (a) we can conclude that tuning parameter selection based on the regression errors exhibits better selection performance for the TLP and MCP. However, for the LASSO and SCAD, the classification errors seems to be more informative, as indicated by smaller FPS and larger CMS. (b) The two tuning parameter selection criteria can function to some extent in the simulation settings designed in this paper. (c) However, the probabilities of correctly identifying the true model in Table 1 consistently reach nearly 1, regardless of the penalties and simulation settings. This suggests that we need to develop other alternatives such as the information criterion based on the underling probability structure.

4.2. Real data analysis

We apply the penalized LDA methods to two benchmark datasets: The prostate and lung cancer datasets, which is available from the R packages datamicroarray. The prostate cancer dataset consists of the expression levels of approximately 12,600 genes obtained from 52 prostate tumour samples and 50 non-tumour prostate samples (Singh et al., 2002). Likewise, the lung cancer dataset comprises 12,533 genes from 150 patients with adenocarcinoma and 31 patients with malignant pleural mesothelioma (Gordon et al., 2002). The main task is to predict whether an observation is the specific tumor tissue or not, and to identify the genes associated with each type of cancer. We first choose top p ∈ {500, 1000, 2000} genes with the largest t-statistics across two classes. We then applied the penalized LDA methods with two classes as a response variable and the chosen top p genes as predictive variables. For comparison, we split the data into the training and test sets by randomly choosing 3/4 samples and 1/4 samples, respectively. For each training set, the tuning parameter λ is selected by the 10-fold cross-validation methods based on regression error and classification error as in the simulations. We compute the mis-classification error rate (ERR), the number of mis-classified samples (#MIS) on the test set, and the model size (SIZE) representing the number of selected variables on the training set.

Table 4 presents the average values of the three measures obtained from 200 random partitions of data. In most cases, the LASSO shows the best prediction performance but selects the most variables. The TLP, MCP, and SCAD show similar prediction performances and they substantially selects fewer variables than the LASSO. Among the concave penalized methods, the SCAD selects more variables than other methods. Similar to the simulation results, the methods based on regression error exhibit better prediction performances. For model size, the LASSO and SCAD based on the classification error produce more sparse models while the methods based on regression error produce more sparse model for TLP and MCP. These results suggest that the concave penalized LDA method could be a good alternative when we wish to construct the sparse model without losing the prediction accuracy much.

5. Concluding remarks

In this paper, we studied the high-dimensional LDA based on the concave penalized linear regression. We proved that an oracle property holds uniformly on a class of concave penalties, including the TLP, SCAD, and MCP as examples. The primary advantage of the concave penalized approach lies in its superior selection performance compared to the convex penalized approach, as supported by the simulation studies. In addition, we found that the tuning parameter selection criteria based on the prediction errors work to some extent, and hence, we may use the criteria in practice. However, we believe that there are better alternatives, such as the Bayesian information criterion, which has been proven to be useful for other penalized approaches, including the penalized linear regression (Fan and Tang, 2012).

Appendix

Let ||A2|| = sup||u||2=1 ||Au||2 denote the spectral norm of a matrix A. Let δΜ = θΜθ, ΔΜ = ΩΜ − Ω, and ΓΜ = ΩΜ−1 − Ω−1.

Lemma 4.

(Mai et al., 2012) There exist positive constants ε0 and ci, i ≤ 4 such that

P ( β£ δ ^ D β£ Ι ) 2 r exp  ( - c 1 n Ι 2 ) , P ( β Δ ^ D D β Ι ) 2 r 2 exp  ( - c 2 n r - 2 Ι 2 ) , P ( β Γ ^ D D β Ι ) 2 r 2 exp  ( - c 4 n r - 2 Ι 2 ) , P ( β Λ ^ D c D β Ι ) 2 p r exp  ( - c 3 n r - 2 Ι 2 ) ,

for any εε0 and subset , where and Λ ^ D c D = Ω ^ D c D Ω ^ D D - 1 - Ω D c D Ω D D - 1.

Proof of Theorem 1

Let ρ = ρmin(Ω) and κ be a sequence with 2qκn. First, we will show that Z satisfies the Sparse Riesz condition with ρ/2 and rank κ with probability tending to 1. For any subset ,

ρ min ( Ω ^ D D ) = inf β u β 2 = 1 { u T Ω D D u - u T ( Ω D D - Ω ^ D D ) u } ρ min ( Ω D D ) - β Ω D D - Ω ^ D D β 2 .

From Cauchy’s interlacing theorem, . By Lemma 4, it follows that

P ( min β£ D β£ κ ρ min ( Ω ^ D D ) ρ / 2 ) P ( max β£ D β£ κ β Ω D D - Ω ^ D D β 2 min β£ D β£ κ ρ min ( Ω D D ) - ρ / 2 ) β£ D β£ κ P ( β Δ ^ D D β ρ / 2 ) p κ 2 κ 2 exp  ( - c 2 n κ - 2 ( ρ / 2 ) 2 ) 0 ,

provided that n/κ2 → ∞ and n/κ2 β« κ log p as n → ∞.

Second, we will prove that βΜOR satisfies the first inequality in the uniqueness condition with ρ/2. On the event is invertible since . Hence, (5.1) implies that β ^ S OR satisfies

β ^ OR = Ω ^ S S - 1 θ ^ S

with probability tending to 1. By the triangular inequality,

min j S | β ^ j OR | min j S | β j OR | - max j S | β ^ j OR - β j OR | m S - β β ^ S OR - β S OR β .

Since β S OR = Ω S S - 1 θ S,

β ^ S OR - β S OR = Ω ^ S S - 1 θ ^ S - Ω S S - 1 θ S = Γ ^ S S δ ^ S + Ω S S - 1 δ ^ S + Γ ^ S S θ S .

From Lemma 4, there exist positive constants d1 and d2 such that

β β ^ S OR - β S OR β β Γ ^ S S β β δ ^ S β + β Ω S S - 1 β β δ ^ S β + β Γ ^ S S β β θ S β d 1 β Γ ^ S S β + d 2 β δ ^ S β ,

for all sufficiently large n. From Lemma 4 again, there exist positive constants d3 and d4 such that

P ( min j S β£ β ^ j OR β£ λ ( a + 2 / ρ ) ) P ( m S - β β ^ S OR - β S OR β λ ( a + 2 / ρ ) ) P ( d 1 β Γ ^ S S β + d 2 β δ ^ S β m S - λ ( a + 2 / ρ ) ) P ( β Γ ^ S S β d 3 m S ) + P ( β δ ^ S β d 4 m S ) 2 q 2 exp  ( - c 4 n q - 2 ( d 3 m S ) 2 ) + 2 q exp  ( - c 1 n ( d 4 m S ) 2 ) ,

for all sufficiently large n since . Hence the first inequality in (3.3) holds with probability tending to 1, provided that n m S 2 / q 2 β« n λ 2 / q 2 and 2/q2 β« log p as n→ ∞.

Third, we will show that the third inequality in the uniqueness condition holds with probability tending to 1. By using θ N = Ω N S Ω S S - 1 θ S,

θ ^ N - Ω ^ N S β ^ S OR = δ ^ N + θ N - Ω ^ N S Ω ^ S S - 1 θ ^ S = δ ^ N + Ω N S Ω S S - 1 θ S - Ω ^ N S Ω ^ S S - 1 θ ^ S .

From Lemma 4, there exist positive constants e1 and e2 such that

β θ ^ N - Ω ^ N S β ^ S OR β β δ ^ N β + β Ω N S Ω S S - 1 θ S - Ω ^ N S Ω ^ S S - 1 θ ^ S β β δ ^ N β + β Λ ^ N S β β δ ^ S β + β Ω N S Ω S S - 1 β β δ ^ S β + β θ S β β Λ ^ N S β e 1 β δ ^ β + e 2 β Λ ^ N S β ,

for all sufficiently large n. From Lemma 4 again, there exist positive constants e3 and e4 such that

P ( β θ ^ N - Ω ^ N S β ^ S OR β > λ min  { 1 , a ρ / 2 } ) P ( e 1 β δ ^ β + e 2 β Λ ^ N S β > λ min { 1 , a ρ / 2 } ) P ( β δ ^ β > e 3 λ ) + P ( β Λ ^ N S β > e 4 λ ) 2 p exp  ( - c 1 n ( e 3 λ ) 2 ) + 2 ( p - q ) q exp  ( - c 3 n q - 2 ( e 4 λ ) 2 ) ,

for all sufficiently large n. Hence the third inequality in (3.3) holds with probability tending to 1, provided that 2/q2 → ∞ and 2/q2 β« log p as n → ∞.

Acknowledgement
This paper was supported by Konkuk University in 2021.
TABLES

Table 1

Estimated probabilities of including correct model

rpqnOracle propertySign consistency

0.25300105000.380.3700.530.380.380.14
10000.950.970.130.990.960.970.78
200010.980.791110.99

205000000000
10000.110.1100.130.110.110.03
20000.870.870.010.830.870.870.43

3000105000.050.0700.100.050.080.02
10000.890.8900.900.890.900.48
200010.980.331110.99

205000000000
10000000.01000
20000.500.5000.500.500.510.09

0.5300105000.050.11000.060.110
10000.720.760.010.300.720.760.18
2000110.530.88110.86

205000000000
100000.010000.010
20000.380.3800.040.380.380.01

3000105000000000
10000.320.3700.030.320.370
20000.960.970.140.710.960.970.36

205000000000
10000000000
20000.080.08000.080.080

Table 2

Averages of the four measures: Validation by regression errors

rpqnTPSFPS

0.25300105009.979.319.339.6529.291.151.0414.01
100010.009.999.9910.0030.480.660.518.08
200010.0010.0010.0010.0029.400.480.374.92

2050018.6613.4413.4215.6229.742.362.1812.64
100019.9418.9418.9819.0530.022.502.3410.43
200019.9919.9419.9419.9830.260.720.518.46

3000105009.747.697.808.8136.881.120.9618.79
100010.009.919.919.9638.560.570.4417.28
200010.0010.0010.0010.0037.880.470.306.08

2050015.648.558.6311.6734.841.311.1517.91
100019.0215.3715.3216.5532.122.211.7513.99
200019.9719.6419.6419.5231.381.210.9810.87

0.5300105009.748.208.318.8637.052.061.8013.90
10009.999.999.9710.0038.491.020.9010.03
200010.0010.0010.0010.0037.440.420.354.27

2050016.1210.2110.3011.8333.762.622.6213.12
100019.4217.3117.2516.7131.204.724.0512.86
200019.9719.7119.7419.5130.681.741.609.65

3000105007.945.875.816.4741.961.150.9419.51
10009.789.179.169.1141.511.010.8820.05
200010.009.999.9910.0041.880.390.318.20

205009.885.935.777.1939.891.321.1018.78
100014.7210.7610.7811.1536.702.001.9418.46
200018.9817.8217.8215.8832.212.562.6014.61

rpqnCMSERR

0.253001050000.280.2900.0930.0920.0920.093
100000.570.660.040.0880.0850.0850.085
200000.710.790.340.0860.0840.0840.084

2050000000.0820.0880.0880.089
100000.060.0700.0720.0710.0710.075
200000.530.6700.0680.0660.0660.067

30001050000.030.0500.1030.1050.1030.109
100000.650.6900.0910.0860.0860.090
200000.660.800.140.0870.0840.0840.084

2050000000.0970.1040.1030.107
100000000.0800.0790.0780.087
200000.350.4200.0720.0670.0670.073

0.53001050000.040.0800.1730.1680.1670.166
100000.490.5300.1590.1500.1500.149
200000.720.770.280.1520.1480.1480.148

2050000000.1670.1680.1670.165
1000000.0100.1470.1410.1410.148
200000.280.3200.1350.1300.1300.132

30001050000000.2020.1870.1850.196
100000.250.3200.1710.1560.1550.164
200000.670.760.050.1570.1480.1480.148

2050000000.2030.1880.1860.196
100000000.1710.1590.1580.166
200000.070.0700.1490.1360.1360.147

Table 3

Averages of the four measures: Validation by classification errors

rpqnTPSFPS

0.25300105009.808.798.799.3014.671.871.848.96
100010.009.829.799.9514.082.442.396.28
200010.009.959.939.9914.543.312.915.10

2050017.4112.5712.6213.7214.542.652.866.95
100019.5417.3717.4317.9015.522.312.285.80
200019.9819.5019.5319.8614.331.982.015.63

3000105009.367.367.518.0815.341.371.2310.29
100010.009.749.729.8018.611.131.3110.81
200010.009.989.969.9915.931.941.965.40

2050013.988.298.4410.2316.911.501.809.29
100018.4714.8414.9315.3617.622.472.548.24
200019.9318.9519.0819.1418.731.501.656.87

0.5300105009.207.988.068.4521.942.743.1310.80
10009.959.779.719.8423.102.822.348.84
200010.009.949.969.9721.132.832.766.34

2050014.6910.1410.6311.0022.633.324.2310.12
100018.9915.9615.9815.1923.224.244.158.63
200019.8719.2719.1919.0521.892.972.937.27

3000105007.125.605.665.9123.501.621.5412.70
10009.518.978.978.4324.921.972.1212.08
20009.999.869.859.9325.881.201.758.47

205008.295.775.696.1021.491.691.5611.51
100013.7210.4810.329.8924.713.292.5910.57
200018.7617.5117.4114.8826.593.523.429.78

rpqnCMSERR

0.25300105000.030.150.140.020.0970.0970.0970.098
10000.090.330.290.150.0890.0880.0880.088
20000.200.430.440.420.0860.0860.0860.085

2050000000.0860.0920.0910.095
100000.040.0400.0750.0740.0740.078
20000.030.290.230.070.0690.0670.0670.068

30001050000.050.0500.1070.1080.1060.113
10000.110.520.410.020.0920.0880.0880.092
20000.070.430.430.240.0870.0860.0860.085

2050000000.1020.1060.1050.112
100000000.0820.0810.0810.090
20000.040.260.2500.0730.0690.0690.075

0.53001050000.040.0400.1780.1710.1710.170
100000.250.230.030.1600.1540.1550.153
200000.500.410.170.1530.1500.1500.150

2050000000.1710.1700.1700.169
100000000.1490.1450.1450.150
200000.140.1100.1370.1320.1320.134

30001050000000.2070.1900.1890.200
100000.190.2100.1740.1600.1600.168
200000.540.470.080.1580.1500.1500.150

2050000000.2090.1910.1890.201
100000000.1740.1620.1620.171
200000.050.0400.1500.1380.1380.149

Table 4

Averages of the measures in prostate and lung cancer data analysis

DataMeasurepCross-validation by regression errorCross-validation by classification error

ProstateERR(%)5006.349.329.449.047.289.649.489.06
10007.069.409.249.127.8410.289.809.16
20007.769.409.329.308.0010.3810.569.14

#MIS5001.582.332.362.261.822.412.372.27
10001.762.352.312.281.962.572.452.29
20001.942.352.332.332.002.602.642.28

SIZE50025.314.604.0011.2412.255.985.178.45
100025.653.083.1013.409.126.436.689.03
200025.503.012.2112.946.284.465.216.40

LungERR(%)5000.941.691.661.641.572.032.091.84
10000.931.841.861.591.562.022.231.79
20000.941.802.001.501.542.202.261.91

#MIS5000.430.760.750.740.710.920.940.83
10000.420.830.840.720.700.911.010.81
20000.430.810.900.680.700.991.020.86

SIZE50058.5721.3618.0321.6012.5411.319.769.39
100060.0517.7317.5725.3112.9511.2310.269.85
200060.8117.4518.0927.8812.5111.9511.8010.10

References
1. Bickel PJ and Levina E (2004). Some theory for fisher’s linear discriminant function, naive bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10, 989-1010.
2. Cai T and Liu W (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106, 1566-1577.
3. Clemmensen L, Hastie T, Witten D, and Ersbøll B (2011). Sparse discriminant analysis. Technometrics, 53, 406-413.
4. Fan J and Fan Y (2008). High dimensional classification using features annealed independence rules. Annals of Statistics, 36, 2605-2637.
5. Fan J, Feng Y, and Tong X (2012). A road to classification in high dimensional space: The regularized optimal affine discriminant. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74, 745-771.
6. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.
7. Fan J and Peng H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928-961.
8. Fan Y and Tang CY (2012). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society Series B: Statistical Methodology, 75, 531-552.
9. Fisher RA (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
10. Gordon GJ, Jensen RV, and Hsiao LL, et al. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62, 4963-4967.
11. Hastie T, Tibshirani R, and Buja A (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89, 1255-1270.
12. Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Science & Business Media, Berlin.
13. James GM, Radchenko P, and Lv J (2009). Dasso: Connections between the dantzig selector and lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71, 127-142.
14. Kim D, Lee S, and Kwon S (2020). A unified algorithm for the non-convex penalized estimation: The ncpen package. The R Journal, 12, 120-133.
15. Kim Y, Choi H, and Oh H-S (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103, 1665-1673.
16. Kim Y, Jeon J-J, and Han S (2016). A necessary condition for the strong oracle property. Scandinavian Journal of Statistics, 43, 610-624.
17. Kim Y and Kwon S (2012). Global optimality of nonconvex penalized estimators. Biometrika, 99, 315-325.
18. Kwon S and Kim Y (2012). Large sample properties of the scad-penalized maximum likelihood estimation on high dimensions. Statistica Sinica, 22, 629-653.
19. Kwon S, Moon H, Chang J, and Lee S (2021). Sufficient conditions for the oracle property in penalized linear regression. The Korean Journal of Applied Statistics, 34, 279-293.
20. Mai Q, Zou H, and Yuan M (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99, 29-42.
21. Na O and Kwon S (2018). Non-convex penalized estimation for the ar process. Communications for Statistical Applications and Methods, 25, 453-470.
22. Shen X, Pan W, Zhu Y, and Zhou H (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65, 807-832.
23. Singh D, Febbo PG, and Ross K, et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203-209.
24. Tibshirani R, Hastie T, Narasimhan B, and Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99, 6567-6572.
25. Trendafilov NT and Jolliffe IT (2007). Dalass: Variable selection in discriminant analysis via the lasso. Computational Statistics & Data Analysis, 51, 3718-3736.
26. Witten DM and Tibshirani R (2011). Penalized classification using fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 753-772.
27. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894-942.
28. Zhang C-H and Zhang T (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27, 576-593.
29. Zhao P and Yu B (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541-2563.