High-dimensional linear discriminant analysis with moderately clipped LASSO

Jaeho Chang, Haeseong Moon, Sunghoon Kwon

Department of Applied Statistics, Konkuk University, Korea
Correspondence to: Department of Applied Statistics, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Korea.
E-mail: shkwon0522@gmail.com

This paper was supported by Konkuk University in 2019.
Received August 18, 2020; Revised October 29, 2020; Accepted November 19, 2020.
Abstract
There is a direct connection between linear discriminant analysis (LDA) and linear regression since the direction vector of the LDA can be obtained by the least square estimation. The connection motivates penalized LDA when the model is high-dimensional, where the number of predictive variables is larger than the sample size. In this paper, we study penalized LDA for a class of penalties, called the moderately clipped LASSO (MCL), which interpolates between the least absolute shrinkage and selection operator (LASSO) and the minimax concave penalty. We prove that the MCL penalized LDA correctly identifies the sparsity of the Bayes direction vector with probability tending to one, and numerical studies show better finite-sample performance than LASSO.
Keywords : high-dimensional LDA, LASSO, MCP, moderately clipped LASSO
1. Introduction

Linear discriminant analysis (LDA) requires an estimate of the inverse conditional covariance matrix, for which the pooled sample covariance matrix is a manageable solution. However, the pooled sample covariance matrix is singular when the model is high-dimensional, where the number of predictive variables exceeds the sample size. Without resolving the singularity, the performance of the LDA deteriorates (Krzanowski et al., 1995) and is asymptotically no better than random guessing (Bickel and Levina, 2004). A large body of literature has focused on this issue and suggested alternatives that directly modify the singular pooled sample covariance matrix or construct relevant shrunken centroid means (Guo et al., 2006; Fan and Fan, 2008; Wu et al., 2009; Cai and Liu, 2011; Witten and Tibshirani, 2011; Clemmensen et al., 2011; Shao et al., 2011).

Meanwhile, there is a direct connection between the LDA and linear regression since the direction vector of the LDA can be obtained by using the least square estimation (LSE) (Hastie et al., 2009). However, this connection is lost when the pooled sample covariance matrix is singular, which motivated Mai et al. (2012) to study penalized LDA with the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996). They proved that LASSO penalized LDA finds all the features that significantly contribute to the classification when the model is high-dimensional and sparse, a notable theoretical result for the high-dimensional penalized LDA, and they provided various numerical studies confirming that LASSO penalized LDA performs better than other competitors.

In this paper, we propose penalized LDA with the moderately clipped LASSO (MCL) (Kwon et al., 2015) as an alternative to LASSO. The LASSO has been known to have higher prediction accuracy than the minimax concave penalty (MCP) (Zhang, 2010) for finite samples since the shrinkage effect can increase prediction accuracy (Efron and Morris, 1975; Casella, 1985; Zhang and Huang, 2008). However, LASSO has been shown to select unnecessary predictive variables even in low-dimensional linear regression (Zou, 2006). The MCL was designed to recover the ideal performance of LASSO by indexing a class of non-convex penalties ranging from LASSO to MCP; therefore MCL can select relevant predictive variables as MCP does while keeping the prediction accuracy of LASSO.

We prove that, with probability tending to one, MCL penalized LDA coincides with the oracle LASSO, a theoretically optimal LASSO obtained by using the relevant predictive variables only. The same equivalence can be obtained using LASSO penalized LDA, as proved by Mai et al. (2012); however, MCL does not require the Strong Irrepresentable condition (Zhao and Yu, 2006) on the marginal covariance matrix, which represents a potential theoretical advantage of MCL over LASSO. We provide various numerical studies showing how MCL performs with finite samples as an alternative to LASSO for the high-dimensional penalized LDA.

The rest of the paper is organized as follows. Section 2 introduces the penalized LDA. Section 3 introduces MCL and related statistical properties. Section 4 shows the results of numerical studies. Relevant proofs are given in the Appendix.

2. Penalized linear discriminant analysis

### 2.1. Linear discriminant analysis and least square estimation

Let X ∈ ℝp be a p-dimensional random vector of predictive variables and C ∈ {1, 2} a random class label to be identified. Given X = x, the Bayes classifier becomes

$\varphi^{\mathrm{Bayes}}(x)=\arg\max_{c\in\{1,2\}}P(C=c\mid X=x),$

which minimizes the misclassification error, P(C ≠ φ(X)), over the set of classifiers φ : ℝp → {1, 2}. The LDA (Fisher, 1936) assumes that

$X\mid C=c\sim N_{p}(\mu_{c},\Sigma),\quad c\in\{1,2\},\qquad(2.1)$

where μc and Σ denote the mean vector and the common covariance matrix of the p-dimensional normal distribution. Then the Bayes classifier is equivalent to the Bayes discriminant rule: the class of C given X = x becomes 2 if x satisfies

$\left(x-\frac{\mu_{1}+\mu_{2}}{2}\right)^{T}\beta^{\mathrm{Bayes}}+\log\left(\frac{\pi_{2}}{\pi_{1}}\right)>0,$

where πc = P(C = c), c ∈ {1, 2} are the class probabilities and

$\beta^{\mathrm{Bayes}}=\Sigma^{-1}\theta$

is the Bayes direction vector with θ = μ2 − μ1.

Let ci ∈ {1, 2} and xi = (xi1, . . . , xip)T ∈ ℝp, i ≤ n, be n samples of the class label and predictive variables. For each class label c ∈ {1, 2}, let $\hat\Sigma_{c}=\sum_{c_{i}=c}(x_{i}-\hat\mu_{c})(x_{i}-\hat\mu_{c})^{T}/(n_{c}-1)$ be the sample covariance matrix, $\hat\mu_{c}=\sum_{c_{i}=c}x_{i}/n_{c}$ the sample mean vector, and $n_{c}=\sum_{i=1}^{n}I(c_{i}=c)$ the number of samples with class label c. The LDA estimates the Bayes direction vector with the LDA direction vector:

$\hat\beta^{\mathrm{LDA}}=\hat\Sigma^{-1}\hat\theta,\qquad(2.2)$

where $\hat\Sigma=\sum_{c\in\{1,2\}}(n_{c}-1)\hat\Sigma_{c}/(n-2)$ is the pooled sample covariance matrix and θ̂ = μ̂2 − μ̂1. These arguments lead to the linear discriminant rule: the class of C given X = x becomes 2 if x satisfies

$\left(x-\frac{\hat\mu_{1}+\hat\mu_{2}}{2}\right)^{T}\hat\beta^{\mathrm{LDA}}+\log\left(\frac{n_{2}}{n_{1}}\right)>0,\qquad(2.3)$

where the class probabilities are estimated by the sample class proportions, π̂c = nc/n, c ∈ {1, 2}.

There is an intimate connection (Hastie et al., 2009) between the LDA and LSE when p ≤ n:

$\hat\beta^{\mathrm{LSE}}=c\,\hat\beta^{\mathrm{LDA}}\qquad(2.4)$

for some constant c > 0, where

$(\hat\alpha^{\mathrm{LSE}},\hat\beta^{\mathrm{LSE}})=\arg\min_{\alpha,\beta}\sum_{i=1}^{n}\frac{(y_{i}-\alpha-x_{i}^{T}\beta)^{2}}{2n}\qquad(2.5)$

and $y_{i}=(-1)^{c_{i}}n/n_{c_{i}}$, i ≤ n. Hence the linear discriminant rule in (2.3) is the same as:

$\left(x-\frac{\hat\mu_{1}+\hat\mu_{2}}{2}\right)^{T}\hat\beta^{\mathrm{LSE}}+c\log\left(\frac{n_{2}}{n_{1}}\right)>0,\qquad(2.6)$

which implies that we can cast the LDA into the framework of the LSE.
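The identity above can be checked numerically. The sketch below is our own illustration, not the paper's code: it codes the responses as y_i = (−1)^{c_i} n/n_{c_i}, fits ordinary least squares, and verifies that the fitted slope vector is a positive multiple of Σ̂⁻¹θ̂.

```python
import numpy as np

# Numerical check of the LDA-LSE connection (our own sketch): with
# p <= n, regressing the coded responses y_i = (-1)^{c_i} n / n_{c_i}
# on X by least squares gives a slope vector proportional, with a
# positive constant, to the LDA direction Sigma_hat^{-1} theta_hat.
rng = np.random.default_rng(0)
n1, n2, p = 60, 40, 5
n = n1 + n2
X = np.vstack([rng.normal(0.0, 1.0, (n1, p)),
               rng.normal(0.5, 1.0, (n2, p))])
c = np.array([1] * n1 + [2] * n2)

# pooled sample covariance and mean difference
mu1, mu2 = X[c == 1].mean(axis=0), X[c == 2].mean(axis=0)
S1 = np.cov(X[c == 1], rowvar=False)
S2 = np.cov(X[c == 2], rowvar=False)
Sigma_hat = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n - 2)
beta_lda = np.linalg.solve(Sigma_hat, mu2 - mu1)

# least squares with the coded responses
y = np.where(c == 1, -n / n1, n / n2)
D = np.column_stack([np.ones(n), X])
beta_lse = np.linalg.lstsq(D, y, rcond=None)[0][1:]

# the two directions agree up to a common positive scalar
ratio = beta_lse / beta_lda
```

The proportionality is an exact algebraic identity, so the component-wise ratio is constant up to floating-point error.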

### 2.2. Penalized linear discriminant analysis and least square estimation

The equations in (2.2) and (2.4) fail to hold when p > n since the pooled sample covariance matrix is singular, which raises the challenging problem of estimating the Bayes direction vector. To address this problem, Mai et al. (2012) proposed estimating the Bayes direction vector by using LASSO:

$(\hat\alpha^{\lambda},\hat\beta^{\lambda})=\arg\min_{\alpha,\beta}\left\{\sum_{i=1}^{n}\frac{(y_{i}-\alpha-x_{i}^{T}\beta)^{2}}{2n}+\lambda\sum_{j=1}^{p}|\beta_{j}|\right\},$

for some λ > 0, where the solution is defined even when p > n. Let Z = (Z1, . . . , Zp) be the centered design matrix for the LSE in (2.5), where Zj = (I − Π1)Xj with Xj = (x1j, . . . , xnj)T, j ≤ p, and Π1 is the projection matrix onto 1 = (1, . . . , 1)T ∈ ℝn. Then the solution can be obtained by

$\hat\beta^{\lambda}=\arg\min_{\beta}\left\{\frac{\beta^{T}Z^{T}Z\beta}{2n}-\hat\theta^{T}\beta+\lambda\sum_{j=1}^{p}|\beta_{j}|\right\}.\qquad(2.7)$

In addition, Mai et al. (2012) constructed an optimal discriminant rule for the penalized estimation: the class of C given X = x becomes 2 if x satisfies

$\left(x-\frac{\hat\mu_{1}+\hat\mu_{2}}{2}\right)^{T}\hat\beta^{\lambda}+\frac{\hat\beta^{\lambda T}\hat\Sigma\hat\beta^{\lambda}}{\hat\theta^{T}\hat\beta^{\lambda}}\log\left(\frac{n_{2}}{n_{1}}\right)>0,\qquad(2.8)$

whenever θ̂Tβ̂λ > 0. Note that the classification rule in (2.8) can be used with any linear classifier; see Proposition 2 in Mai et al. (2012) for details. For example, the optimal discriminant rule in (2.8) reduces to the linear discriminant rule in (2.6) when p ≤ n and λ = 0, since β̂λ = β̂LSE = cΣ̂−1θ̂ implies β̂λTΣ̂β̂λ/θ̂Tβ̂λ = c.
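As a concrete illustration, the optimal rule in (2.8) can be written as a small function. This is a hedged sketch with our own names and arguments, not the authors' implementation; beta_hat can be any direction estimate satisfying θ̂Tβ̂ > 0.

```python
import numpy as np

# Hedged sketch of the optimal discriminant rule in (2.8): a point x
# is assigned to class 2 when the score below is positive. All names
# here are illustrative, not the authors' code.
def discriminant_rule(x, beta_hat, mu1, mu2, Sigma_hat, n1, n2):
    theta_hat = mu2 - mu1
    # intercept adjustment beta^T Sigma beta / theta^T beta from (2.8)
    adj = beta_hat @ Sigma_hat @ beta_hat / (theta_hat @ beta_hat)
    score = (x - (mu1 + mu2) / 2) @ beta_hat + adj * np.log(n2 / n1)
    return 2 if score > 0 else 1
```

With balanced classes (n1 = n2) the log term vanishes and the rule reduces to the midpoint rule; when β̂ = cΣ̂−1θ̂ the adjustment factor equals c, recovering (2.6).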

3. Penalized LDA with moderately clipped LASSO

### 3.1. Definition

A natural extension of LASSO in (2.7) is the non-convex penalized estimation. For example, we can use the MCP (Zhang, 2010), which has been preferred over LASSO for variable selection due to the oracle property (Fan and Li, 2001; Kim et al., 2008; Zhang, 2010). However, LASSO has been preferred over MCP for prediction since the shrinkage effect often produces better prediction accuracy (Efron and Morris, 1975). As an alternative, we propose to use MCL (Kwon et al., 2015) for the penalized LDA:

$\hat\beta^{\lambda,\gamma}=\arg\min_{\beta}Q_{\lambda,\gamma}(\beta),\qquad(3.1)$

where

$Q_{\lambda,\gamma}(\beta)=\frac{\beta^{T}Z^{T}Z\beta}{2n}-\hat\theta^{T}\beta+\sum_{j=1}^{p}J_{\lambda,\gamma}(|\beta_{j}|)$

and Jλ,γ is MCL that satisfies Jλ,γ(0) = 0 and

$\frac{dJ_{\lambda,\gamma}(t)}{dt}=\nabla J_{\lambda,\gamma}(t)=\max\left\{\lambda-\frac{t}{a},\gamma\right\},\quad t>0$

for some a > 1 and λ ≥ γ ≥ 0.

Note that ∇Jλ,γ(t) = λ − t/a for t < a(λ − γ) and ∇Jλ,γ(t) = γ for t ≥ a(λ − γ). Hence, MCL is a smooth interpolation between MCP and LASSO with two tuning parameters λ ≥ 0 and 0 ≤ γ ≤ λ. The two tuning parameters play different roles. First, λ controls the concavity of MCL near the origin for a fixed γ, as it does in MCP; in the right panel of Figure 1, the concavity of MCL increases as λ does. Second, γ regulates the amount of shrinkage in large non-zero regression coefficients for a fixed λ, as it does in LASSO, which is illustrated in the left panel of Figure 1. These imply that MCL can control the sparsity and the shrinkage effect simultaneously. Hence, we can expect an estimator that balances between MCP and LASSO for given finite samples, as studied in Kwon et al. (2015) for high-dimensional linear regression.
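The interpolation is easy to see from the derivative itself. Below is a minimal sketch (our own code) of ∇Jλ,γ(t) = max{λ − t/a, γ}: setting γ = λ recovers the constant LASSO derivative, γ = 0 recovers the MCP derivative, and intermediate γ applies the constant LASSO-type shrinkage γ to coefficients with t ≥ a(λ − γ).

```python
import numpy as np

# Minimal sketch of the MCL penalty derivative in (3.1):
# grad J_{lambda,gamma}(t) = max(lambda - t/a, gamma) for t > 0.
def mcl_grad(t, lam, gam, a=2.1):
    """Derivative of the MCL penalty at t > 0."""
    return np.maximum(lam - np.asarray(t) / a, gam)

t = np.linspace(0.01, 5.0, 100)
lasso_like = mcl_grad(t, 1.0, 1.0)       # gamma = lambda: constant 1.0
mcp_like = mcl_grad(t, 1.0, 0.0, a=2.0)  # gamma = 0: vanishes for t >= a*lambda
```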

### 3.2. Asymptotic properties

In this subsection, we provide some asymptotic properties of MCL. We assume that there is a nonempty subset A ⊂ {1, . . . , p} such that βjBayes ≠ 0 for j ∈ A and βjBayes = 0 for j ∈ N = Ac. The main results imply that MCL is asymptotically equivalent to a theoretical estimator, the oracle LASSO:

$\hat\beta^{oL,\gamma}=\arg\min_{\beta_{j}=0,\ j\in N}\left\{\frac{\beta^{T}Z^{T}Z\beta}{2n}-\hat\theta^{T}\beta+\gamma\sum_{j=1}^{p}|\beta_{j}|\right\},\qquad(3.2)$

for some γ ≥ 0. Note that the oracle LASSO is simply the oracle LSE (Fan and Li, 2001) if γ = 0. The oracle LASSO is not available in practice since A is unknown. However, it plays an important role in developing the asymptotic properties of MCL, as studied in Mai et al. (2012) and Kwon et al. (2015). Similar frameworks appear in other studies on the non-convex penalized estimation (Kim et al., 2008; Zhang, 2010; Kim and Kwon, 2012).

Before proceeding, we define some notation. For any vector a = (ak, k ≤ s) and subset D ⊂ {1, . . . , s}, let $\|a\|_{1}=\sum_{k=1}^{s}|a_{k}|$, $\|a\|_{2}=(\sum_{k=1}^{s}a_{k}^{2})^{1/2}$, ‖a‖∞ = max_{k≤s} |ak|, supp(a) = {k : ak ≠ 0}, and aD = (ak, k ∈ D). For any matrix A = (Akj, k ≤ s, j ≤ t) and subsets D ⊂ {1, . . . , s} and E ⊂ {1, . . . , t}, let λmin(A) be the minimum eigenvalue of a symmetric A, $\|A\|_{\infty}=\max_{k\le s}\sum_{j=1}^{t}|A_{kj}|$, ADE = (Akj, k ∈ D, j ∈ E), and |D| the cardinality of D.

We first introduce a lemma that gives sufficient conditions for the uniqueness of a minimizer of Qλ,γ when p ≤ n. Let Ξλ,γ be the set of all local minimizers of Qλ,γ and ρ = λmin(ZTZ/n).

Lemma 1

If β̂ satisfies

$\hat\theta_{j}-\frac{Z_{j}^{T}Z\hat\beta}{n}=\mathrm{sign}(\hat\beta_{j})\nabla J_{\lambda,\gamma}(|\hat\beta_{j}|),\ j\in S,\quad\text{and}\quad\left|\hat\theta_{j}-\frac{Z_{j}^{T}Z\hat\beta}{n}\right|\le\lambda,\ j\in S^{c},$

then {β̂} = Ξλ,γ, provided that ρ > 1/a, where S = supp(β̂).

Remark 1

Lemma 1 is a slight modification of the second-order Karush-Kuhn-Tucker sufficient conditions for Qλ,γ, whose proof can be found in the literature (Fan et al., 2014; Kwon et al., 2015). Let ∇̃2Jλ,γ(t), t > 0, be the maximum concavity (Zhang, 2010) of the penalty Jλ,γ, that is,

$\tilde\nabla^{2}J_{\lambda,\gamma}(t)=\lim_{\varepsilon\to0^{+}}\inf_{t-\varepsilon<t_{1}<t_{2}<t+\varepsilon}\frac{\nabla J_{\lambda,\gamma}(t_{2})-\nabla J_{\lambda,\gamma}(t_{1})}{t_{2}-t_{1}},\quad t>0.$

Note that Qλ,γ is globally convex if ρ + inft>0 ∇̃2Jλ,γ(t) = ρ − 1/a > 0, which implies β̂ is a unique minimizer of Qλ,γ.

Lemma 1 implies that Qλ,γ has a unique minimizer when p ≤ n, although Jλ,γ is not convex. However, when p > n, Lemma 1 fails to hold for any a > 1 since Z does not have full rank. Therefore, we present a slightly weaker result under the sparse Riesz condition (Zhang, 2010; Kim and Kwon, 2012; Kim et al., 2016). The next lemma shows that there exists a unique minimizer in a restricted parameter space, which is a direct application of Theorem 3 in Kim and Kwon (2012). Let $\Xi_{\lambda,\gamma}^{\kappa}=\{\beta\in\Xi_{\lambda,\gamma}:|\mathrm{supp}(\beta)|\le\kappa\}$ and $\hat\rho_{\kappa}^{src}=\min_{|D|\le2\kappa}\lambda_{\min}(Z_{D}^{T}Z_{D}/n)$ for some κ > 0.

Lemma 2

If β̂ satisfies

$\hat\theta_{j}-\frac{Z_{j}^{T}Z\hat\beta}{n}=\mathrm{sign}(\hat\beta_{j})\nabla J_{\lambda,\gamma}(|\hat\beta_{j}|),\ j\in S,\quad\text{and}\quad\left|\hat\theta_{j}-\frac{Z_{j}^{T}Z\hat\beta}{n}\right|\le\lambda,\ j\in S^{c},$

then $\{\hat\beta\}=\Xi_{\lambda,\gamma}^{\kappa}$, provided that $\hat\rho_{\kappa}^{src}>1/a$ and |S| ≤ κ, where S = supp(β̂).

Lemma 2 gives sufficient conditions for a minimizer to be unique in $Ξλ,γκ$ under the Sparse Riesz condition, and we can construct similar conditions for the oracle LASSO in (3.2) by adding one more condition.

Lemma 3

If β̂oL,γ satisfies

$|\hat\beta_{j}^{oL,\gamma}|>a(\lambda-\gamma),\ j\in A,\quad\hat\theta_{j}-\frac{Z_{j}^{T}Z\hat\beta^{oL,\gamma}}{n}=\gamma\,\mathrm{sign}(\hat\beta_{j}^{oL,\gamma}),\ j\in A,\quad\text{and}\quad\left|\hat\theta_{j}-\frac{Z_{j}^{T}Z\hat\beta^{oL,\gamma}}{n}\right|\le\lambda,\ j\in A^{c},$

then $\{\hat\beta^{oL,\gamma}\}=\Xi_{\lambda,\gamma}^{\kappa}$, provided that $\hat\rho_{\kappa}^{src}>1/a$ and |A| ≤ κ.

Remark 2

Note that the second condition in Lemma 3 is equivalent to the first condition in Lemma 2 under the first condition in Lemma 3, so that Lemma 3 is a corollary of Lemma 2.

We now present the main results, which show that the oracle LASSO satisfies the conditions in Lemma 3 asymptotically, so that the oracle LASSO is equivalent to MCL defined in (3.1). Let Ω = Cov(X) be the marginal covariance matrix of X, and let β* be the vector that satisfies $\beta_{A}^{*}=\Omega_{AA}^{-1}\theta_{A}$ and $\beta_{N}^{*}=0$. For the study, we need the regularity conditions below:

• (C1) There are positive constants, bi, i ≤ 3, such that

$\|\Omega_{NA}\Omega_{AA}^{-1}\|_{\infty}\le b_{1},\quad\|\Omega_{AA}^{-1}\|_{\infty}\le b_{2},\quad\text{and}\quad\|\theta_{A}\|_{\infty}\le b_{3}.$

• (C2) The model and tuning parameters satisfy

$q=o\left(nm_{A}^{2}\right),\quad\log p=o\left(\frac{nm_{A}^{2}}{q^{2}}\right),\quad\lambda=o(m_{A}),\quad\gamma=o(\lambda),\quad\text{and}\quad nm_{A}^{2}\to\infty$

as n→∞, where q = |A| and $m_{A}=\min_{j\in A}|\beta_{j}^{*}|$.

• (C3) There exists a sequence κ with q ≤ κ ≤ n/2 such that

$\liminf_{n\to\infty}\rho_{\kappa}^{src}>\frac{1}{a},\quad\log p=o\left(\frac{n}{\kappa^{3}}\right),\quad\text{and}\quad\frac{n}{\kappa^{2}}\to\infty$

as n→∞, where $\rho_{\kappa}^{src}=\min_{|D|\le2\kappa}\lambda_{\min}(\Omega_{DD})$.

Remark 3

Condition (C1) is given by Mai et al. (2012), who require b1 < 1 for LASSO penalized LDA to be selection consistent. The restriction corresponds to the Strong Irrepresentable condition given by Zhao and Yu (2006) for LASSO penalized linear regression. Condition (C2) includes technical assumptions that are standard in the non-convex penalized regression literature; see Remark 5. Note that β* is a scaled version of βBayes for the LSE framework that satisfies β* = cβBayes for some constant c > 0; see Proposition 3 in Mai et al. (2012) for details. Condition (C3) assumes that any principal sub-matrix ΩDD of Ω is non-singular whenever |D| ≤ 2κ, which corresponds to the sparse Riesz condition in Zhang (2010) imposed on the design matrix in linear regression.

Theorem 1

Under (C1)–(C3), the oracle LASSO is the unique minimizer of Qλ,γ in the restricted parameter space with probability tending to one, in the sense that

$\lim_{n\to\infty}P\left(\{\hat\beta^{oL,\gamma}\}=\Xi_{\lambda,\gamma}^{\kappa}\right)=1.$
Remark 4

Theorem 1 holds for γ = 0, which proves that the oracle LSE becomes the unique minimizer of Qλ,γ when the penalty is the MCP.

Remark 5

The conditions (C2) and (C3) can be simplified as

$m_{A}\gg\lambda\gg q\sqrt{\frac{\log p}{n}}$

if n ≫ κ3 log p, where a ≫ b means a/b→∞ as n→∞. However, we only need

$m_{A}\gg\lambda\gg\sqrt{\frac{\log p}{n}},$

for the high-dimensional linear regression (Kwon et al., 2015). The difference in the minimum signal sizes, up to a factor of q, arises because the design matrix is high-dimensional and random for the LDA.

4. Numerical studies

In this section, we present the results of numerical studies, including simulations and real data analysis. We obtained all the penalized estimators using the R package ncpen (Kim et al., 2020), which was developed based on the concave-convex procedure (Yuille and Rangarajan, 2002) and the coordinate descent algorithm (Mazumder et al., 2011).

### 4.1. Simulation studies

The simulation studies were based on the conditional distribution in (2.1) with various scenarios. We considered two different class probabilities: π1 = 0.5 (balanced case) and π1 = 0.67 (unbalanced case). Given the class label, we also considered two different conditional covariance matrices: Σ = Σ⁽¹⁾ and Σ = Σ⁽²⁾, where $\Sigma_{jk}^{(1)}=I(j=k)$ (no correlation) and $\Sigma_{jk}^{(2)}=0.5^{|j-k|}$, j, k ≤ p (power decaying correlation). For the conditional mean vectors, we set μ1 = 0p and μ2 = ΣβBayes, where $\beta_{j}^{Bayes}=\alpha(-1)^{j}I(j\le q)$, j ≤ p, and α was chosen so that the Bayes misclassification error rate equals 0.2 for all scenarios.
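The data-generating design above can be sketched as follows. This is our own illustration with a smaller p and a fixed α (rather than calibrating α to the 0.2 Bayes error); as we read the design, μ2 = ΣβBayes so that βBayes = Σ⁻¹(μ2 − μ1).

```python
import numpy as np

# Illustrative sketch of one scenario (our own code): draw labels with
# P(C = 1) = pi1, then X | C = c ~ N_p(mu_c, Sigma) with the power
# decaying covariance Sigma_jk = 0.5^{|j-k|}, mu_1 = 0, and
# mu_2 = Sigma @ beta_bayes where beta_bayes_j = alpha*(-1)^j*I(j <= q).
rng = np.random.default_rng(1)
n, p, q, pi1, alpha = 300, 50, 5, 0.5, 0.6

j = np.arange(1, p + 1)
Sigma = 0.5 ** np.abs(j[:, None] - j[None, :])
beta_bayes = alpha * (-1.0) ** j * (j <= q)   # alpha * (-1)^j * I(j <= q)
mu1, mu2 = np.zeros(p), Sigma @ beta_bayes

c = rng.choice([1, 2], size=n, p=[pi1, 1 - pi1])
L = np.linalg.cholesky(Sigma)                 # Sigma = L @ L.T
X = np.where(c[:, None] == 1, mu1, mu2) + rng.standard_normal((n, p)) @ L.T
```

By construction, only the first q coordinates of βBayes = Σ⁻¹(μ2 − μ1) are non-zero.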

We set n ∈ {300, 600, 900}, p ∈ {1000, 2000}, and q ∈ {5, 10} to compare finite sample performance of MCL with LASSO, including MCP as a special case of MCL. We considered three cases for MCL: MCLk with a = 2.1 and γ = kλ̂opt, k ∈ {1, 2, 3}, where λ̂opt is the best tuning parameter value for LASSO (Kwon et al., 2015). We also considered three cases of MCP: MCPk with a = k + 0.1, k ∈ {1, 2, 3} and γ = 0.

For each method, we used n training samples for direction vector estimation and n independent validation samples for selecting the tuning parameter λ. We measured the selection performance by the number of true positive selections (TPS), the number of false positive selections (FPS), and the indicator of correct model identification (CMI): $TPS=\sum_{j\le q}I(\hat\beta_{j}\ne0,\beta_{j}^{Bayes}\ne0)$, $FPS=\sum_{j>q}I(\hat\beta_{j}\ne0,\beta_{j}^{Bayes}=0)$, and CMI = I(TPS = q, FPS = 0). Furthermore, we compared the misclassification error rate (ERR) obtained from 2n independent test samples by using the classification rule in (2.8). We repeated each simulation 200 times and summarized the averages of the four measures in Tables 1 and 2, with graphical illustrations in Figures 2 and 3.

Table 1 shows the results when Σ = Σ⁽¹⁾. First, TPS converged to q as n increased for all methods, while LASSO performed best for each fixed n. Second, FPS decreased as n increased only for MCL and MCP, leading to an increase in CMI. Hence, the results support our theory: MCL and MCP correctly fitted the model as n increased, while LASSO overfitted regardless of the sample size. These trends continued even under class imbalance and an increase in dimension. Third, ERR tended to decrease as n increased for all methods. In particular, MCL1 attained the lowest ERR for each fixed n, followed by MCL2 and LASSO.

Table 2 shows the results when Σ = Σ⁽²⁾, where the MCPs performed best on all measures. All selection performance measures deteriorated relative to Table 1, but the patterns were similar in that MCL and MCP fitted the model correctly while LASSO overfitted. For ERR, the MCPs performed best and MCL1 ranked second in all cases.

We conclude that MCL can be a good alternative to LASSO for the high-dimensional penalized LDA: MCL can correctly identify the sparse Bayes direction vector while keeping almost the same prediction accuracy as LASSO. MCL1 performed well regardless of the simulation designs considered in this paper, which aligns with the heuristic recommendation of γ = λ̂opt for linear regression (Kwon et al., 2015).

### 4.2. Analysis of micro-array samples

The R package datamicroarray (John, 2016) provides a collection of high-dimensional microarray data sets. We chose four data sets (Burczynski et al., 2006; Chin et al., 2006; Chowdary et al., 2006; Gordon et al., 2002) to illustrate how MCL performs on real samples. We first applied the sure independence screening procedure (Fan and Song, 2010) so that only the top d ∈ {400, 800, 1600} predictive variables with the largest marginal regression coefficients were used for the penalized LDA. For comparison, we applied the leave-one-out cross-validation procedure to each data set and calculated the number of incorrectly classified samples (errors) and the average number of non-zero regression coefficients (sizes), where the tuning parameters were selected by 10-fold cross-validation.
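The screening step can be sketched as follows. This is a minimal illustration of marginal screening as we read it, with our own function and variable names (the analysis itself was done in R, not Python): predictors are ranked by the magnitude of their marginal univariate regression coefficients and the top d are kept.

```python
import numpy as np

# Minimal sketch of the sure independence screening step (Fan and
# Song, 2010) as used here: keep the d predictors with the largest
# absolute marginal regression coefficients. Names are ours.
def sis_screen(X, y, d):
    Xc = X - X.mean(axis=0)                      # center each predictor
    yc = y - y.mean()
    slopes = Xc.T @ yc / (Xc ** 2).sum(axis=0)   # marginal OLS slopes
    return np.argsort(-np.abs(slopes))[:d]       # indices of the top d

# toy check: y depends only on column 0, which should be ranked first
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(100)
keep = sis_screen(X, y, 5)
```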

Table 3 summarizes the results. In most cases, LASSO had the best prediction accuracy while selecting the most variables, whereas the MCPs had the worst prediction accuracy but selected the fewest variables. The best prediction accuracy occurred at d = 400 in all cases, where MCL1 had the same prediction accuracy as LASSO while selecting fewer variables. In addition, MCL2 selected far fewer variables than LASSO without losing much prediction accuracy.

5. Concluding remarks

In this paper, we studied the high-dimensional penalized LDA with MCL. By construction, MCL produces the same shrinkage effect as LASSO while making the selection process the same as that of MCP. Therefore, MCL shows similar or better prediction accuracy than LASSO while correctly recovering the sparsity of the direction vector. We proved that MCL is selection consistent under reasonable regularity conditions, which was supported by various numerical experiments. One disadvantage of MCL compared with LASSO is the additional tuning parameter γ, but the heuristic choice of γ = λ̂opt or γ = 2λ̂opt performed well in the numerical studies. Further research on MCL could focus on the theoretical justification of the choice of γ, which is not addressed in this paper.

Acknowledgement
We gratefully acknowledge the helpful comments of the Associate Editor and referees that substantially improved the paper. This paper was supported by Konkuk University in 2019.
Figures
Fig. 1. Various shapes of MCL with a = 2: λ = 1 for the left panel and γ = 0.1 for the right panel.
Fig. 2. Averages of the four measures when Σ = Σ⁽¹⁾.
Fig. 3. Averages of the four measures when Σ = Σ⁽²⁾.
Tables

### Table 1

Averages of the four measures when Σ = Σ⁽¹⁾. The left six columns are for p = 1000, q = 5 and the right six for p = 2000, q = 10; within each block, n = 300, 600, 900 under π1 = 0.50 and then under π1 = 0.67.

| Measure | Method | 300 | 600 | 900 | 300 | 600 | 900 | 300 | 600 | 900 | 300 | 600 | 900 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TPS | Bayes | 5 | 5 | 5 | 5 | 5 | 5 | 10 | 10 | 10 | 10 | 10 | 10 |
| | Lasso | 5.000 | 5.000 | 5.000 | 4.990 | 5.000 | 5.000 | 9.760 | 10.000 | 10.000 | 9.140 | 9.985 | 10.000 |
| | MCL1 | 4.995 | 5.000 | 5.000 | 4.945 | 5.000 | 5.000 | 9.025 | 9.970 | 10.000 | 8.110 | 9.790 | 9.995 |
| | MCL2 | 4.905 | 5.000 | 5.000 | 4.610 | 4.995 | 5.000 | 7.785 | 9.855 | 9.985 | 5.615 | 9.285 | 9.915 |
| | MCL3 | 3.975 | 4.980 | 5.000 | 2.645 | 4.795 | 4.995 | 3.405 | 8.205 | 9.640 | 1.225 | 5.210 | 8.255 |
| | MCP1 | 4.865 | 5.000 | 5.000 | 4.580 | 5.000 | 5.000 | 6.595 | 9.580 | 9.970 | 5.200 | 9.050 | 9.855 |
| | MCP2 | 4.965 | 5.000 | 5.000 | 4.900 | 5.000 | 5.000 | 8.450 | 9.940 | 9.995 | 7.440 | 9.710 | 9.975 |
| | MCP3 | 4.980 | 5.000 | 5.000 | 4.940 | 5.000 | 5.000 | 9.155 | 9.970 | 10.000 | 8.250 | 9.860 | 9.995 |
| FPS | Bayes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Lasso | 22.620 | 22.535 | 22.770 | 24.500 | 24.930 | 24.565 | 41.920 | 44.765 | 42.535 | 41.700 | 45.615 | 45.305 |
| | MCL1 | 1.655 | 1.555 | 1.605 | 2.170 | 1.895 | 2.435 | 5.060 | 1.590 | 1.110 | 9.765 | 2.670 | 1.200 |
| | MCL2 | 0.065 | 0.015 | 0.005 | 0.100 | 0.015 | 0.015 | 0.425 | 0.100 | 0.020 | 0.430 | 0.105 | 0.025 |
| | MCL3 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.015 | 0.000 | 0.000 | 0.010 | 0.000 | 0.000 |
| | MCP1 | 0.705 | 0.385 | 0.370 | 0.780 | 0.410 | 0.355 | 1.420 | 1.050 | 0.410 | 1.275 | 1.320 | 0.735 |
| | MCP2 | 6.720 | 3.795 | 1.985 | 7.180 | 5.175 | 2.835 | 8.560 | 12.285 | 9.380 | 8.885 | 12.270 | 10.625 |
| | MCP3 | 8.790 | 7.670 | 5.885 | 10.090 | 9.010 | 7.335 | 15.790 | 16.195 | 13.510 | 16.385 | 18.165 | 15.685 |
| CMI | Bayes | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | Lasso | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | MCL1 | 0.610 | 0.550 | 0.585 | 0.455 | 0.550 | 0.590 | 0.020 | 0.435 | 0.590 | 0.000 | 0.300 | 0.595 |
| | MCL2 | 0.855 | 0.985 | 0.995 | 0.620 | 0.980 | 0.985 | 0.060 | 0.790 | 0.965 | 0.000 | 0.430 | 0.900 |
| | MCL3 | 0.390 | 0.980 | 1.000 | 0.080 | 0.815 | 0.995 | 0.000 | 0.155 | 0.715 | 0.000 | 0.000 | 0.170 |
| | MCP1 | 0.475 | 0.725 | 0.725 | 0.235 | 0.695 | 0.735 | 0.000 | 0.230 | 0.670 | 0.000 | 0.100 | 0.415 |
| | MCP2 | 0.025 | 0.090 | 0.385 | 0.010 | 0.075 | 0.190 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.005 |
| | MCP3 | 0.000 | 0.025 | 0.065 | 0.005 | 0.025 | 0.065 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| ERR | Bayes | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| | Lasso | 0.2169 | 0.2064 | 0.2046 | 0.2204 | 0.2082 | 0.2036 | 0.2442 | 0.2168 | 0.2102 | 0.2571 | 0.2223 | 0.2116 |
| | MCL1 | 0.2085 | 0.2030 | 0.2031 | 0.2114 | 0.2043 | 0.2018 | 0.2372 | 0.2098 | 0.2056 | 0.2521 | 0.2153 | 0.2059 |
| | MCL2 | 0.2188 | 0.2054 | 0.2037 | 0.2261 | 0.2073 | 0.2027 | 0.2653 | 0.2202 | 0.2101 | 0.2769 | 0.2306 | 0.2132 |
| | MCL3 | 0.2576 | 0.2125 | 0.2059 | 0.2739 | 0.2213 | 0.2072 | 0.3466 | 0.2582 | 0.2259 | 0.3235 | 0.2816 | 0.2433 |
| | MCP1 | 0.2107 | 0.2020 | 0.2028 | 0.2159 | 0.2029 | 0.2008 | 0.2644 | 0.2127 | 0.2045 | 0.2706 | 0.2196 | 0.2051 |
| | MCP2 | 0.2163 | 0.2032 | 0.2027 | 0.2201 | 0.2048 | 0.2015 | 0.2559 | 0.2167 | 0.2078 | 0.2647 | 0.2240 | 0.2095 |
| | MCP3 | 0.2159 | 0.2052 | 0.2034 | 0.2189 | 0.2066 | 0.2028 | 0.2498 | 0.2165 | 0.2094 | 0.2600 | 0.2232 | 0.2102 |

TPS = number of true positive selections; FPS = number of false positive selections; CMI = correct model identification; ERR = classification error rate.

### Table 2

Averages of the four measures when Σ = Σ⁽²⁾. The left six columns are for p = 1000, q = 5 and the right six for p = 2000, q = 10; within each block, n = 300, 600, 900 under π1 = 0.50 and then under π1 = 0.67.

| Measure | Method | 300 | 600 | 900 | 300 | 600 | 900 | 300 | 600 | 900 | 300 | 600 | 900 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TPS | Bayes | 5 | 5 | 5 | 5 | 5 | 5 | 10 | 10 | 10 | 10 | 10 | 10 |
| | Lasso | 4.960 | 5.000 | 5.000 | 4.750 | 5.000 | 5.000 | 7.945 | 9.970 | 10.000 | 5.310 | 9.785 | 9.995 |
| | MCL1 | 4.860 | 5.000 | 5.000 | 4.430 | 4.990 | 5.000 | 6.145 | 9.815 | 9.985 | 4.100 | 8.915 | 9.940 |
| | MCL2 | 3.590 | 4.930 | 5.000 | 2.875 | 4.585 | 4.985 | 3.615 | 8.305 | 9.825 | 1.690 | 5.960 | 8.700 |
| | MCL3 | 2.225 | 3.125 | 4.435 | 1.785 | 2.585 | 3.395 | 1.620 | 3.460 | 5.385 | 0.355 | 2.410 | 3.425 |
| | MCP1 | 4.875 | 5.000 | 5.000 | 4.480 | 4.990 | 5.000 | 5.200 | 9.715 | 9.995 | 3.195 | 8.810 | 9.935 |
| | MCP2 | 4.960 | 5.000 | 5.000 | 4.850 | 5.000 | 5.000 | 6.675 | 9.960 | 10.000 | 4.570 | 9.635 | 9.995 |
| | MCP3 | 4.965 | 5.000 | 5.000 | 4.870 | 5.000 | 5.000 | 7.220 | 9.930 | 10.000 | 5.215 | 9.675 | 9.995 |
| FPS | Bayes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | Lasso | 42.995 | 43.620 | 44.890 | 43.260 | 49.450 | 47.895 | 64.935 | 89.360 | 90.535 | 39.150 | 92.305 | 97.670 |
| | MCL1 | 1.655 | 0.595 | 1.015 | 3.860 | 0.860 | 0.780 | 10.120 | 2.315 | 0.705 | 10.955 | 6.495 | 1.555 |
| | MCL2 | 0.780 | 0.165 | 0.150 | 0.675 | 0.390 | 0.140 | 2.360 | 2.040 | 0.595 | 0.860 | 2.585 | 1.680 |
| | MCL3 | 0.055 | 0.005 | 0.005 | 0.045 | 0.000 | 0.005 | 0.215 | 0.050 | 0.015 | 0.040 | 0.010 | 0.005 |
| | MCP1 | 0.790 | 0.400 | 0.320 | 1.310 | 0.435 | 0.375 | 1.570 | 1.265 | 0.440 | 1.340 | 2.010 | 0.830 |
| | MCP2 | 4.160 | 1.475 | 1.805 | 6.835 | 2.210 | 1.445 | 12.175 | 7.875 | 3.140 | 10.275 | 14.315 | 4.255 |
| | MCP3 | 13.330 | 4.360 | 2.815 | 17.530 | 6.785 | 3.180 | 27.175 | 26.075 | 11.665 | 22.630 | 36.215 | 19.085 |
| CMI | Bayes | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | Lasso | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | MCL1 | 0.340 | 0.665 | 0.630 | 0.110 | 0.605 | 0.620 | 0.000 | 0.205 | 0.605 | 0.000 | 0.010 | 0.340 |
| | MCL2 | 0.060 | 0.800 | 0.895 | 0.010 | 0.435 | 0.890 | 0.000 | 0.025 | 0.505 | 0.000 | 0.000 | 0.090 |
| | MCL3 | 0.000 | 0.080 | 0.605 | 0.000 | 0.015 | 0.145 | 0.000 | 0.000 | 0.015 | 0.000 | 0.000 | 0.000 |
| | MCP1 | 0.475 | 0.665 | 0.755 | 0.210 | 0.690 | 0.735 | 0.000 | 0.295 | 0.700 | 0.000 | 0.080 | 0.475 |
| | MCP2 | 0.100 | 0.480 | 0.595 | 0.050 | 0.400 | 0.555 | 0.000 | 0.030 | 0.255 | 0.000 | 0.000 | 0.095 |
| | MCP3 | 0.000 | 0.105 | 0.410 | 0.005 | 0.045 | 0.215 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| ERR | Bayes | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| | Lasso | 0.2600 | 0.2283 | 0.2215 | 0.2662 | 0.2326 | 0.2199 | 0.3353 | 0.2536 | 0.2317 | 0.3346 | 0.2681 | 0.2387 |
| | MCL1 | 0.2364 | 0.2176 | 0.2157 | 0.2465 | 0.2171 | 0.2125 | 0.3191 | 0.2283 | 0.2153 | 0.3234 | 0.2480 | 0.2176 |
| | MCL2 | 0.2896 | 0.2455 | 0.2302 | 0.2800 | 0.2503 | 0.2306 | 0.3570 | 0.2881 | 0.2504 | 0.3267 | 0.2909 | 0.2631 |
| | MCL3 | 0.3163 | 0.2988 | 0.2754 | 0.2965 | 0.2795 | 0.2717 | 0.4002 | 0.3480 | 0.3313 | 0.3320 | 0.3127 | 0.3048 |
| | MCP1 | 0.2214 | 0.2125 | 0.2121 | 0.2298 | 0.2105 | 0.2085 | 0.3044 | 0.2175 | 0.2095 | 0.3081 | 0.2311 | 0.2097 |
| | MCP2 | 0.2192 | 0.2127 | 0.2121 | 0.2226 | 0.2107 | 0.2085 | 0.2843 | 0.2134 | 0.2094 | 0.3030 | 0.2228 | 0.2081 |
| | MCP3 | 0.2272 | 0.2135 | 0.2120 | 0.2314 | 0.2122 | 0.2090 | 0.2903 | 0.2199 | 0.2114 | 0.3095 | 0.2307 | 0.2115 |

TPS = number of true positive selections; FPS = number of false positive selections; CMI = correct model identification; ERR = classification error rate.

### Table 3

Number of incorrectly classified samples (errors) and average of the model sizes

| Data set | Measure | d | LASSO | MCL1 | MCL2 | MCL3 | MCP1 | MCP2 | MCP3 |
|---|---|---|---|---|---|---|---|---|---|
| Chowdary et al. (2006), n = 104, p = 22283 | Errors | 400 | 1 | 1 | 2 | 2 | 4 | 4 | 5 |
| | | 800 | 2 | 2 | 2 | 4 | 3 | 6 | 3 |
| | | 1600 | 3 | 4 | 2 | 2 | 7 | 7 | 6 |
| | Sizes | 400 | 36.28 | 28.50 | 24.78 | 21.84 | 11.44 | 11.60 | 12.33 |
| | | 800 | 35.20 | 25.46 | 23.10 | 20.17 | 10.10 | 10.78 | 13.38 |
| | | 1600 | 37.29 | 27.85 | 24.75 | 21.06 | 9.31 | 10.63 | 10.53 |
| Gordon et al. (2002), n = 181, p = 12533 | Errors | 400 | 2 | 2 | 2 | 1 | 3 | 3 | 3 |
| | | 800 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| | | 1600 | 2 | 2 | 3 | 3 | 3 | 3 | 2 |
| | Sizes | 400 | 41.28 | 37.33 | 24.41 | 16.71 | 7.17 | 6.79 | 7.03 |
| | | 800 | 44.31 | 37.72 | 22.46 | 15.33 | 11.71 | 12.22 | 13.08 |
| | | 1600 | 49.06 | 42.38 | 28.46 | 25.50 | 19.31 | 16.50 | 15.75 |
| Burczynski et al. (2006), n = 127, p = 22283 | Errors | 400 | 6 | 6 | 7 | 9 | 15 | 13 | 15 |
| | | 800 | 8 | 9 | 13 | 16 | 13 | 14 | 17 |
| | | 1600 | 8 | 9 | 10 | 13 | 17 | 16 | 16 |
| | Sizes | 400 | 47.67 | 41.50 | 26.19 | 24.06 | 13.70 | 13.81 | 13.13 |
| | | 800 | 50.04 | 43.92 | 25.45 | 19.13 | 12.02 | 11.81 | 10.75 |
| | | 1600 | 49.35 | 43.69 | 22.50 | 18.59 | 17.14 | 17.83 | 14.77 |
| Chin et al. (2006), n = 118, p = 22215 | Errors | 400 | 12 | 11 | 13 | 15 | 21 | 22 | 22 |
| | | 800 | 15 | 16 | 16 | 18 | 21 | 21 | 20 |
| | | 1600 | 14 | 18 | 15 | 14 | 13 | 13 | 13 |
| | Sizes | 400 | 33.62 | 27.28 | 21.53 | 15.16 | 9.74 | 9.69 | 9.61 |
| | | 800 | 41.48 | 31.98 | 23.98 | 16.36 | 5.22 | 5.04 | 4.97 |
| | | 1600 | 33.22 | 18.94 | 13.81 | 12.80 | 4.00 | 4.11 | 4.80 |

References
1. Bickel PJ and Levina E (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10, 989-1010.
2. Burczynski ME, Peterson RL, and Twine NC, et al. (2006). Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics, 8, 51-61.
3. Cai T and Liu W (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106, 1566-1577.
4. Casella G (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39, 83-87.
5. Chin K, DeVries S, and Fridlyand J, et al. (2006). Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell, 10, 529-541.
6. Chowdary D, Lathrop J, and Skelton J, et al. (2006). Prognostic gene expression signatures can be measured in tissues collected in RNAlater preservative. The Journal of Molecular Diagnostics, 8, 31-39.
7. Clemmensen L, Hastie T, Witten D, and Ersbøll B (2011). Sparse discriminant analysis. Technometrics, 53, 406-413.
8. Efron B and Morris C (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70, 311-319.
9. Fan J and Fan Y (2008). High dimensional classification using features annealed independence rules. Annals of Statistics, 36, 2605-2637.
10. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.
11. Fan J and Song R (2010). Sure independence screening in generalized linear models with np-dimensionality. The Annals of Statistics, 38, 3567-3604.
12. Fan J, Xue L, and Zou H (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42, 819.
13. Fisher RA (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
14. Gordon GJ, Jensen RV, and Hsiao LL, et al. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62, 4963-4967.
15. Guo Y, Hastie T, and Tibshirani R (2006). Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 8, 86-100.
16. Hastie T, Tibshirani R, and Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
17. John AR (2016). datamicroarray: Collection of Data Sets for Classification (R package).
18. Kim D, Lee S, and Kwon S (2020). A unified algorithm for the non-convex penalized estimation: The ncpen package. The R Journal, accepted.
19. Kim Y, Choi H, and Oh HS (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association, 103, 1665-1673.
20. Kim Y, Jeon JJ, and Han S (2016). A necessary condition for the strong oracle property. Scandinavian Journal of Statistics, 43, 610-624.
21. Kim Y and Kwon S (2012). Global optimality of nonconvex penalized estimators. Biometrika, 99, 315-325.
22. Krzanowski W, Jonathan P, McCarthy W, and Thomas M (1995). Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 44, 101-115.
23. Kwon S, Lee S, and Kim Y (2015). Moderately clipped lasso. Computational Statistics & Data Analysis, 92, 53-67.
24. Mai Q, Zou H, and Yuan M (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99, 29-42.
25. Mazumder R, Friedman JH, and Hastie T (2011). Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106, 1125-1138.
26. Shao J, Wang Y, Deng X, and Wang S (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39, 1241-1265.
27. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
28. Witten DM and Tibshirani R (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 753-772.
29. Wu MC, Zhang L, Wang Z, Christiani DC, and Lin X (2009). Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25, 1145-1151.
30. Yuille AL and Rangarajan A (2002). The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems, 1033-1040.
31. Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894-942.
32. Zhang CH and Huang J (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36, 1567-1594.
33. Zhao P and Yu B (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541-2563.
34. Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.