Non-convex penalized estimation for the AR process

Okyoung Naa, Sunghoon Kwon1,b

aDepartment of Applied Statistics, Kyonggi University, Korea;
bDepartment of Applied Statistics, Konkuk University, Korea
Correspondence to: 1 Department of Applied Statistics, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Korea. E-mail: shkwon0522@konkuk.ac.kr
Received January 22, 2018; Revised April 3, 2018; Accepted April 4, 2018.
Abstract

We study how to distinguish the non-zero parameters of a sparse autoregressive (AR) process from zero using non-convex penalized estimation. A class of non-convex penalties is considered that includes the smoothly clipped absolute deviation and minimax concave penalties as special examples. We prove that the penalized estimators achieve standard theoretical properties, such as the weak and strong oracle properties, that have been established in the sparse linear regression framework. The results hold when the maximal order of the AR process increases to infinity and the minimal size of the true non-zero parameters decreases toward zero as the sample size increases. Further, we construct a practical method for selecting the tuning parameters using the generalized information criterion, whose minimizer asymptotically recovers the best theoretical non-penalized estimator of the sparse AR process. Simulation studies are given to confirm the theoretical results.

Keywords : autoregressive process, subset selection, non-convex penalty, oracle property, tuning parameter selection
1. Introduction

The autoregressive (AR) process is a basic and important process for time series data analysis. The usual least squares estimation may yield severe modeling biases when the AR model is sparse, that is, when it includes zero parameters. Various information criteria (Akaike, 1969, 1973, 1979; Schwarz, 1978; Hannan and Quinn, 1979; Claeskens and Hjort, 2003) have been proposed to identify the true non-zero parameters of the AR process, which we call the subset selection problem in this paper. Theoretical properties, such as asymptotic efficiency and selection consistency, of the final sub-process chosen by these information criteria have also been investigated (Shibata, 1976; Hannan and Quinn, 1979; Tsay, 1984; Claeskens and Hjort, 2003; Claeskens et al., 2007). Recently, Na (2017) introduced the generalized information criterion (GIC) (Kim et al., 2012) for the AR process, which includes most information criteria such as the Akaike information criterion (AIC) (Akaike, 1973), the Hannan-Quinn criterion (HQC) (Hannan, 1980) and the Bayesian information criterion (BIC) (Schwarz, 1978). Na (2017) proved that there is a large class of GICs that are selection consistent, including the BIC as an example. However, these approaches suffer from computational complexity since it is almost impossible to compare all the candidate sub-processes when the maximal order is very large (McClave, 1975; Sarkar and Kanjilal, 1995; Chen, 1999; McLeod and Zhang, 2006).

For years, penalized estimation has been studied as an alternative for the subset selection problem (Nardi and Rinaldo, 2011; Schmidt and Makalic, 2013; Sang and Sun, 2015; Kwon et al., 2017). Penalized estimation has nice asymptotic properties, such as selection consistency and minimax optimality, for various statistical models including the generalized linear regression model (Fan and Peng, 2004; Zou, 2006; Ye and Zhang, 2010; Kwon and Kim, 2012). A further advantage of penalized estimation is computational efficiency, since there exist many fast and efficient algorithms (Friedman et al., 2007; Kim et al., 2008; Lee et al., 2016). Hence we need not exhaustively search all the possible candidate sub-models when the AR process has a very large model order.

There are many possible penalty functions for penalized estimation, such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) and the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001). Some non-convex penalties, including the SCAD, the bridge (Huang et al., 2008; Kim et al., 2016) and the log penalty (Zou and Li, 2008; Kwon et al., 2016), have distinct advantages over the others. First, they produce unbiased estimators of the parameters, which helps us understand the final model without considering the penalty effect. Second, they exactly select the non-zero parameters with probability tending to one, which is impossible for other penalty functions such as the LASSO (Zhang, 2010b; Kim and Kwon, 2012; Zhang and Zhang, 2012) and the ridge.

In this paper, we study the subset selection problem by using the non-convex penalized estimation that has been used to identify non-zero parameters in various statistical models (Fan and Li, 2001; Zhang, 2010a; Kwon and Kim, 2012; Shen et al., 2013). A large class of non-convex penalties is considered that includes the SCAD and minimax concave penalty (MCP) (Zhang, 2010a) as examples (Kim and Kwon, 2012; Zhang and Zhang, 2012). We first prove several asymptotic properties of the non-convex penalized estimators, such as the weak and strong oracle properties, that are standard in the sparse linear regression framework (Kim et al., 2016). Second, we prove that optimal tuning parameters in the penalty can be chosen by using an information criterion whose minimizer exactly identifies the true zero and non-zero parameters asymptotically.

The results hold when the candidate maximal order of the AR process increases to infinity and the minimal size of the true non-zero parameters decreases toward zero as the sample size increases. Further, we consider a class of error processes for the AR process that includes independently and identically distributed (iid), autoregressive conditional heteroscedastic (ARCH) and generalized ARCH (GARCH) processes, which is large enough to cover most of the recent and related literature (Nardi and Rinaldo, 2011; Schmidt and Makalic, 2013; Sang and Sun, 2015; Kwon et al., 2017).

We introduce the non-convex penalized estimation for the AR process in Section 2. Asymptotic properties of the penalized estimator are presented in Section 3, introducing an information criterion for the tuning parameter selection. Simulation studies and proofs are given in Section 4 and Appendix, respectively.

2. Non-convex penalized estimation for the AR process

Consider the AR process {yt, t ∈ ℤ},

$yt-μ=∑j=1pβj(yt-j-μ)+ɛt, t∈ℤ,$

where p is a positive integer, μ ∈ ℝ and β = (β1, . . . , βp)T ∈ ℝp are parameters of interest, {ɛt, t ∈ ℤ} is an error process, and ℝ and ℤ are the sets of real numbers and integers, respectively. We assume that the process is sparse, that is, βj = 0 for some j ∈ SF, where SF = {1, . . . , p} denotes the full index set of regression parameters. In this case, we can estimate the true non-zero index set ST = {j ∈ SF : βj ≠ 0} by minimizing the penalized sum of squared residuals

$Lλ(u,b)=∑t=p+1n{(yt-u)-∑j=1pbj(yt-j-u)}2+2(n-p)∑j=1pJλ(∣bj∣)$

with respect to u ∈ ℝ and b = (b1, . . . , bp)T ∈ ℝp. Here, Jλ is a non-convex penalty with tuning parameter λ > 0 that is included in the penalty class defined in the next section.

Let $y¯j=∑t=p+1nyt-j/(n-p)$ for j = 0, 1, . . . , p. Then Lλ(u, b) can be decomposed as:

$Lλ(u,b)=LλA(u,b)+LλB(b),$

where $LλA(u,b)=(n-p){y¯0-∑j=1pbjy¯j-(1-∑j=1pbj)u}2$ and

$LλB(b)=∑t=p+1n{(yt-y¯0)-∑j=1pbj(yt-j-y¯j)}2+2(n-p)∑j=1pJλ(∣bj∣).$
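The decomposition can be checked numerically. A small Python sketch (our own; the penalty term is omitted because it appears identically on both sides of the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(30)
n, p = len(y), 3
u, b = 0.2, np.array([0.4, -0.1, 0.05])

# ybar_j = sum_{t=p+1}^{n} y_{t-j} / (n - p), for j = 0, ..., p
ybar = np.array([np.mean(y[p - j:n - j]) for j in range(p + 1)])

# L_lambda(u, b) without the penalty term
resid = np.array([(y[t] - u) - sum(b[j] * (y[t - j - 1] - u) for j in range(p))
                  for t in range(p, n)])
L = np.sum(resid ** 2)

# L_lambda^A(u, b): the part depending on u
LA = (n - p) * (ybar[0] - b @ ybar[1:] - (1 - b.sum()) * u) ** 2

# L_lambda^B(b) without the penalty term: residuals around the lagged means
residB = np.array([(y[t] - ybar[0])
                   - sum(b[j] * (y[t - j - 1] - ybar[j + 1]) for j in range(p))
                   for t in range(p, n)])
LB = np.sum(residB ** 2)

assert abs(L - (LA + LB)) < 1e-8  # the decomposition is exact
```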

Let (μ̂λ,A, β̂λ,B) be the minimizer of Lλ(u, b). Then it is easy to see that

$β^λ,B=(β^1λ,B,…,β^pλ,B)T=arg minb∈ℝp LλB(b)$

and $μ^λ,A=(y¯0-∑j=1pβ^jλ,By¯j)/(1-∑j=1pβ^jλ,B)$. Once β̂λ,B is obtained, we can estimate the true index set ST by using the set {$j:β^jλ,B≠0$ }. We often estimate μ by the sample mean $y¯=∑t=1nyt/n$ before estimating β, which is the same as estimating β based on the centered samples yt − ȳ, t = 1, . . . , n. In this case, the penalized estimator, β̂λ,C, can be defined as

$β^λ,C=(β^1λ,C,…,β^pλ,C)T=arg minb∈ℝp LλC(b),$

where $LλC(b)=Lλ(y¯,b)$.

In this paper, we define the penalized estimator of the regression parameter β as

$β^λ=(β^1λ,…,β^pλ)T=arg minb∈ℝp LλG(b;m),$

where

$LλG(b;m)=Q(b;m)+2(n-p)∑j=1pJλ(∣bj∣),$

m = (m0,m1, . . . ,mp)T ∈ ℝp+1 and

$Q(b;m)=∑t=p+1n{(yt-m0)-∑j=1pbj (yt-j-mj)}2.$

The definition is general enough to include the above two special cases: if we set m = (ȳ0, ȳ1, . . . , ȳp)T and m = (ȳ, . . . , ȳ)T, then $LλG(b;m)$ becomes $LλB(b)$ and $LλC(b)$, respectively.
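To fix notation, the following Python sketch (our own illustration, not from the paper) evaluates $LλG(b;m)$, i.e., Q(b; m) plus the penalty term, with the SCAD penalty of Fan and Li (2001) plugged in for Jλ:

```python
import numpy as np

def scad_penalty(x, lam, a=3.7):
    # SCAD penalty J_lambda(|x|) of Fan and Li (2001), evaluated elementwise.
    x = np.abs(x)
    return np.where(
        x <= lam, lam * x,
        np.where(x <= a * lam,
                 (2 * a * lam * x - x ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))

def penalized_objective(b, y, m, lam, penalty=scad_penalty):
    # L_lambda^G(b; m) = Q(b; m) + 2(n - p) * sum_j J_lambda(|b_j|),
    # where m = (m_0, m_1, ..., m_p) are the centering constants.
    n, p = len(y), len(b)
    resid = np.array([
        (y[t] - m[0]) - sum(b[j] * (y[t - j - 1] - m[j + 1]) for j in range(p))
        for t in range(p, n)])
    return np.sum(resid ** 2) + 2 * (n - p) * np.sum(penalty(np.asarray(b), lam))
```

With b = 0 and m = 0 the objective reduces to the raw sum of squares of y over t = p + 1, . . . , n, which gives a quick sanity check.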

3. Asymptotic properties

In this section, we investigate some asymptotic properties, such as the weak and strong oracle properties, that have been proved in the sparse linear regression framework. The results hold when p → ∞ and minj∈ST |βj| → 0 as n → ∞. Further, we propose the GIC that asymptotically recovers the best theoretical non-penalized estimator of the AR process.

3.1. Model assumptions and penalty class

We assume the following conditions (E1)–(E5):

• (E1) {yt, t ∈ ℤ} is stationary and there exists an absolutely summable sequence {ψi, i ∈ ℕ} such that

$yt-μ=ɛt+∑i=1∞ψiɛt-i, t∈ℤ,$

where ℕ is the set of natural numbers.

• (E2) {ɛt, t ∈ ℤ} is a white noise with mean 0 and positive variance $σɛ2$.

• (E3) {ɛt, t ∈ ℤ} is a sequence of martingale differences with respect to a filtration {ℱt, t ∈ ℤ}.

• (E4) {ɛt, t ∈ ℤ} takes the form

$ɛt=g(ηt,ηt-1,…),$

where ηt, t ∈ ℤ, are iid random variables and g is a measurable function.

• (E5) E(|ɛ1|ν) < ∞ and {ɛt, t ∈ ℤ} is ν-strong stable with ν ≥ 2 (Wu, 2005). Here, ν-strong stability means that

$Δνɛ=∑i=0∞δνɛ(i)<∞,$

where $δνɛ(i)={E(∣ɛi-g(ηi,…,η1,η0*,η-1,…)∣ν)}1/ν$ and {$ηt*$, t ∈ ℤ} is an iid copy of {ηt, t ∈ ℤ}.

The error processes satisfying conditions (E2)–(E5) include iid, ARCH and GARCH processes. For example, if the errors are iid with E(|ɛ1|ν) < ∞, then $Δνɛ=δνɛ(0)≤2{E(∣ɛ1∣ν)}1/ν<∞$ and E(ɛt|ɛt–1, ɛt–2, . . .) = 0, and consequently (E2)–(E5) hold. From Bollerslev (1986) and Wu (2011), GARCH processes also satisfy conditions (E2)–(E5). Let γ : ℤ → ℝ be the autocovariance function of the process {yt, t ∈ ℤ}. From (E1)–(E3), $γ(0)=∑i=0∞ψi2σɛ2∈(0,∞)$ and $γ(h)=∑i=0∞ψiψi+hσɛ2$ converges to 0 as h increases to ∞, where ψ0 = 1. Therefore, the autocovariance matrix Γp = (γ(i − j))1≤i,j≤p is positive definite by Proposition 5.1.1 of Brockwell and Davis (2006). Also, γ is an absolutely summable autocovariance function because of the absolute summability of {ψi, i ∈ ℕ}.
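For a concrete illustration, the autocovariances of an AR(1) process with coefficient 0.5 (so ψi = 0.5^i) can be computed from a truncated MA(∞) representation, and the resulting Γp is positive definite, as stated. A sketch in Python (our own code):

```python
import numpy as np

def autocovariance(psi, sigma2, h):
    # gamma(h) = sigma_eps^2 * sum_i psi_i * psi_{i+h}, with psi_0 = 1
    # and psi given as the sequence (psi_1, psi_2, ...), truncated.
    psi = np.concatenate(([1.0], np.asarray(psi)))
    return sigma2 * np.sum(psi[:len(psi) - h] * psi[h:])

# AR(1) with coefficient 0.5 has psi_i = 0.5**i; truncate the series.
psi = 0.5 ** np.arange(1, 200)

# Build the p x p autocovariance matrix Gamma_p for p = 5.
Gamma = np.array([[autocovariance(psi, 1.0, abs(i - j)) for j in range(5)]
                  for i in range(5)])
```

For this process γ(0) = 1/(1 − 0.25) = 4/3 and γ(1) = 2/3, and the smallest eigenvalue of Γp is strictly positive.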

We consider a class of non-convex penalties where the penalty Jλ satisfies (P1)–(P3):

• (P1) There exists a decreasing function ∇Jλ : [0,∞) → [0,∞) such that $Jλ(x)=∫0x∇Jλ(t)dt$ for all x ≥ 0.

• (P2) There exists a positive constant a such that ∇Jλ(x) = 0 for all x > aλ.

• (P3) λ − x/a ≤ ∇Jλ(x) ≤ λ for all 0 ≤ x < aλ.

This penalty class has been studied in the high dimensional linear regression model (Kim and Kwon, 2012; Zhang and Zhang, 2012; Kim et al., 2016). The class includes the MC and capped ℓ1 penalties (Zhang and Zhang, 2012; Shen et al., 2013) as special examples, which are the lower and upper bounds of the class, respectively, and it also includes the SCAD penalty. (P1) implies that Jλ is a continuous, increasing and concave function defined on [0,∞) with Jλ(0) = 0. If Jλ is differentiable, then ∇Jλ is merely the derivative of Jλ. From conditions (P1)–(P3), we can see that Jλ is locally sub-differentiable at any point x0 ∈ (−∞, −aλ) ∪ {0} ∪ (aλ, ∞) although it is not convex. By (P2), 0 is the unique local subgradient of Jλ at a point x0 with |x0| > aλ, which makes the non-zero elements of the penalized estimator unbiased with finite samples. It is easy to see from (P3) that ∇Jλ(x) ≥ λ/2 for 0 ≤ x ≤ aλ/2, so that the set of local subgradients of Jλ at the origin includes [−λ/2, λ/2], which makes the penalized estimator sparse. These properties of the subgradients play an important role in establishing the oracle properties.
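Conditions (P2) and (P3) are easy to check numerically for the standard SCAD and MC derivative formulas. A sketch (our own code; the derivative expressions follow Fan and Li (2001) and Zhang (2010a)):

```python
import numpy as np

def grad_scad(x, lam, a=3.7):
    # Derivative of the SCAD penalty on [0, infinity).
    return np.where(x <= lam, lam,
                    np.where(x <= a * lam, (a * lam - x) / (a - 1), 0.0))

def grad_mcp(x, lam, a=3.0):
    # Derivative of the MC penalty: (lambda - x/a)_+ .
    return np.maximum(lam - x / a, 0.0)

lam, a = 0.5, 3.0
x = np.linspace(0.0, a * lam - 1e-9, 1000)

# (P3): lambda - x/a <= grad J_lambda(x) <= lambda on [0, a*lambda)
assert np.all(grad_scad(x, lam, a) >= lam - x / a - 1e-12)
assert np.all(grad_scad(x, lam, a) <= lam + 1e-12)
assert np.all(grad_mcp(x, lam, a) <= lam + 1e-12)

# (P2): the derivative vanishes beyond a*lambda
assert grad_scad(2 * a * lam, lam, a) == 0.0
assert grad_mcp(2 * a * lam, lam, a) == 0.0
```

The MC derivative attains the lower bound in (P3) with equality, which is why the MC penalty is the smallest member of the class.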

3.2. Oracle properties

Theorem 1

(Weak oracle property) Assume that (E1)–(E5) with ν ≥ 4 and (P1)–(P3) hold. Let q = |ST| and κ = ||Γp(ST)−1||∞, where Γp(ST) is the q × q submatrix of Γp with rows and columns indexed by ST and ||A||∞ is the maximum absolute row sum of a matrix A. If

$limn→∞(1+κ2) p2n=0, limn→∞log p+κ2 log qnλ2=0, limn→∞λminj∈ST∣βj∣=0$

andmsatisfies

$max0≤j≤p∣mj-μ∣ =OP (1/√n),$

then the oracle least squares estimator (LSE)

$β^o=(β^1o,…,β^po)T=arg minb∈ℝop Q(b;m)$

is a local minimizer of $LλG(b;m)$ with probability tending to 1, where $ℝop={(x1,…,xp)T∈ℝp:xj=0,j∉ST}$.

Theorem 1 shows that the oracle LSE in (3.5) becomes one of the local minimizers of (2.6), which is often referred to as the weak oracle property (Fan and Li, 2001; Kim et al., 2016) in the linear regression model. Note that the result holds for any mj that is a √n-consistent estimator of μ. For example, we may use the trimmed mean instead of the sample mean when there are some outliers.
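The oracle LSE itself is an ordinary least squares fit restricted to the columns indexed by ST. A minimal sketch in Python (our own illustration; `oracle_lse` and its defaults are not from the paper), using the centered choice mj = ȳ:

```python
import numpy as np

def oracle_lse(y, p, S_true, m=None):
    # Least squares on the true support only: beta_j = 0 for j not in S_T.
    # m is the vector (m_0, ..., m_p); default is the centered case m_j = ybar.
    n = len(y)
    if m is None:
        m = np.full(p + 1, y.mean())
    # Column j (1-based lag j) holds y_{t-j} - m_j for t = p+1, ..., n.
    X = np.column_stack([y[p - j:n - j] - m[j] for j in range(1, p + 1)])
    r = y[p:] - m[0]
    beta = np.zeros(p)
    cols = [j - 1 for j in sorted(S_true)]
    beta[cols], *_ = np.linalg.lstsq(X[:, cols], r, rcond=None)
    return beta
```

On a simulated sparse AR process the non-zero entries of the returned vector are consistent for the true coefficients, while the entries outside ST are exactly zero.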

The objective function (2.6) is non-convex, so we need a stronger result than the weak oracle property to avoid bad local minimizers (Zhang, 2010a; Kim and Kwon, 2012). The next theorem shows that the oracle LSE is the unique local minimizer of (2.6) with probability tending to one.

Theorem 2

(Strong oracle property) Assume that the assumptions of Theorem 1 hold. Let ρ = λmin(Γp), where λmin(A) is the smallest eigenvalue of a matrix A. If

$limn→∞p2nρ2=0, limn→∞log p+κ2 log qnρ2λ2=0, limn→∞λρ minj∈ST∣βj∣=0,$

then β̂o is the unique local minimizer of $LλG(b;m)$with probability tending to 1.

Remark 1

Note that κ and ρ are often assumed to be fixed positive constants. In this case, the results in Theorems 1 and 2 hold when

$minj∈ST∣βj∣ ≫λ≫logpn and n≫p2,$

which are exactly the same as the results in the linear regression model. Here, a ≫ b implies b/a = o(1) as n → ∞.

Remark 2

In the linear regression model, ρ = λmin(XTX/n) determines the size of the possible minimum non-zero regression coefficient, where X is the design matrix, since we need $minj∣βj∣ ≥λ≥log p/n$ and $minj∣βj∣ ≥λ/ρ≥log p/nρ4$ for the weak and strong oracle properties, respectively. This implies that the smaller ρ is, the larger minj |βj| is required to be. In the AR process, ρ = λmin(Γp) and κ = ||Γp(ST)−1||∞ play the same role since we need $minj∣βj∣ ≥ λ≥(log p+κ2 log q)/n$ and $minj∣βj∣ ≥λ/ρ≥(log p+κ2 log q)/nρ4$.

Let β̂o,B and β̂o,C be the oracle LSEs that correspond to m = (ȳ0, ȳ1, . . . , ȳp)T and m = (ȳ, . . . , ȳ)T, respectively. By the functional central limit theorem,

$max0≤j≤p∣y¯j-μ∣ ≤ (2/(n-p)) max1≤k≤n|∑t=1k(yt-μ)| = OP (1/√n)$

and $|y¯-μ|=OP(1/√n)$. Hence, from Theorem 2, we see that the two penalized estimators are exactly the same as the corresponding oracle LSEs with probability tending to one, which is summarized in the next corollary.

Corollary 1

Assume that the assumptions of Theorem 2 hold. Then

$limn→∞ P (β^λ,B=β^o,B)=1 and limn→∞ P (β^λ,C=β^o,C)=1.$
Remark 3

From Lemma 2 in Appendix,

$maxj∈SF|β^jo,B-βj|=OP (κlog qn) and maxj∈SF|β^jo,C-βj|=OP (κlog qn),$

which implies

$maxj∈SF|β^jλ,B-β^jλ,C|=OP (κlog qn).$

Hence the two oracle LSEs are asymptotically equivalent, so that centering the samples does not affect the regression parameter estimation asymptotically.

3.3. Tuning parameter selection

The most practical and important issue in penalized estimation is how to select the tuning parameters (Nardi and Rinaldo, 2011; Kwon et al., 2017). Some conventional approaches minimize validation or cross-validation errors. However, selecting tuning parameters based on prediction errors may produce over-fitted sub-models (Wang et al., 2007, 2009). We propose to use the GIC, which is easy to calculate without using extra independent samples, to select the tuning parameters. Given λ > 0, define the GIC as

$GIC(λ)=log Q(β^λ;m)+α‖β^λ‖0,$

where ||β̂λ||0 is the number of non-zero entries of β̂λ and α > 0 is a sequence that controls the penalty on the model size. The next theorem proves that we can select an optimal tuning parameter by minimizing the GIC.
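Computing the GIC over a path of tuning parameters requires only the residual sum of squares and the number of non-zero coefficients of each fit. A toy Python sketch (the path values below are hypothetical; the choice of α follows the one used in Section 4):

```python
import numpy as np

def gic(resid_ss, df, alpha):
    # GIC(lambda) = log Q(beta_hat_lambda; m) + alpha * ||beta_hat_lambda||_0
    return np.log(resid_ss) + alpha * df

# Toy path: lambda -> (residual sum of squares, number of non-zeros).
n, p = 200, 10
alpha = np.log(np.log(p)) * np.log(n) / n
path = {0.1: (150.0, 6), 0.3: (155.0, 3), 0.9: (400.0, 1)}

scores = {lam: gic(q, df, alpha) for lam, (q, df) in path.items()}
best = min(scores, key=scores.get)  # lambda minimizing the GIC
```

Here the mid-sized model wins: the smallest λ barely improves the fit while paying for three extra parameters, and the largest λ underfits badly.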

Theorem 3

Assume that (E1)–(E5) with ν ≥ 4, and (P1)–(P3) hold. If

$limn→∞(1+κ2+ρ-2) p2n=0,$$limn→∞log p+κlog q+ρpρ2n minj∈ST∣βj∣=0,$$limn→∞ pα=0, limn→∞pαρ minj∈ST βj2=0, limn→∞log p+κ2 log qραn=0,$

then there exists λ0such that

$limn→∞ P (infλ∈Λ+∪Λ-GIC(λ)>GIC(λ0))=1,$

where Λ+ = {λ > 0 : Sλ ⊋ ST}, Λ− = {λ > 0 : Sλ ⊉ ST} and $Sλ={j∈SF:β^jλ≠0}$ for λ > 0.

Remark 4

When κ and ρ are positive constants, the result in Theorem 3 holds when

$minj∈ST∣βj∣ ≫pα≫p log pn and n≫p2.$

For example, α = log(log p) log n/n satisfies the condition as suggested in Kwon et al. (2017) and Na (2017) for the adaptive LASSO and GIC, respectively.

4. Numerical studies

We consider two examples to show that the theoretical results hold with finite samples:

Example 1

ST = {1, . . . , q}, $βj*=(c0/j)I(j∈ST)$, μ = 0 and the ɛt are iid samples from the standard normal distribution, where $c0=0.9/(∑j=1q1/j)$.

Example 2

ST = {j1, . . . , jq} is a random subset of SF and $βj*=∑k=1q(c0/k)I(j=jk)$.

In each example, we set n = 100 × 2k, k ∈ {0, 1, . . . , 7}, p = 10[n1/3] and q = [n1/4], where [x] is the integer closest to x.

We consider two non-convex penalties for the centered samples, the SCAD and the MCP, and compare their finite sample performance to some reference methods: four GICs (AIC, HQC, BIC, and GIC6) in Na (2017) and the adaptive LASSO (ALASSO) in Kwon et al. (2017). Tuning parameters are selected by the GIC in (3.8) with α = log(log p) log n/n for all the penalized estimators. We report four measures: sensitivity (SEN), specificity (SPC), selection accuracy (SA), and prediction error (PE) computed from independent test samples $yit$, i = 1, . . . , n. The measures are defined as $∣S^T∩ST∣/∣ST∣$, $∣S^Tc∩STc∣/∣STc∣$, $I(S^T=ST)$, and $∑i=1n(yit-y^it)2/n$, respectively, where $S^T$ is the index set of non-zero parameters estimated by each method and $y^it$ is the corresponding prediction of $yit$. We repeat each simulation 200 times and summarize the results in Tables 1, 2 and Figure 1, where all the standard errors are less than 0.03 and hence omitted.
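The three selection measures reduce to elementary set operations on the estimated and true index sets. A small helper (our own illustration, not part of the simulation code):

```python
def selection_measures(S_hat, S_true, S_full):
    # Sensitivity, specificity and selection accuracy for index sets.
    S_hat, S_true, S_full = set(S_hat), set(S_true), set(S_full)
    sen = len(S_hat & S_true) / len(S_true)                      # true positives found
    spc = len((S_full - S_hat) & (S_full - S_true)) / len(S_full - S_true)
    sa = float(S_hat == S_true)                                  # exact recovery
    return sen, spc, sa

# Hypothetical example: p = 10, true support {1, 2, 3}, estimate {1, 2, 5}.
sen, spc, sa = selection_measures({1, 2, 5}, {1, 2, 3}, set(range(1, 11)))
# sen = 2/3, spc = 6/7, sa = 0.0
```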

The PEs are quite similar for all the methods, but the selection performances are significantly different. The AIC and HQC have better SEN but worse SPC than the others, which results in a very low SA. This shows that the AIC and HQC are not selection consistent (Na, 2017) but over-fit. The SEN, SPC, and SA increase to 1 as the sample size increases for all the penalized estimators as well as for the BIC and GIC6. The simulation results confirm that the theoretical properties provided hold for finite samples.

5. Concluding remarks

We present some asymptotic properties of the non-convex penalized estimators for the AR process, when the maximal order is large and minimal signal size is small. The results show that the non-convex penalized estimation can be used for parameter estimation and model identification simultaneously. This paper is intended to provide a theoretical starting point for future studies on other complex time series analysis.

A referee pointed out that assuming increasing p is unusual in practice because the AR order does not necessarily increase with the sample size. We completely agree that many, or almost all, AR processes have fixed or finite model orders, so that p need not be assumed to increase or vary with the sample size. However, we think that the larger the sample size, the more parameters become statistically significant, which implies that a fixed choice of small p may prevent us from finding important lags. In this sense, the expression “p → ∞ as n → ∞” does not imply the existence of such an order-increasing model but rather that we should try an order as large as possible, considering the sample size. Besides p → ∞, the expression “minj |βj| → 0” can be understood in a similar way.

Acknowledgements

This work was supported by a Kyonggi University Research Grant 2015-099.

Appendix

Let xij = yi−j+p − mj and zij = yi−j+p − μ for i = 1, . . . , n − p and j = 1, . . . , p. Let X = (xij), Z = (zij), y = (yp+1, . . . , yn)T, and ɛ = (ɛp+1, . . . , ɛn)T. For j = 1, . . . , p, let Xj and Zj be the jth column vectors of X and Z, respectively. For a set S = {i1, . . . , ik} ⊂ SF with 1 ≤ i1 < i2 < . . . < ik ≤ p, XS denotes the (n − p) × k matrix with jth column vector Xij, and ZS is defined similarly. Given a p-dimensional vector v = (v1, . . . , vp)T and a set S = {i1, . . . , ik}, vS denotes the k-dimensional subvector (vi1, . . . , vik)T of v. For a matrix A, ||A||2 and ||A||∞ are the induced matrix norms using the Euclidean and maximum norms for vectors, respectively. And for a set S, |S| means the cardinality of S.

Lemma 1

Assume that (2.1), (3.1), (E1)–(E5) with ν ≥ 4, and (3.4) hold. If limn→∞p2/n = 0, then

$maxj∈ST|XjTɛ|=OP (n log q),$$maxj∈SF|XjTɛ|=OP (n log p),$$maxj∈SF|XjTu|=OP (n),$$‖XTXn-p-Γp‖∞=OP (pn),$

whereuis a (np)-dimensional vector with all entries 1.

Proof

First, we can show that there exists a positive constant M such that

$limn→∞ P (maxj∈S|ZjTɛ|>Mn log∣S∣)=0$

for all non-empty subsets S ⊂ SF, in a similar way to the proof of Lemma 1 in Kwon et al. (2017), because {zij} is a stationary predictable process and {ɛi} is a sequence of stationary martingale differences with respect to {ℱt}. Since Xj − Zj = (μ − mj)u for all j ∈ SF and $uTɛ=OP(√n)$, (3.4) implies that

$sup∅≠S⊂SF|maxj∈S|XjTɛ|-maxj∈S|ZjTɛ||≤maxj∈SF|(Xj-Zj)Tɛ|≤maxj∈SF|mj-μ|×|uTɛ|=OP(1).$

Therefore, for a non-empty subset S ⊂ SF,

$maxj∈S|XjTɛ|=maxj∈S|ZjTɛ|+OP(1)=OP (n log∣S∣)$

and hence (A.1) and (A.2) hold.

Next, note that {yt, t ∈ ℤ} is ν-strong stable under assumptions (E1)–(E5). Applying the functional central limit theorem to {yt, t ∈ ℤ} yields

$maxj∈SF|ZjTu|=OP (n)$

and hence

$maxj∈SF|XjTu|=maxj∈SF|ZjTu-(mj-μ) uTu|≤maxj∈SF|ZjTu|+(n-p)×maxj∈SF|mj-μ|=OP (n)$

by (3.4).

Lastly, we can extend Lemma 1 in Kwon et al. (2017) to the case of Z and obtain

$‖ZTZn-p-Γp‖∞=OP (pn)$

due to the ν-strong stability of {yt, t ∈ ℤ}. Also, for any i, j ∈ SF,

$|XiTXj-ZiTZj|=|(μ-mi)(μ-mj)uTu+(μ-mi)uTZj+(μ-mj)uTZi|≤(n-p)×(maxj∈SF|mj-μ|)2+2maxj∈SF|mj-μ|×maxj∈SF|uTZj|.$

Therefore, from (A.5)–(A.8) and (3.4), we can deduce that

$‖XTXn-p-Γp‖∞≤‖ZTZn-p-Γp‖∞+max1≤i≤p∑j=1p|XiTXj-ZiTZj|n-p=OP (pn).$

Lemma 2

Assume that the assumptions of Lemma 1 hold and that limn→∞κ2p2/n = 0. Then,

$maxj∈SF|β^jo-βj|=OP (κlog qn).$
Proof

First, $β^STo$ can be written as follows

$β^STo=(XSTTXST)-1XSTT(y-m0u)=βST+(XSTTXST)-1XSTTɛ+(XSTTXST)-1XSTTu×{∑j∈ST(mj-μ)βj-(m0-μ)},$

provided that $XSTTXST$ is non-singular. Therefore,

$maxj∈SF|β^jo-βj|=maxj∈ST|β^jo-βj|≤‖(XSTTXST)-1‖∞×{maxj∈ST|XjTɛ|+maxj∈SF|XjTu|×max0≤j≤p|mj-μ|×(1+∑j∈ST|βj|)}.$

If $XSTTXST/(n-p)$ satisfies

$‖XSTTXSTn-p-Γp(ST)‖∞×‖Γp(ST)-1‖∞<12,$

then $XSTTXST/(n-p)$ is non-singular and

$‖(XSTTXSTn-p)-1‖∞≤ ‖Γp(ST)-1‖∞+‖(XSTTXSTn-p)-1-Γp(ST)-1‖∞≤ ‖Γp(ST)-1‖∞+‖XSTTXST/(n-p)-Γp(ST)‖∞‖Γp(ST)-1‖∞21-‖XSTTXST/(n-p)-Γp(ST)‖∞×‖Γp(ST)-1‖∞≤ 2‖Γp(ST)-1‖∞.$

Since limn→∞κ2p2/n = 0, the probability that (A.10) holds tends to 1 as n → ∞ by (A.4). Consequently,

$‖(XSTTXST)-1‖∞=OP (κn).$

According to (A.1), (A.3), and (3.4),

$maxj∈ST|XjTɛ|+maxj∈SF|XjTu|×max0≤j≤p|mj-μ|×(1+∑j∈ST∣βj∣)=OP (n log q)+OP(q)$

and this completes the proof.

Now, let us prove the main theorems in Section 3.

Proof of Theorem 1

Since $β^STo$ satisfies the normal equations $XSTTXSTβ^STo=XSTT(y-m0u)$ and $β^SF∩STc=0$, we have

$XST (y-m0u-Xβ^o)=0.$

Therefore, if $minj∈ST∣β^jo∣ >aλ$ and $maxj∉ST∣XjT(y-m0u-Xβ^o)∣ ≤nλ/2$, then β̂o becomes a local minimizer of $LλG(b;m)$ by the KKT conditions and (P1)–(P3).

From (3.3) and Lemma 2, we have

$limn→∞ P (minj∈ST|β^jo|>aλ)≥limn→∞ P (minj∈ST|βj|-maxj∈SF|β^jo-βj|>aλ)=1.$

By using the results of Lemmas 1 and 2 and the boundedness of ||Γp||∞, we obtain that

$maxj∉ST|XjT (y-m0u-Xβ^o)|≤maxj∉ST|XjTX (β-β^o)|+maxj∉ST|XjTɛ|+maxj∉ST|XjT{(Z-X)β+(μ-m0)u}|≤‖XTX‖∞maxj∈SF|β^jo-βj|+maxj∈SF|XjTɛ|+maxj∈SF|XjTu|max0≤j≤p|mj-μ|(1+∑j∈ST|βj|)=OP (κn log q)+OP (n log p)+OP(q)$

and hence (3.3) implies that

$limn→∞ P (maxj∉ST|XjT (y-m0u-Xβ^o)|≤nλ2)=1.$

Therefore

$limn→∞ P (β^o∈Aλ)≥limn→∞ P (minj∈ST|β^jo|>aλ)+limn→∞ P (maxj∉ST|XjT (y-m0u-Xβ^o)|≤nλ2)-1=1,$

where Aλ is the set of all local minimizers of $LλG(b;m)$.

Proof of Theorem 2

For given λ > 0 and c > 0, let Uc be the set of b ∈ ℝp such that

$maxbj=0|XjT(y-m0u-Xb)|n-p

Let ρ̂ be the smallest eigenvalue of XTX/(n − p). By Theorem 1 of Kim and Kwon (2012), β̂o ∈ Uρ̂ with ρ̂ > 0 is a sufficient condition for β̂o to be the unique local minimizer of $LλG(b;m)$. Also, since limn→∞ P(β̂o ∈ Aλ) = 1 by Theorem 1, it is enough to show

$limn→∞ P (β^o∈Uρ^,ρ^>0)=1$

in order to obtain the result of Theorem 2.

Using the condition limn→∞ p2/(nρ2) = 0, (A.4) and the symmetry of XTX/(n − p) − Γp,

$limn→∞ P (‖XTXn-p-Γp‖2>ρ2)≤limn→∞ P (‖XTXn-p-Γp‖∞>ρ2)=0.$

And if ||XTX/(n − p) − Γp||2 ≤ ρ/2, then ρ̂ ≥ ρ/2 > 0 and hence Uρ/2 ⊂ Uρ̂. Therefore, we have

$limn→∞ P (β^o∈Uρ^,ρ^>0)≥limn→∞ P (β^o∈Uρ^,ρ^>0,‖XTXn-p-Γp‖2≤ρ2)≥limn→∞ P (β^o∈Uρ2,‖XTXn-p-Γp‖2≤ρ2)≥limn→∞ P (β^o∈Uρ2).$

Note that $limn→∞ P(minj∈ST∣βj∣ >2 maxj∈SF∣β^jo-βj∣)=1$ by (3.3) and Lemma 2. In this case, minj∈ST |β̂jo| > minj∈ST |βj|/2 > 0 and hence ${j∈SF:β^jo≠0}=ST$. Since inf0<x<aλ(ρx/2 + ∇Jλ(x)) ≥ min(λ, aρλ/2) and sup0<x<aλ(x + 2∇Jλ(x)/ρ) ≤ aλ + 2λ/ρ by (P3), we have

$limn→∞ P (β^o∈Uρ2)≥limn→∞ P (maxj∉ST|XjT (y-m0u-Xβ^o)|n-paλ+2λρ,minj∈ST|βj|>2maxj∈SF|β^jo-βj|)-1=1$

by (3.6) and (A.11). Hence, (A.12) holds.

Lemma 3

For a given S ⊂ SF, let $β˜S=argminb∈ℝSp Q(b;m)$ and define

$GIC˜(S)=log Q(β˜S;m)+α∣S∣,$

where $ℝSp={(x1,…,xp)T∈ℝp:xj=0,j∉S}$. Under the assumptions of Theorem 3,

$limn→∞ P (infS≠STGIC˜(S)>GIC˜(ST))=1.$
Proof

First, note that ρ is positive, ||Γp||2 is bounded, and

$limn→∞ P (ρ2≤ρ^≤‖XTXn-p‖2≤2‖Γp‖2)≥limn→∞ P (‖XTXn-p-Γp‖≤ρ2)=1$

by (A.13). Given a set S ⊂ SF, we can show that $‖ZSTɛ‖2=OP(n∣S∣)$ by extending Lemma 1 of Na (2017). Since $maxj∈SF∣XjTɛ-ZjTɛ∣=maxj∈SF∣mj-μ∣ ∣uTɛ∣=OP(1)$,

$‖XSTɛ‖2≤‖ZSTɛ‖2+∣S∣maxj∈SF|XjTɛ-ZjTɛ|=OP (n∣S∣).$

From (A.6), we have

$‖XSTu‖2≤∣S∣maxj∈SF|XjTu|=OP (n∣S∣).$

Therefore, combining (A.15)–(A.17) and (3.4) gives that

$‖β˜SF-β‖2I(ρ^≥ρ2)≤‖(XTX)-1‖2{‖XTɛ‖2+‖XTu‖2 max0≤j≤p∣mj-μ∣(1+∑j∈ST|βj|)}=OP (pnρ2)$

and consequently, we have

$‖β˜SF-β‖2=OP (pnρ2) and Q(β˜SF;m)n-p=σɛ2+oP(1).$

For a while, we assume that minj∈ST |βj| > 2 maxj∈SF |β̃jSF − βj|, ρ̂ ≥ ρ/2, and $Q(β˜SF;m)≥(n-p)σɛ2/2$. Then for any S ⊉ ST,

$log Q(β˜S;m)-log Q(β˜SF;m)≥min {(Q(β˜S;m)-Q(β˜SF;m))(2Q(β˜SF;m)),12}≥min {ρ^‖β˜SF-β˜S‖22σɛ2,12}≥min {ρ^ minj∈ST|β˜jSF|2σɛ2,12}≥min {ρ minj∈ST∣βj∣24σɛ2,12},$

because log(1 + x) ≥ min(x/2, 1/2) for all x > 0. Therefore, we obtain that

$limn→∞ P (infS⊉STGIC˜(S)>GIC˜(SF))≥limn→∞ P (min {ρ minj∈ST∣βj∣24σɛ2,12}>pα)=1$

by using (3.10), (3.11), (A.15), and (A.18).

Since log(1 + x) ≤ x for all x > −1, we have

$log Q(β˜ST;m)-log Q(β˜S;m)≤(Q(β˜ST;m)-Q(β˜S;m))Q(β˜S;m)≤‖(XTX)-1‖2‖XS∩STcT (In-p-XST(XSTTXST)-1XSTT) (y-m0u)‖22Q(β˜ST;m)$

for all sets S ⊋ ST, where In−p is the (n − p)-dimensional identity matrix. Also,

$supS⊋ST‖XS∩STcT (In-p-XST(XSTTXST)-1XSTT)(y-m0u)‖2/|S∩STc|≤supS⊋ST‖XS∩STcT (In-p-XST(XSTTXST)-1XSTT)(y-m0u-XSTβST)‖∞≤ ‖XTɛ‖∞+‖XTXST‖∞‖(XSTTXST)-1‖∞‖XSTTɛ‖∞+(‖XTu‖∞+‖XTXST‖∞‖(XSTTXST)-1‖∞‖XSTTu‖∞)max0≤j≤p∣mj-μ∣(1+∑j∈ST∣βj∣)=OP (n log p)+OP (κn log q)+OP(qκ).$

Hence, (3.11), (A.15), and (A.18) imply that

$limn→∞ P (infS⊋STGIC˜(S)-GIC˜(ST)|S∩STc|>0)=1.$

Therefore, we obtain the result (A.14) by (A.19) and (A.20).

Proof of Theorem 3

By Theorems 1 and 2, there exists a sequence λ0 such that

$limn→∞ P (β^λ0=β^o,Sλ0=ST)=1.$

Since $GIC(λ)≥GIC˜(Sλ)$ for all λ > 0 and $GIC(λ0)=GIC˜(ST)$ when β̂λ0 = β̂o and Sλ0 = ST, we have

$limn→∞ P (infλ∈Λ+∪Λ-GIC(λ)>GIC(λ0))≥limn→∞ P (infλ∈Λ+∪Λ-GIC(λ)>GIC(λ0),β^λ0=β^o,Sλ0=ST)≥limn→∞ P (infλ∈Λ+∪Λ-GIC˜(Sλ)>GIC˜(ST),β^λ0=β^o,Sλ0=ST)≥limn→∞ P (infS≠STGIC˜(S)>GIC˜(ST))-limn→∞ P (β^λ0≠β^o or Sλ0≠ST)=1$

by Lemma 3.

Figures
Fig. 1. Sensitivity, specificity, selection accuracy and prediction error in Example 1 (upper) and 2 (lower). AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty.
TABLES

Table 1

Simulation results from Example 1

n p AIC HQC BIC GIC6 ALASSO SCAD MCP
SEN 100 50 0.850 0.770 0.700 0.630 0.650 0.650 0.650
200 60 0.898 0.805 0.715 0.655 0.670 0.668 0.668
400 70 0.978 0.952 0.898 0.850 0.868 0.878 0.860
800 90 0.994 0.972 0.934 0.872 0.908 0.908 0.908
1600 120 0.998 0.992 0.973 0.925 0.955 0.955 0.955
3200 150 1.000 0.995 0.974 0.925 0.971 0.971 0.970
6400 190 1.000 1.000 0.996 0.971 0.993 0.993 0.993
12800 230 1.000 1.000 1.000 0.992 0.997 0.997 0.997

SPC 100 50 0.784 0.909 0.968 0.986 0.982 0.982 0.982
200 60 0.819 0.941 0.983 0.995 0.990 0.990 0.990
400 70 0.837 0.947 0.992 0.998 0.996 0.996 0.996
800 90 0.838 0.955 0.994 0.999 0.997 0.996 0.997
1600 120 0.828 0.959 0.997 0.999 0.998 0.998 0.998
3200 150 0.844 0.967 0.998 1.000 0.999 0.999 0.999
6400 190 0.840 0.967 0.998 1.000 1.000 1.000 0.999
12800 230 0.869 0.972 0.998 1.000 1.000 1.000 1.000

SA 100 50 0.000 0.040 0.070 0.050 0.080 0.080 0.080
200 60 0.000 0.030 0.070 0.070 0.060 0.060 0.060
400 70 0.000 0.120 0.400 0.370 0.420 0.430 0.410
800 90 0.000 0.110 0.510 0.390 0.460 0.470 0.470
1600 120 0.000 0.060 0.620 0.490 0.620 0.620 0.630
3200 150 0.000 0.080 0.600 0.390 0.690 0.700 0.690
6400 190 0.000 0.050 0.660 0.740 0.860 0.860 0.850
12800 230 0.000 0.020 0.730 0.890 0.900 0.900 0.900

PE 100 50 1.591 1.388 1.312 1.201 1.260 1.265 1.267
200 60 1.198 1.126 1.098 1.093 1.122 1.123 1.120
400 70 1.116 1.074 1.047 1.044 1.056 1.054 1.056
800 90 1.067 1.034 1.018 1.020 1.026 1.027 1.026
1600 120 1.038 1.015 1.002 1.004 1.007 1.007 1.007
3200 150 1.024 1.009 1.002 1.004 1.005 1.005 1.005
6400 190 1.018 1.008 1.003 1.004 1.004 1.004 1.004
12800 230 1.009 1.003 1.000 1.000 1.001 1.001 1.001

AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty; SEN = sensitivity; SPC = specificity; SA = selection accuracy; PE = prediction error.

Table 2

Simulation results from Example 2

n p AIC HQC BIC GIC6 ALASSO SCAD MCP
SEN 100 50 0.877 0.830 0.747 0.663 0.577 0.563 0.563
200 60 0.910 0.835 0.718 0.655 0.602 0.598 0.598
400 70 0.982 0.970 0.940 0.902 0.862 0.865 0.865
800 90 0.996 0.986 0.966 0.920 0.928 0.924 0.918
1600 120 0.998 0.993 0.978 0.948 0.967 0.968 0.967
3200 150 1.000 1.000 0.991 0.958 0.976 0.978 0.976
6400 190 1.000 1.000 1.000 0.990 0.997 0.997 0.997
12800 230 1.000 1.000 1.000 0.993 0.998 0.998 0.998

SPC 100 50 0.800 0.897 0.961 0.980 0.985 0.986 0.985
200 60 0.831 0.943 0.976 0.990 0.989 0.989 0.989
400 70 0.850 0.946 0.984 0.997 0.993 0.992 0.992
800 90 0.844 0.960 0.993 0.998 0.995 0.995 0.995
1600 120 0.834 0.957 0.996 1.000 0.998 0.998 0.998
3200 150 0.848 0.967 0.996 1.000 0.998 0.998 0.998
6400 190 0.843 0.964 0.997 1.000 0.999 0.999 0.999
12800 230 0.853 0.969 0.998 1.000 0.999 0.999 0.999

SA 100 50 0.000 0.010 0.070 0.130 0.130 0.130 0.130
200 60 0.000 0.060 0.080 0.060 0.070 0.070 0.070
400 70 0.000 0.070 0.310 0.550 0.330 0.310 0.310
800 90 0.000 0.070 0.540 0.560 0.520 0.510 0.490
1600 120 0.000 0.020 0.560 0.650 0.610 0.620 0.620
3200 150 0.000 0.010 0.590 0.680 0.600 0.610 0.600
6400 190 0.000 0.020 0.560 0.880 0.780 0.780 0.780
12800 230 0.000 0.000 0.690 0.910 0.820 0.800 0.830

PE 100 50 1.402 1.282 1.210 1.191 1.210 1.218 1.214
200 60 1.211 1.128 1.110 1.106 1.133 1.134 1.132
400 70 1.096 1.055 1.034 1.026 1.050 1.048 1.047
800 90 1.070 1.034 1.019 1.021 1.032 1.033 1.032
1600 120 1.045 1.020 1.007 1.006 1.013 1.013 1.012
3200 150 1.028 1.013 1.006 1.006 1.010 1.010 1.010
6400 190 1.019 1.010 1.004 1.004 1.006 1.006 1.006
12800 230 1.010 1.004 1.001 1.001 1.002 1.002 1.002

AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty; SEN = sensitivity; SPC = specificity; SA = selection accuracy; PE = prediction error.

References
1. Akaike, H (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics. 21, 243-247.
2. Akaike, H 1973. Information theory and an extension of the maximum likelihood principle., Proceeding 2nd International Symposium on Information Theory, pp.267-281.
3. Akaike, H (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika. 66, 237-242.
4. Bollerslev, T (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics. 31, 307-327.
5. Brockwell, PJ, and Davis, RA (2006). Time Series: Theory and Methods. New York: Springer
6. Chen, C (1999). Subset selection of autoregressive time series models. Journal of Forecasting. 18, 505-516.
7. Claeskens, G, Croux, C, and Van Kerckhoven, J (2007). Prediction focused model selection for autoregressive models. The Australian and New Zealand Journal of Statistics. 49, 359-379.
8. Claeskens, G, and Hjort, NL (2003). The focussed information criterion. Journal of the American Statistical Association. 98, 900-916.
9. Fan, J, and Li, R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 96, 1348-1360.
10. Fan, J, and Peng, H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics. 32, 928-961.
11. Friedman, J, Hastie, T, Höfling, H, and Tibshirani, R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics. 1, 302-332.
12. Hannan, EJ (1980). The estimation of the order of an ARMA process. The Annals of Statistics. 8, 1071-1081.
13. Hannan, EJ, and Quinn, BG (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B. 41, 190-195.
14. Huang, J, Horowitz, JL, and Ma, S (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics. 36, 587-613.
15. Kim, Y, Choi, H, and Oh, H (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 103, 1656-1673.
16. Kim, Y, Jeon, JJ, and Han, S (2016). A necessary condition for the strong oracle property. Scandinavian Journal of Statistics. 43, 610-624.
17. Kim, Y, and Kwon, S (2012). Global optimality of nonconvex penalized estimators. Biometrika. 99, 315-325.
18. Kim, Y, Kwon, S, and Choi, H (2012). Consistent model selection criteria on high dimensions. Journal of Machine Learning Research. 13, 1037-1057.
19. Kwon, S, and Kim, Y (2012). Large sample properties of the SCAD-penalized maximum likelihood estimation on high dimensions. Statistica Sinica. 22, 629-653.
20. Kwon, S, Lee, S, and Na, O (2017). Tuning parameter selection for the adaptive lasso in the autoregressive model. Journal of the Korean Statistical Society. 46, 285-297.
21. Kwon, S, Oh, S, and Lee, Y (2016). The use of random-effect models for high-dimensional variable selection problems. Computational Statistics & Data Analysis. 103, 401-412.
22. Lee, S, Kwon, S, and Kim, Y (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics & Data Analysis. 94, 275-286.
23. McClave, J (1975). Subset autoregression. Technometrics. 17, 213-220.
24. McLeod, AI, and Zhang, Y (2006). Partial autocorrelation parametrization for subset autoregression. Journal of Time Series Analysis. 27, 599-612.
25. Na, O (2017). Generalized information criterion for the AR model. Journal of the Korean Statistical Society. 46, 146-160.
26. Nardi, Y, and Rinaldo, A (2011). Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis. 102, 528-549.
27. Sang, H, and Sun, Y (2015). Simultaneous sparse model selection and coefficient estimation for heavy-tailed autoregressive processes. Statistics. 49, 187-208.
28. Sarkar, A, and Kanjilal, PP (1995). On a method of identification of best subset model from full AR-model. Communications in Statistics-Theory and Methods. 24, 1551-1567.
29. Schmidt, DF, and Makalic, E (2013). Estimation of stationary autoregressive models with the Bayesian LASSO. Journal of Time Series Analysis. 34, 517-531.
30. Schwarz, G (1978). Estimating the dimension of a model. The Annals of Statistics. 6, 461-464.
31. Shen, X, Pan, W, Zhu, Y, and Zhou, H (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics. 1, 1-26.
32. Shibata, R (1976). Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika. 63, 117-126.
33. Tibshirani, R (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 58, 267-288.
34. Tsay, RS (1984). Order selection in nonstationary autoregressive models. The Annals of Statistics. 12, 1425-1433.
35. Wang, H, Li, B, and Leng, C (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 71, 671-683.
36. Wang, H, Li, R, and Tsai, C (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 94, 553-568.
37. Wu, WB (2005). Nonlinear system theory: another look at dependence. Proceedings of the National Academy of Sciences of the United States of America. 102, 14150-14154.
38. Wu, WB (2011). Asymptotic theory for stationary processes. Statistics and Its Interface. 4, 207-226.
39. Ye, F, and Zhang, CH (2010). Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls. Journal of Machine Learning Research. 11, 3519-3540.
40. Zhang, CH (2010a). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 38, 894-942.
41. Zhang, CH, and Zhang, T (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 27, 576-593.
42. Zhang, T (2010b). Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research. 11, 1081-1107.
43. Zou, H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 101, 1418-1429.
44. Zou, H, and Li, R (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 36, 1509-1533.