Non-convex penalized estimation for the AR process
Communications for Statistical Applications and Methods 2018;25:453-470
Published online September 30, 2018
© 2018 Korean Statistical Society.

Okyoung Na^a and Sunghoon Kwon^{1,b}

^a Department of Applied Statistics, Kyonggi University, Korea; ^b Department of Applied Statistics, Konkuk University, Korea
Correspondence to: ^1 Department of Applied Statistics, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Korea. E-mail: shkwon0522@konkuk.ac.kr
Received January 22, 2018; Revised April 3, 2018; Accepted April 4, 2018.
 Abstract

We study how to distinguish the parameters of the sparse autoregressive (AR) process from zero using non-convex penalized estimation. A class of non-convex penalties is considered that includes the smoothly clipped absolute deviation and minimax concave penalties as special examples. We prove that the penalized estimators achieve standard theoretical properties, such as the weak and strong oracle properties, that have been established in the sparse linear regression framework. The results hold when the maximal order of the AR process increases to infinity and the minimal size of the true non-zero parameters decreases toward zero as the sample size increases. Further, we construct a practical method for selecting the tuning parameters using the generalized information criterion, whose minimizer asymptotically recovers the best theoretical non-penalized estimator of the sparse AR process. Simulation studies are given to confirm the theoretical results.

Keywords : autoregressive process, subset selection, non-convex penalty, oracle property, tuning parameter selection
1. Introduction

The autoregressive (AR) process is a basic and important process for time series data analysis. The usual least squares estimation may yield severe modeling biases when the AR model is sparse, that is, when it includes zero parameters. Various information criteria (Akaike, 1969, 1973, 1979; Schwarz, 1978; Hannan and Quinn, 1979; Claeskens and Hjort, 2003) have been proposed to identify the true non-zero parameters of the AR process, which we call the subset selection problem in this paper. Theoretical properties such as asymptotic efficiency and selection consistency of the final sub-process selected by these information criteria have also been investigated (Shibata, 1976; Hannan and Quinn, 1979; Tsay, 1984; Claeskens and Hjort, 2003; Claeskens et al., 2007). Recently, Na (2017) introduced the generalized information criterion (GIC) (Kim et al., 2012) for the AR process, which includes most information criteria such as the Akaike information criterion (AIC) (Akaike, 1973), the Hannan-Quinn criterion (HQC) (Hannan, 1980), and the Bayesian information criterion (BIC) (Schwarz, 1978). Na (2017) proved that there is a large class of GICs that are selection consistent, including the BIC as an example. However, these approaches suffer from computational complexity since it is almost impossible to compare all the candidate sub-processes when the maximal order is very large (McClave, 1975; Sarkar and Kanjilal, 1995; Chen, 1999; McLeod and Zhang, 2006).

For years, penalized estimation has been studied as an alternative for the subset selection problem (Nardi and Rinaldo, 2011; Schmidt and Makalic, 2013; Sang and Sun, 2015; Kwon et al., 2017). Penalized estimation has nice asymptotic properties, such as selection consistency and minimax optimality, for various statistical models including the generalized linear regression model (Fan and Peng, 2004; Zou, 2006; Ye and Zhang, 2010; Kwon and Kim, 2012). Moreover, a major advantage of penalized estimation is its computational efficiency, since there exist many fast and efficient algorithms (Friedman et al., 2007; Kim et al., 2008; Lee et al., 2016). Hence we need not exhaustively search all the possible candidate sub-models when the AR process has a very large model order.

There are many possible penalty functions for penalized estimation, such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) and the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001). Some non-convex penalties, including the SCAD, have very distinct advantages over others such as the bridge (Huang et al., 2008; Kim et al., 2016) and the log penalty (Zou and Li, 2008; Kwon et al., 2016). First, they produce unbiased estimators of the parameters, which helps us understand the final model without considering the penalty effect. Second, they exactly select the non-zero parameters with probability tending to one, which is impossible for other penalty functions such as the LASSO (Zhang, 2010b; Kim and Kwon, 2012; Zhang and Zhang, 2012) and the ridge.

In this paper, we study the subset selection problem using the non-convex penalized estimation that has been used to identify non-zero parameters in various statistical models (Fan and Li, 2001; Zhang, 2010a; Kwon and Kim, 2012; Shen et al., 2013). A large class of non-convex penalties is considered that includes the SCAD and the minimax concave penalty (MCP) (Zhang, 2010a) as examples (Kim and Kwon, 2012; Zhang and Zhang, 2012). We first prove several asymptotic properties of the non-convex penalized estimators, such as the weak and strong oracle properties, that are standard in the sparse linear regression framework (Kim et al., 2016). Second, we prove that optimal tuning parameters in the penalty can be chosen by using an information criterion whose minimizer exactly identifies the true zero and non-zero parameters asymptotically.

The results hold when the candidate maximal order of the AR process increases to infinity and the minimal size of the true non-zero parameters decreases toward zero as the sample size increases. Further, we consider a class of error processes for the AR process that includes independently and identically distributed (iid), autoregressive conditional heteroscedastic (ARCH) and generalized ARCH (GARCH) processes, which is large enough to cover most of the recent and related literature (Nardi and Rinaldo, 2011; Schmidt and Makalic, 2013; Sang and Sun, 2015; Kwon et al., 2017).

We introduce the non-convex penalized estimation for the AR process in Section 2. Asymptotic properties of the penalized estimator are presented in Section 3, introducing an information criterion for the tuning parameter selection. Simulation studies and proofs are given in Section 4 and Appendix, respectively.

2. Non-convex penalized estimation for the AR process

Consider the AR process {yt, t ∈ ℤ},

$$y_t - \mu = \sum_{j=1}^{p} \beta_j (y_{t-j} - \mu) + \varepsilon_t, \qquad t \in \mathbb{Z},$$

where p is a positive integer, μ ∈ ℝ and β = (β1, . . . , βp)T ∈ ℝp are parameters of interest, {ɛt, t ∈ ℤ} is an error process, and ℝ and ℤ are the sets of real numbers and integers, respectively. We assume that the process is sparse, that is, βj = 0 for some j ∈ S_F, where S_F = {1, . . . , p} denotes the full index set of the regression parameters. In this case, we can estimate the true non-zero index set S_T = {j ∈ S_F : βj ≠ 0} by minimizing the penalized sum of squared residuals

$$L_\lambda(u, b) = \sum_{t=p+1}^{n} \Big\{ (y_t - u) - \sum_{j=1}^{p} b_j (y_{t-j} - u) \Big\}^2 + 2(n-p) \sum_{j=1}^{p} J_\lambda(|b_j|)$$

with respect to u ∈ ℝ and b = (b1, . . . , bp)T ∈ ℝp. Here, Jλ is a non-convex penalty with tuning parameter λ > 0 that is included in the penalty class defined in the next section.

Let $\bar{y}_j = \sum_{t=p+1}^{n} y_{t-j}/(n-p)$ for j = 0, 1, . . . , p. Then Lλ(u, b) can be decomposed as

$$L_\lambda(u, b) = L_\lambda^A(u, b) + L_\lambda^B(b),$$

where $L_\lambda^A(u, b) = (n-p)\big\{\bar{y}_0 - \sum_{j=1}^{p} b_j \bar{y}_j - (1 - \sum_{j=1}^{p} b_j)u\big\}^2$ and

$$L_\lambda^B(b) = \sum_{t=p+1}^{n} \Big\{ (y_t - \bar{y}_0) - \sum_{j=1}^{p} b_j (y_{t-j} - \bar{y}_j) \Big\}^2 + 2(n-p) \sum_{j=1}^{p} J_\lambda(|b_j|).$$
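To see why the decomposition holds, one can split each residual into a centered part and a constant part; the following short verification (our addition, using only the definition of $\bar{y}_j$ above) makes the step explicit.

```latex
% Split the t-th residual:
% (y_t - u) - \sum_j b_j (y_{t-j} - u) = A_t(b) + B(u, b), where
%   A_t(b)  = (y_t - \bar{y}_0) - \sum_j b_j (y_{t-j} - \bar{y}_j),
%   B(u, b) = \bar{y}_0 - \sum_j b_j \bar{y}_j - (1 - \sum_j b_j) u.
% Since \sum_{t=p+1}^{n} A_t(b) = 0 by the definition of \bar{y}_j,
% the cross term vanishes when squaring and summing over t:
\begin{align*}
\sum_{t=p+1}^{n}\bigl\{A_t(b) + B(u,b)\bigr\}^2
  &= \sum_{t=p+1}^{n} A_t(b)^2 + 2B(u,b)\sum_{t=p+1}^{n} A_t(b) + (n-p)B(u,b)^2 \\
  &= \sum_{t=p+1}^{n} A_t(b)^2 + (n-p)B(u,b)^2 .
\end{align*}
% Adding the penalty 2(n-p)\sum_j J_\lambda(|b_j|) to the first term gives
% L_\lambda^B(b), and the second term is L_\lambda^A(u, b).
```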

Let $(\hat{\mu}^{\lambda,A}, \hat{\beta}^{\lambda,B})$ be the minimizer of $L_\lambda(u, b)$. Then it is easy to see that

$$\hat{\beta}^{\lambda,B} = (\hat{\beta}_1^{\lambda,B}, \ldots, \hat{\beta}_p^{\lambda,B})^T = \arg\min_{b \in \mathbb{R}^p} L_\lambda^B(b)$$

and $\hat{\mu}^{\lambda,A} = (\bar{y}_0 - \sum_{j=1}^{p} \hat{\beta}_j^{\lambda,B} \bar{y}_j)/(1 - \sum_{j=1}^{p} \hat{\beta}_j^{\lambda,B})$. Once $\hat{\beta}^{\lambda,B}$ is obtained, we can estimate the true index set S_T by using the set $\{j : \hat{\beta}_j^{\lambda,B} \neq 0\}$. We often estimate μ by the sample mean $\bar{y} = \sum_{t=1}^{n} y_t/n$ before estimating β, which is the same as estimating β based on the centered samples $y_t - \bar{y}$, t = 1, . . . , n. In this case, the penalized estimator, $\hat{\beta}^{\lambda,C}$, can be defined as

$$\hat{\beta}^{\lambda,C} = (\hat{\beta}_1^{\lambda,C}, \ldots, \hat{\beta}_p^{\lambda,C})^T = \arg\min_{b \in \mathbb{R}^p} L_\lambda^C(b),$$

where $L_\lambda^C(b) = L_\lambda(\bar{y}, b)$.

In this paper, we define the penalized estimator of the regression parameter β as

$$\hat{\beta}^{\lambda} = (\hat{\beta}_1^{\lambda}, \ldots, \hat{\beta}_p^{\lambda})^T = \arg\min_{b \in \mathbb{R}^p} L_\lambda^G(b; m),$$

where

$$L_\lambda^G(b; m) = Q(b; m) + 2(n-p) \sum_{j=1}^{p} J_\lambda(|b_j|),$$

$m = (m_0, m_1, \ldots, m_p)^T \in \mathbb{R}^{p+1}$ and

$$Q(b; m) = \sum_{t=p+1}^{n} \Big\{ (y_t - m_0) - \sum_{j=1}^{p} b_j (y_{t-j} - m_j) \Big\}^2.$$

The definition is general enough to include the above two special cases. If we set $m = (\bar{y}_0, \bar{y}_1, \ldots, \bar{y}_p)^T$ and $m = (\bar{y}, \bar{y}, \ldots, \bar{y})^T$, then $L_\lambda^G(b; m)$ becomes $L_\lambda^B(b)$ and $L_\lambda^C(b)$, respectively.
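For concreteness, the following Python sketch (our illustration, not the authors' code; all function names are ours) minimizes $L_\lambda^G(b; m)$ for the centered case by cyclic coordinate descent with the MC penalty. In practice one would use the fast algorithms cited in Section 1 (Friedman et al., 2007; Lee et al., 2016); here the columns are standardized so that the usual closed-form MCP update applies, which is a simplifying choice of ours (the penalty is then applied on the standardized scale).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mcp_update(z, lam, a):
    """Closed-form minimizer of (b - z)^2 + 2*J_lambda(|b|) for the MC penalty
    with a > 1, assuming the corresponding column has mean square one."""
    if abs(z) <= a * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / a)
    return z  # no shrinkage beyond a*lam, which reflects the unbiasedness in (P2)

def penalized_ar_fit(y, p, lam, a=3.0, n_iter=200, tol=1e-8):
    """Minimal sketch: MCP-penalized least squares for a sparse AR(p) model,
    using centered samples (m_j equal to the sample mean for all j)."""
    y = np.asarray(y, dtype=float)
    n, m = len(y), y.mean()
    r = y[p:] - m                                      # responses y_t - m
    X = np.column_stack([y[p - j:n - j] - m for j in range(1, p + 1)])
    scale = np.sqrt((X ** 2).mean(axis=0))             # root mean square of each column
    Xs = X / scale                                     # standardized lag columns
    b = np.zeros(p)
    resid = r - Xs @ b
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(p):
            z = b[j] + Xs[:, j] @ resid / (n - p)      # partial residual correlation
            b_new = mcp_update(z, lam, a)
            if b_new != b[j]:
                resid -= Xs[:, j] * (b_new - b[j])
                b[j] = b_new
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b / scale                                   # coefficients on the original scale
```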

3. Asymptotic properties

In this section, we investigate some asymptotic properties, such as the weak and strong oracle properties, that have been proved in the sparse linear regression framework. The results hold when p → ∞ and min_{j∈S_T} |βj| → 0 as n → ∞. Further, we propose the GIC that asymptotically recovers the best theoretical non-penalized estimator of the AR process.

3.1. Model assumptions and penalty class

We assume the following conditions (E1)–(E5):

  • (E1) {yt, t ∈ ℤ} is stationary and there exists an absolutely summable sequence {ψi, i ∈ ℕ} such that

    $$y_t - \mu = \varepsilon_t + \sum_{i=1}^{\infty} \psi_i \varepsilon_{t-i}, \qquad t \in \mathbb{Z},$$

    where ℕ is the set of natural numbers.

  • (E2) {ɛt, t ∈ ℤ} is a white noise with mean 0 and positive variance $\sigma_\varepsilon^2$.

  • (E3) {ɛt, t ∈ ℤ} is a sequence of martingale differences with respect to a filtration $\{\mathcal{F}_t, t \in \mathbb{Z}\}$.

  • (E4) {ɛt, t ∈ ℤ} takes the form

    $$\varepsilon_t = g(\eta_t, \eta_{t-1}, \ldots),$$

    where ηt, t ∈ ℤ, are iid random variables and g is a measurable function.

  • (E5) $E(|\varepsilon_1|^\nu) < \infty$ and {ɛt, t ∈ ℤ} is ν-strong stable with ν ≥ 2 (Wu, 2005). Here, ν-strong stability means that

    $$\Delta_\nu^\varepsilon = \sum_{i=0}^{\infty} \delta_\nu^\varepsilon(i) < \infty,$$

    where $\delta_\nu^\varepsilon(i) = \{E(|\varepsilon_i - g(\eta_i, \ldots, \eta_1, \eta_0^*, \eta_{-1}, \ldots)|^\nu)\}^{1/\nu}$ and {ηt*, t ∈ ℤ} is an iid copy of {ηt, t ∈ ℤ}.

The class of error processes satisfying conditions (E2)–(E5) includes iid, ARCH, and GARCH processes. For example, if the errors are iid with $E(|\varepsilon_1|^\nu) < \infty$, then $\Delta_\nu^\varepsilon = \delta_\nu^\varepsilon(0) \le 2\{E(|\varepsilon_1|^\nu)\}^{1/\nu} < \infty$ and $E(\varepsilon_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) = 0$, and consequently (E2)–(E5) hold. From Bollerslev (1986) and Wu (2011), GARCH processes satisfy conditions (E2)–(E5) as well. Let γ : ℤ → ℝ be the autocovariance function of the process {yt, t ∈ ℤ}. From (E1)–(E3), $\gamma(0) = \sum_{i=0}^{\infty} \psi_i^2 \sigma_\varepsilon^2 \in (0, \infty)$ and $\gamma(h) = \sum_{i=0}^{\infty} \psi_i \psi_{i+h} \sigma_\varepsilon^2$ converges to 0 as h increases to ∞, where ψ0 = 1. Therefore, the autocovariance matrix $\Gamma_p = (\gamma(i-j))_{1 \le i, j \le p}$ is positive definite by Proposition 5.1.1 of Brockwell and Davis (2006). Also, γ is an absolutely summable autocovariance function because of the absolute summability of {ψi, i ∈ ℕ}.
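To illustrate the scope of (E2)–(E5), the following Python sketch (ours; the parameter values are arbitrary choices with α + β < 1 so that the GARCH recursion is stationary) simulates a sparse AR process driven by GARCH(1,1) innovations.

```python
import numpy as np

rng = np.random.default_rng(0)

def garch11_noise(n, omega=0.05, alpha=0.1, beta=0.85, burn=500):
    """GARCH(1,1) errors: eps_t = sigma_t * eta_t with sigma_t^2 = omega
    + alpha * eps_{t-1}^2 + beta * sigma_{t-1}^2 and iid N(0,1) eta_t."""
    eta = rng.standard_normal(n + burn)
    eps = np.zeros(n + burn)
    sig2 = omega / (1.0 - alpha - beta)          # start at the stationary variance
    for t in range(n + burn):
        if t > 0:
            sig2 = omega + alpha * eps[t - 1] ** 2 + beta * sig2
        eps[t] = np.sqrt(sig2) * eta[t]
    return eps[burn:]

def sparse_ar(eps, beta_coef, mu=0.0, burn=500):
    """y_t - mu = sum_j beta_j (y_{t-j} - mu) + eps_t; the first `burn`
    observations are discarded to remove the effect of the initialization."""
    p = len(beta_coef)
    y = np.full(len(eps), mu)
    for t in range(p, len(eps)):
        y[t] = mu + beta_coef @ (y[t - p:t][::-1] - mu) + eps[t]
    return y[burn:]

# A sparse AR(10) with only lags 1 and 3 active, driven by GARCH(1,1) errors.
beta_true = np.zeros(10)
beta_true[[0, 2]] = [0.5, 0.3]
y = sparse_ar(garch11_noise(2500), beta_true)    # length 2000 after the burn-in
```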

We consider a class of non-convex penalties where the penalty Jλ satisfies (P1)–(P3):

  • (P1) There exists a decreasing function ∇Jλ : [0,∞) → [0,∞) such that $J_\lambda(x) = \int_0^x \nabla J_\lambda(t)\,dt$ for all x ≥ 0.

  • (P2) There exists a positive constant a such that ∇Jλ(x) = 0 for all x > aλ.

  • (P3) λ − x/a ≤ ∇Jλ(x) ≤ λ for all 0 ≤ x < aλ.

The penalty class has been studied in the high dimensional linear regression model (Kim and Kwon, 2012; Zhang and Zhang, 2012; Kim et al., 2016). The class includes the MC and capped ℓ1 penalties (Zhang and Zhang, 2012; Shen et al., 2013) as special examples, which are the lower and upper bounds of the class, respectively, and it also includes the SCAD penalty. (P1) implies that Jλ is a continuous, increasing, and concave function defined on [0,∞) with Jλ(0) = 0. If Jλ is differentiable, then ∇Jλ is merely the derivative of Jλ. From conditions (P1)–(P3), we can see that Jλ is locally sub-differentiable at any point x0 ∈ (−∞, −aλ) ∪ {0} ∪ (aλ, ∞) although it is not convex. By (P2), 0 is the unique local subgradient of Jλ at a point x0 with |x0| > aλ, which makes the non-zero elements of the penalized estimator unbiased in finite samples. From (P3), ∇Jλ(x) ≥ λ/2 for 0 ≤ x ≤ aλ/2, so that the set of local subgradients of Jλ at the origin includes [−λ/2, λ/2], which makes the penalized estimator sparse. These properties of the subgradients play an important role in establishing the oracle properties.
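For reference, here is a small Python sketch (ours) of the SCAD and MC penalty derivatives used later in the simulations; the closed forms are the standard ones from Fan and Li (2001) and Zhang (2010a), and the numerical integration mirrors (P1).

```python
import numpy as np

def scad_grad(x, lam, a=3.7):
    """Derivative of the SCAD penalty (Fan and Li, 2001); requires a > 2."""
    x = np.abs(x)
    return np.where(x <= lam, lam, np.maximum(a * lam - x, 0.0) / (a - 1.0))

def mcp_grad(x, lam, a=3.0):
    """Derivative of the MC penalty (Zhang, 2010a); requires a > 1."""
    return np.maximum(lam - np.abs(x) / a, 0.0)

def penalty_value(grad, x, lam, a, num=2001):
    """J_lambda(x) = int_0^x grad(t) dt, evaluated by the trapezoidal rule as in (P1)."""
    t = np.linspace(0.0, abs(x), num)
    g = grad(t, lam, a)
    return float(np.sum((g[:-1] + g[1:]) * np.diff(t)) / 2.0)

# Both derivatives vanish beyond a*lam (P2) and lie between lam - x/a and lam (P3).
x, lam, a = np.linspace(0.0, 5.0, 501), 1.0, 3.0
assert np.all(scad_grad(x, lam, a) <= lam) and np.all(mcp_grad(x, lam, a) <= lam)
assert np.all(scad_grad(x, lam, a) >= np.maximum(lam - x / a, 0.0))
assert np.all(scad_grad(x[x > a * lam], lam, a) == 0.0)
```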

3.2. Oracle properties

Theorem 1

(Weak oracle property) Assume that (E1)–(E5) with ν ≥ 4 and (P1)–(P3) hold. Let q = |S_T| and $\kappa = \|\Gamma_p(S_T)^{-1}\|_\infty$, where $\Gamma_p(S_T)$ is the q × q submatrix of $\Gamma_p$ corresponding to S_T and $\|A\|_\infty$ is the maximum absolute row sum of a matrix A. If

$$\lim_{n\to\infty}\frac{(1+\kappa^2)p^2}{n}=0, \qquad \lim_{n\to\infty}\frac{\log p+\kappa^2\log q}{n\lambda^2}=0, \qquad \lim_{n\to\infty}\frac{\lambda}{\min_{j\in S_T}|\beta_j|}=0$$

and m satisfies

$$\max_{0\le j\le p}|m_j-\mu|=O_P\Big(\frac{1}{\sqrt{n}}\Big),$$

then the oracle least squares estimator (LSE)

$$\hat{\beta}^o=(\hat{\beta}_1^o,\ldots,\hat{\beta}_p^o)^T=\arg\min_{b\in\mathbb{R}^p_o}Q(b;m)$$

is a local minimizer of $L_\lambda^G(b;m)$ with probability tending to 1, where $\mathbb{R}^p_o=\{(x_1,\ldots,x_p)^T\in\mathbb{R}^p: x_j=0,\ j\notin S_T\}$.

Theorem 1 shows that the oracle LSE in (3.5) becomes one of the local minimizers of (2.6), which is often referred to as the weak oracle property (Fan and Li, 2001; Kim et al., 2016) in the linear regression model. Note that the result holds for any $m_j$ that is a $\sqrt{n}$-consistent estimator of μ. For example, we may use the trimmed mean instead of the sample mean when there are some outliers.

The objective function (2.6) is non-convex, so we need a stronger result than the weak oracle property to avoid bad local minimizers (Zhang, 2010a; Kim and Kwon, 2012). The next theorem shows that the oracle LSE is the unique local minimizer, so that it becomes the global minimizer of (2.6).

Theorem 2

(Strong oracle property) Assume that the assumptions of Theorem 1 hold. Let $\rho=\lambda_{\min}(\Gamma_p)$, where $\lambda_{\min}(A)$ is the smallest eigenvalue of a matrix A. If

$$\lim_{n\to\infty}\frac{p^2}{n\rho^2}=0, \qquad \lim_{n\to\infty}\frac{\log p+\kappa^2\log q}{n\rho^2\lambda^2}=0, \qquad \lim_{n\to\infty}\frac{\lambda}{\rho\min_{j\in S_T}|\beta_j|}=0,$$

then $\hat{\beta}^o$ is the unique local minimizer of $L_\lambda^G(b;m)$ with probability tending to 1.

Remark 1

Note that κ and ρ are often assumed to be fixed positive constants. In this case, the results in Theorems 1 and 2 hold when

$$\min_{j\in S_T}|\beta_j|\gg\lambda\gg\sqrt{\frac{\log p}{n}} \qquad\text{and}\qquad n\gg p^2,$$

which are exactly the same as the results in the linear regression model. Here, a ≫ b means that b/a = o(1) as n → ∞.

Remark 2

In the linear regression model, $\rho=\lambda_{\min}(X^TX/n)$ determines the size of the smallest possible non-zero regression coefficient, where X is the design matrix, since we need $\min_j|\beta_j|\gg\lambda\gg\sqrt{\log p/n}$ and $\min_j|\beta_j|\gg\lambda/\rho\gg\sqrt{\log p/(n\rho^4)}$ for the weak and strong oracle properties, respectively. This implies that the smaller ρ is, the larger $\min_j|\beta_j|$ is required to be. In the AR process, $\rho=\lambda_{\min}(\Gamma_p)$ and $\kappa=\|\Gamma_p(S_T)^{-1}\|_\infty$ play the same role since we need $\min_j|\beta_j|\gg\lambda\gg\sqrt{(\log p+\kappa^2\log q)/n}$ and $\min_j|\beta_j|\gg\lambda/\rho\gg\sqrt{(\log p+\kappa^2\log q)/(n\rho^4)}$.

Let $\hat{\beta}^{o,B}$ and $\hat{\beta}^{o,C}$ be the oracle LSEs that correspond to $m=(\bar{y}_0,\bar{y}_1,\ldots,\bar{y}_p)^T$ and $m=(\bar{y},\bar{y},\ldots,\bar{y})^T$, respectively. By the functional central limit theorem,

$$\max_{0\le j\le p}|\bar{y}_j-\mu|\le\frac{2}{n-p}\max_{1\le k\le n}\Big|\sum_{t=1}^{k}(y_t-\mu)\Big|=O_P\Big(\frac{1}{\sqrt{n}}\Big)$$

and $|\bar{y}-\mu|=O_P(1/\sqrt{n})$. Hence, from Theorem 2, we can see that the two penalized estimators are exactly the same as the corresponding oracle LSEs, which is summarized in the next corollary.

Corollary 1

Assume that the assumptions of Theorem 2 hold. Then

$$\lim_{n\to\infty}P\big(\hat{\beta}^{\lambda,B}=\hat{\beta}^{o,B}\big)=1 \qquad\text{and}\qquad \lim_{n\to\infty}P\big(\hat{\beta}^{\lambda,C}=\hat{\beta}^{o,C}\big)=1.$$
Remark 3

From Lemma 2 in the Appendix,

$$\max_{j\in S_F}\big|\hat{\beta}_j^{o,B}-\beta_j\big|=O_P\Big(\kappa\sqrt{\frac{\log q}{n}}\Big) \qquad\text{and}\qquad \max_{j\in S_F}\big|\hat{\beta}_j^{o,C}-\beta_j\big|=O_P\Big(\kappa\sqrt{\frac{\log q}{n}}\Big),$$

which implies

$$\max_{j\in S_F}\big|\hat{\beta}_j^{\lambda,B}-\hat{\beta}_j^{\lambda,C}\big|=O_P\Big(\kappa\sqrt{\frac{\log q}{n}}\Big).$$

Hence the two oracle LSEs are asymptotically equivalent, so that centering the samples does not affect the regression parameter estimation.

3.3. Tuning parameter selection

The most practical and important issue for the penalized estimation is how to select tuning parameters (Nardi and Rinaldo, 2011; Kwon et al., 2017). Some conventional ways can be applied by minimizing validation or cross-validation errors. However, selecting tuning parameters based on prediction errors may produce over-fitted sub-models (Wang et al., 2007, 2009). We propose to use the GIC, which is easy to calculate without using extra independent samples, in order to select the tuning parameters. Given λ > 0, define the GIC as

$$\text{GIC}(\lambda)=\log Q\big(\hat{\beta}^{\lambda};m\big)+\alpha\big\|\hat{\beta}^{\lambda}\big\|_0,$$

where $\|\hat{\beta}^{\lambda}\|_0$ is the number of non-zero entries of $\hat{\beta}^{\lambda}$. The next theorem proves that we can select an optimal tuning parameter by minimizing the GIC.
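A minimal Python sketch of this selection rule (ours; `fit_fn` stands for any penalized fitting routine, for instance the coordinate-descent sketch in Section 2, and the λ grid is a user choice):

```python
import numpy as np

def gic(y, p, beta_hat, alpha):
    """GIC(lambda) = log Q(beta_hat; m) + alpha * ||beta_hat||_0, with m taken
    as the sample mean in every coordinate (the centered version)."""
    y = np.asarray(y, dtype=float)
    n, m = len(y), y.mean()
    X = np.column_stack([y[p - j:n - j] - m for j in range(1, p + 1)])
    resid = (y[p:] - m) - X @ beta_hat
    return np.log(np.sum(resid ** 2)) + alpha * np.count_nonzero(beta_hat)

def select_lambda(y, p, lambdas, fit_fn, alpha=None):
    """Return the lambda in `lambdas` minimizing the GIC, together with all scores."""
    if alpha is None:
        n = len(y)
        alpha = np.log(np.log(p)) * np.log(n) / n    # the alpha used in Section 4
    scores = [gic(y, p, fit_fn(y, p, lam), alpha) for lam in lambdas]
    return lambdas[int(np.argmin(scores))], scores
```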

Theorem 3

Assume that (E1)–(E5) with ν ≥ 4, and (P1)–(P3) hold. If

$$\lim_{n\to\infty}\frac{(1+\kappa^2+\rho^{-2})p^2}{n}=0,\qquad \lim_{n\to\infty}\frac{\sqrt{\log p}+\kappa\sqrt{\log q}+\rho\sqrt{p}}{\rho^2\sqrt{n}\,\min_{j\in S_T}|\beta_j|}=0,\qquad \lim_{n\to\infty}p\alpha=0,$$
$$\lim_{n\to\infty}\frac{p\alpha}{\rho\min_{j\in S_T}|\beta_j|^2}=0,\qquad \lim_{n\to\infty}\frac{\log p+\kappa^2\log q}{\rho\alpha n}=0,$$

then there exists λ0such that

$$\lim_{n\to\infty}P\Big(\inf_{\lambda\in\Lambda_+\cup\Lambda_-}\text{GIC}(\lambda)>\text{GIC}(\lambda_0)\Big)=1,$$

where $\Lambda_+=\{\lambda>0: S_\lambda\supsetneq S_T\}$, $\Lambda_-=\{\lambda>0: S_\lambda\nsupseteq S_T\}$, and $S_\lambda=\{j\in S_F:\hat{\beta}_j^{\lambda}\neq 0\}$ for λ > 0.

Remark 4

When κ and ρ are positive constants, the result in Theorem 3 holds when

$$\min_{j\in S_T}|\beta_j|\gg\sqrt{p\alpha}\gg\sqrt{\frac{p\log p}{n}} \qquad\text{and}\qquad n\gg p^2.$$

For example, α = log(log p) log n/n satisfies the condition as suggested in Kwon et al. (2017) and Na (2017) for the adaptive LASSO and GIC, respectively.

4. Numerical studies

We consider two examples to show that the theoretical results hold with finite samples:

Example 1

S_T = {1, . . . , q}, $\beta_j^*=(c_0/j)\,I(j\in S_T)$, μ = 0, and the $\varepsilon_t$ are iid samples from the standard normal distribution, where $c_0=0.9/(\sum_{j=1}^{q}1/j)$.

Example 2

S_T = {j_1, . . . , j_q} is a random subset of S_F and $\beta_j^*=\sum_{k=1}^{q}(c_0/k)\,I(j=j_k)$, where $c_0$ is as in Example 1.

In each example, we set n = 100 × 2^k, k ∈ {0, 1, . . . , 7}, p = 10[n^{1/3}], and q = [n^{1/4}], where [x] denotes the closest integer to x.

We consider two non-convex penalties for the centered samples, the SCAD and the MCP, and compare their finite sample performance to some reference methods: four GICs (AIC, HQC, BIC, and GIC6) in Na (2017) and the adaptive LASSO (ALASSO) in Kwon et al. (2017). Tuning parameters are selected by the GIC in (3.8) with α = log(log p) log n/n for all the penalized estimators. We report four measures: sensitivity (SEN), specificity (SPC), selection accuracy (SA), and prediction error (PE) computed from independent test samples $y_t^{\text{test}}$, t = 1, . . . , n. The measures are defined as $|\hat{S}_T\cap S_T|/|S_T|$, $|\hat{S}_T^c\cap S_T^c|/|S_T^c|$, $I(\hat{S}_T=S_T)$, and $\sum_{t=1}^{n}(y_t^{\text{test}}-\hat{y}_t^{\text{test}})^2/n$, respectively, where $\hat{S}_T$ is the index set of non-zero parameters estimated by each method and $\hat{y}_t^{\text{test}}$ is the corresponding predicted value. We repeat each simulation 200 times and summarize the results in Tables 1 and 2 and Figure 1; all standard errors are less than 0.03 and are omitted.
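The following Python sketch (ours) reproduces the Example 1 coefficient pattern, under the assumption S_T = {1, . . . , q}, and computes the three selection measures for a given estimate; it is only meant to make the definitions concrete.

```python
import numpy as np

def example1_coefficients(p, q):
    """beta_j = c0 / j for j = 1, ..., q and 0 otherwise, with c0 = 0.9 / sum_{j<=q} 1/j."""
    c0 = 0.9 / np.sum(1.0 / np.arange(1, q + 1))
    beta = np.zeros(p)
    beta[:q] = c0 / np.arange(1, q + 1)
    return beta

def selection_measures(beta_hat, beta_true):
    """Sensitivity, specificity, and selection accuracy of the estimated support."""
    est, tru = beta_hat != 0, beta_true != 0
    sen = est[tru].mean()                 # |S_hat & S_T| / |S_T|
    spc = (~est[~tru]).mean()             # |S_hat^c & S_T^c| / |S_T^c|
    sa = float(np.array_equal(est, tru))  # I(S_hat = S_T)
    return sen, spc, sa

# Settings of Section 4 for n = 400: p = 10[n^(1/3)] = 70 and q = [n^(1/4)] = 4.
n = 400
p, q = 10 * round(n ** (1 / 3)), round(n ** (1 / 4))
beta_true = example1_coefficients(p, q)
```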

The PEs are quite similar for all the methods, but the selection performances are significantly different. The AIC and HQC have better SEN but worse SPC than the others, which results in a very low SA. This shows that the AIC and HQC are not selection consistent (Na, 2017) and tend to over-fit. The SEN, SPC, and SA increase to 1 as the sample size increases for all the penalized estimators as well as for the BIC and GIC6. The simulation results confirm that the theoretical properties provided hold for finite samples.

5. Concluding remarks

We present some asymptotic properties of the non-convex penalized estimators for the AR process when the maximal order is large and the minimal signal size is small. The results show that the non-convex penalized estimation can be used for parameter estimation and model identification simultaneously. This paper is intended to provide a theoretical starting point for future studies on other complex time series models.

A referee pointed out that assuming an increasing p is unusual in practice because the AR order does not necessarily increase with the sample size. First of all, we completely agree that many, or almost all, AR processes may have a fixed or finite model order, so that p need not be assumed to increase or vary with the sample size. However, we think that the larger the sample size, the more parameters become statistically significant, which implies that a fixed choice of small p may prevent us from finding important lags. In this sense, the expression "p → ∞ as n → ∞" does not imply the existence of such an order-increasing model, but rather that we should try an order as large as possible, considering the sample size. Besides p → ∞, the expression "min_j |βj| → 0" can be understood in a similar way.

Acknowledgements

This work was supported by a Kyonggi University Research Grant 2015-099.

Appendix

Let $x_{ij}=y_{i+p-j}-m_j$ and $z_{ij}=y_{i+p-j}-\mu$ for i = 1, . . . , n − p and j = 1, . . . , p. Let $X=(x_{ij})$, $Z=(z_{ij})$, $y=(y_{p+1},\ldots,y_n)^T$, and $\varepsilon=(\varepsilon_{p+1},\ldots,\varepsilon_n)^T$. For j = 1, . . . , p, let $X_j$ and $Z_j$ be the jth column vectors of X and Z, respectively. For a set $S=\{i_1,\ldots,i_k\}\subset S_F$ with $1\le i_1<i_2<\cdots<i_k\le p$, $X_S$ denotes the (n − p) × k matrix with jth column vector $X_{i_j}$, and $Z_S$ is defined similarly. Given a p-dimensional vector $v=(v_1,\ldots,v_p)^T$ and a set $S=\{i_1,\ldots,i_k\}$, $v_S$ denotes the k-dimensional subvector $(v_{i_1},\ldots,v_{i_k})^T$ of v. For a matrix A, $\|A\|_2$ and $\|A\|_\infty$ are the induced matrix norms corresponding to the Euclidean and maximum norms for vectors, respectively. For a set S, |S| denotes the cardinality of S.

Lemma 1

Assume that (2.1), (3.1), (E1)–(E5) with ν ≥ 4, and (3.4) hold. If $\lim_{n\to\infty}p^2/n=0$, then

$$\max_{j\in S_T}\big|X_j^T\varepsilon\big|=O_P\big(\sqrt{n\log q}\big),\qquad \max_{j\in S_F}\big|X_j^T\varepsilon\big|=O_P\big(\sqrt{n\log p}\big),$$
$$\max_{j\in S_F}\big|X_j^Tu\big|=O_P\big(\sqrt{n}\big),\qquad \Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_\infty=O_P\Big(\frac{p}{\sqrt{n}}\Big),$$

where u is the (n − p)-dimensional vector with all entries equal to 1.

Proof

First, we can show that there exists a positive constant M such that

$$\lim_{n\to\infty}P\Big(\max_{j\in S}\big|Z_j^T\varepsilon\big|>M\sqrt{n\log|S|}\Big)=0$$

for every non-empty subset $S\subset S_F$, in a similar way to the proof of Lemma 1 in Kwon et al. (2017), because $\{z_{ij}\}$ is a stationary predictable process and $\{\varepsilon_i\}$ is a sequence of stationary martingale differences with respect to the filtration in (E3). Since $X_j-Z_j=(\mu-m_j)u$ for all $j\in S_F$ and $u^T\varepsilon=O_P(\sqrt{n})$, (3.4) implies that

$$\sup_{S\subset S_F}\Big|\max_{j\in S}\big|X_j^T\varepsilon\big|-\max_{j\in S}\big|Z_j^T\varepsilon\big|\Big|\le\max_{j\in S_F}\big|(X_j-Z_j)^T\varepsilon\big|\le\max_{j\in S_F}|m_j-\mu|\times\big|u^T\varepsilon\big|=O_P(1).$$

Therefore, for a non-empty subset $S\subset S_F$,

$$\max_{j\in S}\big|X_j^T\varepsilon\big|=\max_{j\in S}\big|Z_j^T\varepsilon\big|+O_P(1)=O_P\big(\sqrt{n\log|S|}\big)$$

and hence (A.1) and (A.2) hold.

Next, note that {yt, t ∈ ℤ} is ν-strong stable under assumptions (E1)–(E5). Applying the functional central limit theorem to {yt, t ∈ ℤ} yields

$$\max_{j\in S_F}\big|Z_j^Tu\big|=O_P\big(\sqrt{n}\big)$$

and hence

$$\max_{j\in S_F}\big|X_j^Tu\big|=\max_{j\in S_F}\big|Z_j^Tu-(m_j-\mu)u^Tu\big|\le\max_{j\in S_F}\big|Z_j^Tu\big|+(n-p)\max_{j\in S_F}|m_j-\mu|=O_P\big(\sqrt{n}\big)$$

by (3.4).

Lastly, we can extend Lemma 1 in Kwon et al. (2017) to the case of Z and obtain

$$\Big\|\frac{Z^TZ}{n-p}-\Gamma_p\Big\|_\infty=O_P\Big(\frac{p}{\sqrt{n}}\Big)$$

due to the ν-strong stability of {yt, t ∈ ℤ}. Also, for any $i,j\in S_F$,

$$\big|X_i^TX_j-Z_i^TZ_j\big|=\big|(\mu-m_i)(\mu-m_j)u^Tu+(\mu-m_i)u^TZ_j+(\mu-m_j)u^TZ_i\big|\le(n-p)\Big(\max_{j\in S_F}|m_j-\mu|\Big)^2+2\max_{j\in S_F}|m_j-\mu|\times\max_{j\in S_F}\big|u^TZ_j\big|.$$

Therefore, from (A.5)–(A.8) and (3.4), we can deduce that

$$\Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_\infty\le\Big\|\frac{Z^TZ}{n-p}-\Gamma_p\Big\|_\infty+\max_{1\le i\le p}\frac{\sum_{j=1}^{p}\big|X_i^TX_j-Z_i^TZ_j\big|}{n-p}=O_P\Big(\frac{p}{\sqrt{n}}\Big).$$

Lemma 2

Assume that the assumptions of Lemma 1 hold and that limn→∞κ2p2/n = 0. Then,

$$\max_{j\in S_F}\big|\hat{\beta}_j^o-\beta_j\big|=O_P\Big(\kappa\sqrt{\frac{\log q}{n}}\Big).$$
Proof

First, $\hat{\beta}_{S_T}^o$ can be written as

$$\hat{\beta}_{S_T}^o=(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^T(y-m_0u)=\beta_{S_T}+(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^T\varepsilon+(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^Tu\times\Big\{\sum_{j\in S_T}(m_j-\mu)\beta_j-(m_0-\mu)\Big\},$$

provided that $X_{S_T}^TX_{S_T}$ is non-singular. Therefore,

$$\max_{j\in S_F}\big|\hat{\beta}_j^o-\beta_j\big|=\max_{j\in S_T}\big|\hat{\beta}_j^o-\beta_j\big|\le\big\|(X_{S_T}^TX_{S_T})^{-1}\big\|_\infty\times\Big\{\max_{j\in S_T}\big|X_j^T\varepsilon\big|+\max_{j\in S_F}\big|X_j^Tu\big|\times\max_{0\le j\le p}|m_j-\mu|\times\Big(1+\sum_{j\in S_T}|\beta_j|\Big)\Big\}.$$

If $X_{S_T}$ satisfies

$$\Big\|\frac{X_{S_T}^TX_{S_T}}{n-p}-\Gamma_p(S_T)\Big\|_\infty\times\big\|\Gamma_p(S_T)^{-1}\big\|_\infty<\frac{1}{2},$$

then $X_{S_T}^TX_{S_T}/(n-p)$ is non-singular and

$$\Big\|\Big(\frac{X_{S_T}^TX_{S_T}}{n-p}\Big)^{-1}\Big\|_\infty\le\big\|\Gamma_p(S_T)^{-1}\big\|_\infty+\Big\|\Big(\frac{X_{S_T}^TX_{S_T}}{n-p}\Big)^{-1}-\Gamma_p(S_T)^{-1}\Big\|_\infty\le\big\|\Gamma_p(S_T)^{-1}\big\|_\infty+\frac{\big\|X_{S_T}^TX_{S_T}/(n-p)-\Gamma_p(S_T)\big\|_\infty\,\big\|\Gamma_p(S_T)^{-1}\big\|_\infty^2}{1-\big\|X_{S_T}^TX_{S_T}/(n-p)-\Gamma_p(S_T)\big\|_\infty\times\big\|\Gamma_p(S_T)^{-1}\big\|_\infty}\le2\big\|\Gamma_p(S_T)^{-1}\big\|_\infty.$$

Since $\lim_{n\to\infty}\kappa^2p^2/n=0$, the probability that (A.10) holds tends to 1 as n → ∞ by (A.4). Consequently,

$$\big\|(X_{S_T}^TX_{S_T})^{-1}\big\|_\infty=O_P\Big(\frac{\kappa}{n}\Big).$$

According to (A.1), (A.3), and (3.4),

$$\max_{j\in S_T}\big|X_j^T\varepsilon\big|+\max_{j\in S_F}\big|X_j^Tu\big|\times\max_{0\le j\le p}|m_j-\mu|\times\Big(1+\sum_{j\in S_T}|\beta_j|\Big)=O_P\big(\sqrt{n\log q}\big)+O_P\big(\sqrt{q}\big)$$

and this completes the proof.

Now, let us prove the main theorems in Section 3.

Proof of Theorem 1

Since $\hat{\beta}_{S_T}^o$ satisfies the normal equations $X_{S_T}^TX_{S_T}\hat{\beta}_{S_T}^o=X_{S_T}^T(y-m_0u)$ and $\hat{\beta}_{S_F\cap S_T^c}^o=0$, we have

$$X_{S_T}^T(y-m_0u-X\hat{\beta}^o)=0.$$

Therefore, if $\min_{j\in S_T}|\hat{\beta}_j^o|>a\lambda$ and $\max_{j\notin S_T}|X_j^T(y-m_0u-X\hat{\beta}^o)|\le n\lambda/2$, then $\hat{\beta}^o$ becomes a local minimizer of $L_\lambda^G(b;m)$ by the KKT conditions and (P1)–(P3).

From (3.3) and Lemma 2, we have

$$\lim_{n\to\infty}P\Big(\min_{j\in S_T}\big|\hat{\beta}_j^o\big|>a\lambda\Big)\ge\lim_{n\to\infty}P\Big(\min_{j\in S_T}|\beta_j|-\max_{j\in S_F}\big|\hat{\beta}_j^o-\beta_j\big|>a\lambda\Big)=1.$$

By using the results of Lemmas 1 and 2 and the boundedness of $\|\Gamma_p\|_\infty$, we obtain that

$$\max_{j\notin S_T}\big|X_j^T(y-m_0u-X\hat{\beta}^o)\big|\le\max_{j\notin S_T}\big|X_j^TX(\beta-\hat{\beta}^o)\big|+\max_{j\notin S_T}\big|X_j^T\varepsilon\big|+\max_{j\notin S_T}\big|X_j^T\{(Z-X)\beta+(\mu-m_0)u\}\big|\le\big\|X^TX\big\|_\infty\max_{j\in S_F}\big|\hat{\beta}_j^o-\beta_j\big|+\max_{j\in S_F}\big|X_j^T\varepsilon\big|+\max_{j\in S_F}\big|X_j^Tu\big|\max_{0\le j\le p}|m_j-\mu|\Big(1+\sum_{j\in S_T}|\beta_j|\Big)=O_P\big(\kappa\sqrt{n\log q}\big)+O_P\big(\sqrt{n\log p}\big)+O_P\big(\sqrt{q}\big)$$

and hence (3.3) implies that

$$\lim_{n\to\infty}P\Big(\max_{j\notin S_T}\big|X_j^T(y-m_0u-X\hat{\beta}^o)\big|\le\frac{n\lambda}{2}\Big)=1.$$

Therefore

$$\lim_{n\to\infty}P\big(\hat{\beta}^o\in\mathcal{A}_\lambda\big)\ge\lim_{n\to\infty}P\Big(\min_{j\in S_T}\big|\hat{\beta}_j^o\big|>a\lambda\Big)+\lim_{n\to\infty}P\Big(\max_{j\notin S_T}\big|X_j^T(y-m_0u-X\hat{\beta}^o)\big|\le\frac{n\lambda}{2}\Big)-1=1,$$

where $\mathcal{A}_\lambda$ denotes the set of all local minimizers of $L_\lambda^G(b;m)$.

Proof of Theorem 2

For given λ > 0 and c > 0, let $U_c$ be the set of b ∈ ℝp such that

$$\frac{\max_{b_j=0}\big|X_j^T(y-m_0u-Xb)\big|}{n-p}<\inf_{0<x<a\lambda}\big(cx+\nabla J_\lambda(x)\big)\le\sup_{0<x<a\lambda}\big(cx+\nabla J_\lambda(x)\big)<c\min_{b_j\neq0}|b_j|.$$

Let ρ̂ be the smallest eigenvalue of $X^TX/(n-p)$. By Theorem 1 of Kim and Kwon (2012), $\hat{\beta}^o\in U_{\hat{\rho}}$ with ρ̂ > 0 is a sufficient condition for $\hat{\beta}^o$ to be the unique local minimizer of $L_\lambda^G(b;m)$. Also, since $\lim_{n\to\infty}P(\hat{\beta}^o\in\mathcal{A}_\lambda)=1$ by Theorem 1, it is enough to show

$$\lim_{n\to\infty}P\big(\hat{\beta}^o\in U_{\hat{\rho}},\,\hat{\rho}>0\big)=1$$

in order to obtain the result of Theorem 2.

Using the condition $\lim_{n\to\infty}p^2/(n\rho^2)=0$, (A.4), and the symmetry of $X^TX/(n-p)-\Gamma_p$, we have

$$\lim_{n\to\infty}P\Big(\Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_2>\frac{\rho}{2}\Big)\le\lim_{n\to\infty}P\Big(\Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_\infty>\frac{\rho}{2}\Big)=0.$$

And if $\|X^TX/(n-p)-\Gamma_p\|_2\le\rho/2$, then ρ̂ ≥ ρ/2 > 0 and hence $U_{\rho/2}\subset U_{\hat{\rho}}$. Therefore, we have

$$\lim_{n\to\infty}P\big(\hat{\beta}^o\in U_{\hat{\rho}},\,\hat{\rho}>0\big)\ge\lim_{n\to\infty}P\Big(\hat{\beta}^o\in U_{\hat{\rho}},\,\hat{\rho}>0,\,\Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_2\le\frac{\rho}{2}\Big)\ge\lim_{n\to\infty}P\Big(\hat{\beta}^o\in U_{\rho/2},\,\Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_2\le\frac{\rho}{2}\Big)\ge\lim_{n\to\infty}P\big(\hat{\beta}^o\in U_{\rho/2}\big).$$

Note that $\lim_{n\to\infty}P\big(\min_{j\in S_T}|\beta_j|>2\max_{j\in S_F}|\hat{\beta}_j^o-\beta_j|\big)=1$ by (3.3) and Lemma 2. In this case, $\min_{j\in S_T}|\hat{\beta}_j^o|\ge\min_{j\in S_T}|\beta_j|-\max_{j\in S_F}|\hat{\beta}_j^o-\beta_j|>0$ and hence $\{j\in S_F:\hat{\beta}_j^o\neq0\}=S_T$. Since $\inf_{0<x<a\lambda}(\rho x/2+\nabla J_\lambda(x))\ge\min(\lambda,a\rho\lambda/2)$ and $\sup_{0<x<a\lambda}(x+2\nabla J_\lambda(x)/\rho)\le a\lambda+2\lambda/\rho$ by (P3), we have

$$\lim_{n\to\infty}P\big(\hat{\beta}^o\in U_{\rho/2}\big)\ge\lim_{n\to\infty}P\Big(\frac{\max_{j\notin S_T}\big|X_j^T(y-m_0u-X\hat{\beta}^o)\big|}{n-p}<\min\Big(\lambda,\frac{a\rho\lambda}{2}\Big)\Big)+\lim_{n\to\infty}P\Big(\min_{j\in S_T}\big|\hat{\beta}_j^o\big|>a\lambda+\frac{2\lambda}{\rho},\,\min_{j\in S_T}|\beta_j|>2\max_{j\in S_F}\big|\hat{\beta}_j^o-\beta_j\big|\Big)-1=1$$

by (3.6) and (A.11). Hence, (A.12) holds.

Lemma 3

For a given $S\subset S_F$, let $\tilde{\beta}^S=\arg\min_{b\in\mathbb{R}^p_S}Q(b;m)$ and define

$$\widetilde{\text{GIC}}(S)=\log Q\big(\tilde{\beta}^S;m\big)+\alpha|S|,$$

where $\mathbb{R}^p_S=\{(x_1,\ldots,x_p)^T\in\mathbb{R}^p:x_j=0,\,j\notin S\}$. Under the assumptions of Theorem 3,

$$\lim_{n\to\infty}P\Big(\inf_{S\neq S_T}\widetilde{\text{GIC}}(S)>\widetilde{\text{GIC}}(S_T)\Big)=1.$$
Proof

First, note that ρ is positive, $\|\Gamma_p\|_2$ is bounded, and

$$\lim_{n\to\infty}P\Big(\frac{\rho}{2}\le\hat{\rho}\le\Big\|\frac{X^TX}{n-p}\Big\|_2\le2\|\Gamma_p\|_2\Big)\ge\lim_{n\to\infty}P\Big(\Big\|\frac{X^TX}{n-p}-\Gamma_p\Big\|_2\le\frac{\rho}{2}\Big)=1$$

by (A.13). Given a set $S\subset S_F$, we can show that $\|Z_S^T\varepsilon\|_2=O_P(\sqrt{n|S|})$ by extending Lemma 1 of Na (2017). Since $\max_{j\in S_F}|X_j^T\varepsilon-Z_j^T\varepsilon|=\max_{j\in S_F}|m_j-\mu|\,|u^T\varepsilon|=O_P(1)$,

$$\big\|X_S^T\varepsilon\big\|_2\le\big\|Z_S^T\varepsilon\big\|_2+\sqrt{|S|}\max_{j\in S_F}\big|X_j^T\varepsilon-Z_j^T\varepsilon\big|=O_P\big(\sqrt{n|S|}\big).$$

From (A.6), we have

$$\big\|X_S^Tu\big\|_2\le\sqrt{|S|}\max_{j\in S_F}\big|X_j^Tu\big|=O_P\big(\sqrt{n|S|}\big).$$

Therefore, combining (A.15)–(A.17) and (3.4) gives that

$$\big\|\tilde{\beta}^{S_F}-\beta\big\|_2\,I\Big(\hat{\rho}\ge\frac{\rho}{2}\Big)\le\big\|(X^TX)^{-1}\big\|_2\Big\{\big\|X^T\varepsilon\big\|_2+\big\|X^Tu\big\|_2\max_{0\le j\le p}|m_j-\mu|\Big(1+\sum_{j\in S_T}|\beta_j|\Big)\Big\}=O_P\Big(\sqrt{\frac{p}{n\rho^2}}\Big)$$

and consequently, we have

$$\big\|\tilde{\beta}^{S_F}-\beta\big\|_2=O_P\Big(\sqrt{\frac{p}{n\rho^2}}\Big)\qquad\text{and}\qquad\frac{Q(\tilde{\beta}^{S_F};m)}{n-p}=\sigma_\varepsilon^2+o_P(1).$$

For a while, we assume that $\min_{j\in S_T}|\beta_j|>2\max_{j\in S_F}|\tilde{\beta}_j^{S_F}-\beta_j|$, $\hat{\rho}\ge\rho/2$, and $Q(\tilde{\beta}^{S_F};m)\ge(n-p)\sigma_\varepsilon^2/2$. Then, for any $S\nsupseteq S_T$,

$$\log Q\big(\tilde{\beta}^S;m\big)-\log Q\big(\tilde{\beta}^{S_F};m\big)\ge\min\Big\{\frac{Q(\tilde{\beta}^S;m)-Q(\tilde{\beta}^{S_F};m)}{2Q(\tilde{\beta}^{S_F};m)},\frac{1}{2}\Big\}\ge\min\Big\{\frac{\hat{\rho}\,\|\tilde{\beta}^{S_F}-\tilde{\beta}^S\|_2^2}{2\sigma_\varepsilon^2},\frac{1}{2}\Big\}\ge\min\Big\{\frac{\hat{\rho}\min_{j\in S_T}|\tilde{\beta}_j^{S_F}|^2}{2\sigma_\varepsilon^2},\frac{1}{2}\Big\}\ge\min\Big\{\frac{\rho\min_{j\in S_T}|\beta_j|^2}{4\sigma_\varepsilon^2},\frac{1}{2}\Big\},$$

because log(1 + x) ≥ min(x/2, 1/2) for all x > 0. Therefore, we obtain that

$$\lim_{n\to\infty}P\Big(\inf_{S\nsupseteq S_T}\widetilde{\text{GIC}}(S)>\widetilde{\text{GIC}}(S_F)\Big)\ge\lim_{n\to\infty}P\Big(\min\Big\{\frac{\rho\min_{j\in S_T}|\beta_j|^2}{4\sigma_\varepsilon^2},\frac{1}{2}\Big\}>p\alpha\Big)=1$$

by using (3.10), (3.11), (A.15), and (A.18).

Since log(1 + x) ≤ x for all x > −1 and $Q(\tilde{\beta}^S;m)\le Q(\tilde{\beta}^{S_T};m)$ for all $S\supseteq S_T$, we have

$$\log Q\big(\tilde{\beta}^{S_T};m\big)-\log Q\big(\tilde{\beta}^S;m\big)\le\frac{Q(\tilde{\beta}^{S_T};m)-Q(\tilde{\beta}^S;m)}{Q(\tilde{\beta}^S;m)}\le\frac{\big\|(X^TX)^{-1}\big\|_2\,\big\|X_{S\cap S_T^c}^T\big(I_{n-p}-X_{S_T}(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^T\big)(y-m_0u)\big\|_2^2}{2\,Q(\tilde{\beta}^{S_T};m)}$$

for all sets $S\supsetneq S_T$, where $I_{n-p}$ is the (n − p)-dimensional identity matrix. Also,

$$\sup_{S\supsetneq S_T}\frac{\big\|X_{S\cap S_T^c}^T\big(I_{n-p}-X_{S_T}(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^T\big)(y-m_0u)\big\|_2}{\sqrt{|S\cap S_T^c|}}\le\sup_{S\supsetneq S_T}\big\|X_{S\cap S_T^c}^T\big(I_{n-p}-X_{S_T}(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^T\big)(y-m_0u-X_{S_T}\beta_{S_T})\big\|_\infty\le\big\|X^T\varepsilon\big\|_\infty+\big\|X^TX_{S_T}(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^T\varepsilon\big\|_\infty+\Big(\big\|X^Tu\big\|_\infty+\big\|X^TX_{S_T}(X_{S_T}^TX_{S_T})^{-1}X_{S_T}^Tu\big\|_\infty\Big)\max_{0\le j\le p}|m_j-\mu|\Big(1+\sum_{j\in S_T}|\beta_j|\Big)=O_P\big(\sqrt{n\log p}\big)+O_P\big(\kappa\sqrt{n\log q}\big)+O_P\big(\sqrt{q}\,\kappa\big).$$

Hence, (3.11), (A.15), and (A.18) imply that

$$\lim_{n\to\infty}P\Big(\inf_{S\supsetneq S_T}\frac{\widetilde{\text{GIC}}(S)-\widetilde{\text{GIC}}(S_T)}{|S\cap S_T^c|}>0\Big)=1.$$

Therefore, we obtain the result (A.14) by (A.19) and (A.20).

Proof of Theorem 3

By Theorems 1 and 2, there exists a sequence λ0 such that

$$\lim_{n\to\infty}P\big(\hat{\beta}^{\lambda_0}=\hat{\beta}^o,\,S_{\lambda_0}=S_T\big)=1.$$

Since $\text{GIC}(\lambda)\ge\widetilde{\text{GIC}}(S_\lambda)$ for all λ > 0 and $\text{GIC}(\lambda_0)=\widetilde{\text{GIC}}(S_T)$ when $\hat{\beta}^{\lambda_0}=\hat{\beta}^o$ and $S_{\lambda_0}=S_T$, we have

$$\lim_{n\to\infty}P\Big(\inf_{\lambda\in\Lambda_+\cup\Lambda_-}\text{GIC}(\lambda)>\text{GIC}(\lambda_0)\Big)\ge\lim_{n\to\infty}P\Big(\inf_{\lambda\in\Lambda_+\cup\Lambda_-}\text{GIC}(\lambda)>\text{GIC}(\lambda_0),\,\hat{\beta}^{\lambda_0}=\hat{\beta}^o,\,S_{\lambda_0}=S_T\Big)\ge\lim_{n\to\infty}P\Big(\inf_{\lambda\in\Lambda_+\cup\Lambda_-}\widetilde{\text{GIC}}(S_\lambda)>\widetilde{\text{GIC}}(S_T),\,\hat{\beta}^{\lambda_0}=\hat{\beta}^o,\,S_{\lambda_0}=S_T\Big)\ge\lim_{n\to\infty}P\Big(\inf_{S\neq S_T}\widetilde{\text{GIC}}(S)>\widetilde{\text{GIC}}(S_T)\Big)-\lim_{n\to\infty}P\big(\hat{\beta}^{\lambda_0}\neq\hat{\beta}^o\ \text{or}\ S_{\lambda_0}\neq S_T\big)=1$$

by Lemma 3.

Figures
Fig. 1. Sensitivity, specificity, selection accuracy, and prediction error in Example 1 (upper) and Example 2 (lower). AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty.
TABLES

Table 1

Simulation results from Example 1

Measure      n      p    AIC    HQC    BIC   GIC6  ALASSO   SCAD    MCP

SEN        100     50  0.850  0.770  0.700  0.630   0.650  0.650  0.650
           200     60  0.898  0.805  0.715  0.655   0.670  0.668  0.668
           400     70  0.978  0.952  0.898  0.850   0.868  0.878  0.860
           800     90  0.994  0.972  0.934  0.872   0.908  0.908  0.908
          1600    120  0.998  0.992  0.973  0.925   0.955  0.955  0.955
          3200    150  1.000  0.995  0.974  0.925   0.971  0.971  0.970
          6400    190  1.000  1.000  0.996  0.971   0.993  0.993  0.993
         12800    230  1.000  1.000  1.000  0.992   0.997  0.997  0.997

SPC        100     50  0.784  0.909  0.968  0.986   0.982  0.982  0.982
           200     60  0.819  0.941  0.983  0.995   0.990  0.990  0.990
           400     70  0.837  0.947  0.992  0.998   0.996  0.996  0.996
           800     90  0.838  0.955  0.994  0.999   0.997  0.996  0.997
          1600    120  0.828  0.959  0.997  0.999   0.998  0.998  0.998
          3200    150  0.844  0.967  0.998  1.000   0.999  0.999  0.999
          6400    190  0.840  0.967  0.998  1.000   1.000  1.000  0.999
         12800    230  0.869  0.972  0.998  1.000   1.000  1.000  1.000

SA         100     50  0.000  0.040  0.070  0.050   0.080  0.080  0.080
           200     60  0.000  0.030  0.070  0.070   0.060  0.060  0.060
           400     70  0.000  0.120  0.400  0.370   0.420  0.430  0.410
           800     90  0.000  0.110  0.510  0.390   0.460  0.470  0.470
          1600    120  0.000  0.060  0.620  0.490   0.620  0.620  0.630
          3200    150  0.000  0.080  0.600  0.390   0.690  0.700  0.690
          6400    190  0.000  0.050  0.660  0.740   0.860  0.860  0.850
         12800    230  0.000  0.020  0.730  0.890   0.900  0.900  0.900

PE         100     50  1.591  1.388  1.312  1.201   1.260  1.265  1.267
           200     60  1.198  1.126  1.098  1.093   1.122  1.123  1.120
           400     70  1.116  1.074  1.047  1.044   1.056  1.054  1.056
           800     90  1.067  1.034  1.018  1.020   1.026  1.027  1.026
          1600    120  1.038  1.015  1.002  1.004   1.007  1.007  1.007
          3200    150  1.024  1.009  1.002  1.004   1.005  1.005  1.005
          6400    190  1.018  1.008  1.003  1.004   1.004  1.004  1.004
         12800    230  1.009  1.003  1.000  1.000   1.001  1.001  1.001

AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty; SEN = sensitivity; SPC = specificity; SA = selection accuracy; PE = prediction error.


Table 2

Simulation results from Example 2

Measure      n      p    AIC    HQC    BIC   GIC6  ALASSO   SCAD    MCP

SEN        100     50  0.877  0.830  0.747  0.663   0.577  0.563  0.563
           200     60  0.910  0.835  0.718  0.655   0.602  0.598  0.598
           400     70  0.982  0.970  0.940  0.902   0.862  0.865  0.865
           800     90  0.996  0.986  0.966  0.920   0.928  0.924  0.918
          1600    120  0.998  0.993  0.978  0.948   0.967  0.968  0.967
          3200    150  1.000  1.000  0.991  0.958   0.976  0.978  0.976
          6400    190  1.000  1.000  1.000  0.990   0.997  0.997  0.997
         12800    230  1.000  1.000  1.000  0.993   0.998  0.998  0.998

SPC        100     50  0.800  0.897  0.961  0.980   0.985  0.986  0.985
           200     60  0.831  0.943  0.976  0.990   0.989  0.989  0.989
           400     70  0.850  0.946  0.984  0.997   0.993  0.992  0.992
           800     90  0.844  0.960  0.993  0.998   0.995  0.995  0.995
          1600    120  0.834  0.957  0.996  1.000   0.998  0.998  0.998
          3200    150  0.848  0.967  0.996  1.000   0.998  0.998  0.998
          6400    190  0.843  0.964  0.997  1.000   0.999  0.999  0.999
         12800    230  0.853  0.969  0.998  1.000   0.999  0.999  0.999

SA         100     50  0.000  0.010  0.070  0.130   0.130  0.130  0.130
           200     60  0.000  0.060  0.080  0.060   0.070  0.070  0.070
           400     70  0.000  0.070  0.310  0.550   0.330  0.310  0.310
           800     90  0.000  0.070  0.540  0.560   0.520  0.510  0.490
          1600    120  0.000  0.020  0.560  0.650   0.610  0.620  0.620
          3200    150  0.000  0.010  0.590  0.680   0.600  0.610  0.600
          6400    190  0.000  0.020  0.560  0.880   0.780  0.780  0.780
         12800    230  0.000  0.000  0.690  0.910   0.820  0.800  0.830

PE         100     50  1.402  1.282  1.210  1.191   1.210  1.218  1.214
           200     60  1.211  1.128  1.110  1.106   1.133  1.134  1.132
           400     70  1.096  1.055  1.034  1.026   1.050  1.048  1.047
           800     90  1.070  1.034  1.019  1.021   1.032  1.033  1.032
          1600    120  1.045  1.020  1.007  1.006   1.013  1.013  1.012
          3200    150  1.028  1.013  1.006  1.006   1.010  1.010  1.010
          6400    190  1.019  1.010  1.004  1.004   1.006  1.006  1.006
         12800    230  1.010  1.004  1.001  1.001   1.002  1.002  1.002

AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty; SEN = sensitivity; SPC = specificity; SA = selection accuracy; PE = prediction error.


References
  1. Akaike, H (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics. 21, 243-247.
  2. Akaike, H (1973). Information theory and an extension of the maximum likelihood principle. Proceedings of the 2nd International Symposium on Information Theory, pp. 267-281.
  3. Akaike, H (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika. 66, 237-242.
  4. Bollerslev, T (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics. 31, 307-327.
  5. Brockwell, PJ, and Davis, RA (2006). Time Series: Theory and Methods. New York: Springer.
  6. Chen, C (1999). Subset selection of autoregressive time series models. Journal of Forecasting. 18, 505-516.
  7. Claeskens, G, Croux, C, and Van Kerckhoven, J (2007). Prediction focused model selection for autoregressive models. The Australian and New Zealand Journal of Statistics. 49, 359-379.
  8. Claeskens, G, and Hjort, NL (2003). The focussed information criterion. Journal of the American Statistical Association. 98, 900-916.
  9. Fan, J, and Li, R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 96, 1348-1360.
  10. Fan, J, and Peng, H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics. 32, 928-961.
  11. Friedman, J, Hastie, T, Hofling, H, and Tibshirani, R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics. 1, 302-332.
  12. Hannan, EJ (1980). The estimation of the order of an ARMA process. The Annals of Statistics. 8, 1071-1081.
  13. Hannan, EJ, and Quinn, BG (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B. 41, 190-195.
  14. Huang, J, Horowitz, JL, and Ma, S (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics. 36, 587-613.
  15. Kim, Y, Choi, H, and Oh, H (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 103, 1656-1673.
  16. Kim, Y, Jeon, JJ, and Han, S (2016). A necessary condition for the strong oracle property. Scandinavian Journal of Statistics. 43, 610-624.
  17. Kim, Y, and Kwon, S (2012). Global optimality of nonconvex penalized estimators. Biometrika. 99, 315-325.
  18. Kim, Y, Kwon, S, and Choi, H (2012). Consistent model selection criteria on high dimensions. Journal of Machine Learning Research. 13, 1037-1057.
  19. Kwon, S, and Kim, Y (2012). Large sample properties of the SCAD-penalized maximum likelihood estimation on high dimensions. Statistica Sinica. 22, 629-653.
  20. Kwon, S, Lee, S, and Na, O (2017). Tuning parameter selection for the adaptive lasso in the autoregressive model. Journal of the Korean Statistical Society. 46, 285-297.
  21. Kwon, S, Oh, S, and Lee, Y (2016). The use of random-effect models for high-dimensional variable selection problems. Computational Statistics & Data Analysis. 103, 401-412.
  22. Lee, S, Kwon, S, and Kim, Y (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics & Data Analysis. 94, 275-286.
  23. McClave, J (1975). Subset autoregression. Technometrics. 17, 213-220.
  24. McLeod, AI, and Zhang, Y (2006). Partial autocorrelation parametrization for subset autoregression. Journal of Time Series Analysis. 27, 599-612.
  25. Na, O (2017). Generalized information criterion for the AR model. Journal of the Korean Statistical Society. 46, 146-160.
  26. Nardi, Y, and Rinaldo, A (2011). Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis. 102, 528-549.
  27. Sang, H, and Sun, Y (2015). Simultaneous sparse model selection and coefficient estimation for heavy-tailed autoregressive processes. Statistics. 49, 187-208.
  28. Sarkar, A, and Kanjilal, PP (1995). On a method of identification of best subset model from full AR-model. Communications in Statistics-Theory and Methods. 24, 1551-1567.
  29. Schmidt, DF, and Makalic, E (2013). Estimation of stationary autoregressive models with the Bayesian LASSO. Journal of Time Series Analysis. 34, 517-531.
  30. Schwarz, G (1978). Estimating the dimension of a model. The Annals of Statistics. 6, 461-464.
  31. Shen, X, Pan, W, Zhu, Y, and Zhou, H (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics. 1, 1-26.
  32. Shibata, R (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika. 63, 117-126.
  33. Tibshirani, RJ (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 58, 267-288.
  34. Tsay, RS (1984). Order selection in nonstationary autoregressive models. The Annals of Statistics. 12, 1425-1433.
  35. Wang, H, Li, B, and Leng, C (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 71, 671-683.
  36. Wang, H, Li, R, and Tsai, C (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 94, 553-568.
  37. Wu, WB (2005). Nonlinear system theory: another look at dependence. Proceedings of the National Academy of Sciences of the United States of America. 102, 14150-14154.
  38. Wu, WB (2011). Asymptotic theory for stationary processes. Statistics and Its Interface. 4, 207-226.
  39. Ye, F, and Zhang, CH (2010). Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls. Journal of Machine Learning Research. 11, 3519-3540.
  40. Zhang, CH (2010a). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 38, 894-942.
  41. Zhang, CH, and Zhang, T (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 27, 576-593.
  42. Zhang, T (2010b). Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research. 11, 1081-1107.
  43. Zou, H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 101, 1418-1429.
  44. Zou, H, and Li, R (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 36, 1509-1533.