Non-convex penalized estimation for the AR process
Communications for Statistical Applications and Methods 2018;25:453-470
Published online September 30, 2018
© 2018 Korean Statistical Society.

Okyoung Na^a and Sunghoon Kwon^{1,b}

aDepartment of Applied Statistics, Kyonggi University, Korea, bDepartment of Applied Statistics, Konkuk University, Korea
Correspondence to: 1Department of Applied Statistics, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Korea. E-mail: shkwon0522@konkuk.ac.kr
Received January 22, 2018; Revised April 3, 2018; Accepted April 4, 2018.

We study the problem of distinguishing the parameters of a sparse autoregressive (AR) process from zero using non-convex penalized estimation. A class of non-convex penalties is considered that includes the smoothly clipped absolute deviation and minimax concave penalties as special examples. We prove that the penalized estimators achieve some standard theoretical properties, such as the weak and strong oracle properties, which have been proved in the sparse linear regression framework. The results hold when the maximal order of the AR process increases to infinity and the minimal size of the true non-zero parameters decreases toward zero as the sample size increases. Further, we construct a practical method to select the tuning parameters using the generalized information criterion, of which the minimizer asymptotically recovers the best theoretical non-penalized estimator of the sparse AR process. Simulation studies are given to confirm the theoretical results.

Keywords : autoregressive process, subset selection, non-convex penalty, oracle property, tuning parameter selection
1. Introduction

The autoregressive (AR) process is a basic and important process for time series data analysis. The usual least squares estimation may yield severe modeling biases when the AR model is sparse, that is, when it includes zero parameters. Various information criteria (Akaike, 1969, 1973, 1979; Schwarz, 1978; Hannan and Quinn, 1979; Claeskens and Hjort, 2003) have been proposed to identify the true non-zero parameters of the AR process, which we call the subset selection problem in this paper. Theoretical properties such as asymptotic efficiency and selection consistency of the final sub-process from these information criteria have also been investigated (Shibata, 1976; Hannan and Quinn, 1979; Tsay, 1984; Claeskens and Hjort, 2003; Claeskens et al., 2007). Recently, Na (2017) introduced the generalized information criterion (GIC) (Kim et al., 2012) for the AR process, which includes most information criteria such as the Akaike information criterion (AIC) (Akaike, 1973), the Hannan-Quinn criterion (HQC) (Hannan, 1980) and the Bayesian information criterion (BIC) (Schwarz, 1978). Na (2017) proved that there is a large class of GICs that is selection consistent, including the BIC as an example. However, these approaches suffer from computational complexity since it is almost impossible to compare all the candidate sub-processes when the maximal order is very large (McClave, 1975; Sarkar and Kanjilal, 1995; Chen, 1999; McLeod and Zhang, 2006).

For years, penalized estimation has been studied as an alternative for the subset selection problem (Nardi and Rinaldo, 2011; Schmidt and Makalic, 2013; Sang and Sun, 2015; Kwon et al., 2017). The penalized estimation has nice asymptotic properties such as selection consistency and minimax optimality for various statistical models, including the generalized linear regression model (Fan and Peng, 2004; Zou, 2006; Ye and Zhang, 2010; Kwon and Kim, 2012). Moreover, a key advantage of the penalized estimation is computational efficiency, since there exist many fast and efficient algorithms (Friedman et al., 2007; Kim et al., 2008; Lee et al., 2016). Hence we need not exhaustively search all the possible candidate sub-models when the AR process has a very large model order.

There are many possible penalty functions for the penalized estimation, such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) and the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001). Some non-convex penalties, including the SCAD, the bridge (Huang et al., 2008; Kim et al., 2016) and the log penalty (Zou and Li, 2008; Kwon et al., 2016), have very distinct advantages over the others. First, they produce unbiased estimators of the parameters, which helps us understand the final model without considering the penalty effect. Second, they exactly select the non-zero parameters with probability tending to one, which is impossible for other penalty functions such as the LASSO (Zhang, 2010b; Kim and Kwon, 2012; Zhang and Zhang, 2012) and the ridge.

In this paper, we study the subset selection problem by using the non-convex penalized estimation used to identify non-zero parameters in various statistical models (Fan and Li, 2001; Zhang, 2010a; Kwon and Kim, 2012; Shen et al., 2013). A large class of non-convex penalties is considered that includes the SCAD and minimax concave penalty (MCP) (Zhang, 2010a) as examples (Kim and Kwon, 2012; Zhang and Zhang, 2012). We first prove several asymptotic properties of the non-convex penalized estimators, such as the weak and strong oracle properties that are standard in the sparse linear regression framework (Kim et al., 2016). Second, we prove that the optimal tuning parameters in the penalty can be chosen by using an information criterion of which the minimizer exactly identifies the true zero and non-zero parameters asymptotically.

The results hold when the candidate maximal order of the AR process increases to infinity and the minimal size of the true non-zero parameters decreases toward zero as the sample size increases. Further, we consider a class of error processes for the AR process that includes independently and identically distributed (iid), autoregressive conditional heteroscedastic (ARCH) and generalized ARCH (GARCH) processes, which is large enough to cover most of the recent and related literature (Nardi and Rinaldo, 2011; Schmidt and Makalic, 2013; Sang and Sun, 2015; Kwon et al., 2017).

We introduce the non-convex penalized estimation for the AR process in Section 2. Asymptotic properties of the penalized estimator are presented in Section 3, introducing an information criterion for the tuning parameter selection. Simulation studies and proofs are given in Section 4 and Appendix, respectively.

2. Non-convex penalized estimation for the AR process

Consider the AR process {yt, t ∈ ℤ},

y_t − μ = Σ_{j=1}^p β_j(y_{t−j} − μ) + ε_t,    t ∈ ℤ,

where p is a positive integer, μ ∈ ℝ and β = (β_1, . . . , β_p)^T ∈ ℝ^p are parameters of interest, {ε_t, t ∈ ℤ} is an error process, and ℝ and ℤ are the sets of real numbers and integers, respectively. We assume that the process is sparse, that is, β_j = 0 for some j ∈ S_F, where S_F = {1, . . . , p} denotes the full index set of regression parameters. In this case, we can estimate the true non-zero index set S_T = {j ∈ S_F : β_j ≠ 0} by minimizing the penalized sum of squared residuals

L_λ(u, b) = Σ_{t=p+1}^n {y_t − u − Σ_{j=1}^p b_j(y_{t−j} − u)}² + (n − p) Σ_{j=1}^p J_λ(|b_j|)

with respect to u ∈ ℝ and b = (b_1, . . . , b_p)^T ∈ ℝ^p. Here, J_λ is a non-convex penalty with tuning parameter λ > 0 that is included in the penalty class defined in the next section.

Let ȳ_j = Σ_{t=p+1}^n y_{t−j}/(n − p). Then L_λ(u, b) can be decomposed as

L_λ(u, b) = L_λ^A(u, b) + L_λ^B(b),

where L_λ^A(u, b) = (n − p){ȳ_0 − Σ_{j=1}^p b_j ȳ_j − (1 − Σ_{j=1}^p b_j)u}² and

L_λ^B(b) = Σ_{t=p+1}^n {y_t − ȳ_0 − Σ_{j=1}^p b_j(y_{t−j} − ȳ_j)}² + (n − p) Σ_{j=1}^p J_λ(|b_j|).
Let (μ̂^{λ,A}, β̂^{λ,B}) be the minimizer of L_λ(u, b). Then it is easy to see that

β̂^{λ,B} = (β̂_1^{λ,B}, . . . , β̂_p^{λ,B})^T = argmin_{b∈ℝ^p} L_λ^B(b)

and μ̂^{λ,A} = (ȳ_0 − Σ_{j=1}^p β̂_j^{λ,B} ȳ_j)/(1 − Σ_{j=1}^p β̂_j^{λ,B}). Once β̂^{λ,B} is obtained, we can estimate the true index set by using the set {j : β̂_j^{λ,B} ≠ 0}. We often estimate μ by the sample mean ȳ = Σ_{t=1}^n y_t/n before estimating β, which is the same as estimating β based on the centered samples y_t − ȳ, t = 1, . . . , n. In this case, the penalized estimator, β̂^{λ,C}, can be defined as

β̂^{λ,C} = (β̂_1^{λ,C}, . . . , β̂_p^{λ,C})^T = argmin_{b∈ℝ^p} L_λ^C(b),

where L_λ^C(b) = L_λ(ȳ, b).

In this paper, we define the penalized estimator of the regression parameter β as

β̂^λ = (β̂_1^λ, . . . , β̂_p^λ)^T = argmin_{b∈ℝ^p} L_λ^G(b; m),

where m = (m_0, m_1, . . . , m_p)^T ∈ ℝ^{p+1} and

L_λ^G(b; m) = Σ_{t=p+1}^n {y_t − m_0 − Σ_{j=1}^p b_j(y_{t−j} − m_j)}² + (n − p) Σ_{j=1}^p J_λ(|b_j|).

The definition is general enough to include the above two special cases: if we set m = (ȳ_0, ȳ_1, . . . , ȳ_p)^T and m = (ȳ, . . . , ȳ)^T, then L_λ^G(b; m) becomes L_λ^B(b) and L_λ^C(b), respectively.
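To make the objective concrete, L_λ^G(b; m) can be sketched in code. The following is a minimal illustration, assuming the SCAD penalty and the (n − p) scaling of the penalty term used above; the function names are ours, not the paper's.

```python
import numpy as np

def scad_penalty(x, lam, a=3.7):
    """SCAD penalty J_lambda evaluated elementwise at |x|."""
    x = np.abs(x)
    return np.where(
        x <= lam,
        lam * x,
        np.where(x <= a * lam,
                 (2 * a * lam * x - x ** 2 - lam ** 2) / (2 * (a - 1)),
                 lam ** 2 * (a + 1) / 2))

def objective(y, b, m, lam, a=3.7):
    """Penalized sum of squared residuals L^G_lambda(b; m).

    y : series y_1, ..., y_n;  b : AR coefficients b_1, ..., b_p;
    m : centering constants (m_0, m_1, ..., m_p).
    """
    y, b, m = np.asarray(y), np.asarray(b), np.asarray(m)
    p, n = len(b), len(y)
    resid = [y[t] - m[0] - sum(b[j] * (y[t - 1 - j] - m[j + 1]) for j in range(p))
             for t in range(p, n)]
    return np.sum(np.square(resid)) + (n - p) * scad_penalty(b, lam, a).sum()
```

Setting m = (ȳ, . . . , ȳ) gives the centered-sample objective L_λ^C(b); with b = 0 the penalty vanishes and only the centered sum of squares remains.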

3. Asymptotic properties

In this section, we investigate some asymptotic properties such as the weak and strong oracle properties that have been proved in the sparse linear regression framework. The results hold when p → ∞ and min_{j∈S_T} |β_j| → 0 as n → ∞. Further, we propose the GIC that asymptotically recovers the best theoretical non-penalized estimator of the AR process.

3.1. Model assumptions and penalty class

We assume the following conditions (E1)–(E5):

  • (E1) {y_t, t ∈ ℤ} is stationary and there exists an absolutely summable sequence {ψ_i, i ∈ ℕ} such that

    y_t − μ = ε_t + Σ_{i=1}^∞ ψ_i ε_{t−i},    t ∈ ℤ,

    where ℕ is the set of natural numbers.

  • (E2) {ε_t, t ∈ ℤ} is a white noise with mean 0 and positive variance σ_ε².

  • (E3) {ε_t, t ∈ ℤ} is a sequence of martingale differences with respect to a filtration {𝓕_t, t ∈ ℤ}.

  • (E4) {ε_t, t ∈ ℤ} takes the form

    ε_t = g(η_t, η_{t−1}, . . .),

    where η_t, t ∈ ℤ, are iid random variables and g is a measurable function.

  • (E5) E(|ε_1|^ν) < ∞ and {ε_t, t ∈ ℤ} is ν-strong stable with ν ≥ 2 (Wu, 2005). Here, ν-strong stability means that

    Δ_ν^ε = Σ_{i=0}^∞ δ_ν^ε(i) < ∞,

    where δ_ν^ε(i) = {E(|ε_i − g(η_i, . . . , η_1, η_0^*, η_{−1}, . . .)|^ν)}^{1/ν} and {η_t^*, t ∈ ℤ} is an iid copy of {η_t, t ∈ ℤ}.

The error processes satisfying conditions (E2)–(E5) include iid, ARCH and GARCH processes. For example, if the errors are iid with E(|ε_1|^ν) < ∞, then Δ_ν^ε = δ_ν^ε(0) ≤ 2{E(|ε_1|^ν)}^{1/ν} < ∞ and E(ε_t | ε_{t−1}, ε_{t−2}, . . .) = 0, and consequently (E2)–(E5) hold. From Bollerslev (1986) and Wu (2011), GARCH processes satisfy conditions (E2)–(E5) as well. Let γ : ℤ → ℝ be the autocovariance function of the process {y_t, t ∈ ℤ}. From (E1)–(E3), γ(0) = Σ_{i=0}^∞ ψ_i² σ_ε² ∈ (0, ∞) and γ(h) = Σ_{i=0}^∞ ψ_i ψ_{i+h} σ_ε² converges to 0 as h increases to ∞, where ψ_0 = 1. Therefore, the autocovariance matrix Γ_p = (γ(i − j))_{1≤i,j≤p} is positive definite by Proposition 5.1.1 of Brockwell and Davis (2006). Also, γ is an absolutely summable autocovariance function because of the absolute summability of {ψ_i, i ∈ ℕ}.
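As an illustration of the error class above, an ARCH(1) process can be generated from iid innovations, matching the representation ε_t = g(η_t, η_{t−1}, . . .) in (E4); the parameter values, burn-in length, and function name below are illustrative assumptions.

```python
import numpy as np

def arch1(n, w=0.5, b=0.3, seed=None, burn=200):
    """Generate n ARCH(1) errors: eps_t = eta_t * sqrt(w + b * eps_{t-1}^2)."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(size=n + burn)         # iid innovations eta_t
    eps = np.zeros(n + burn)
    for t in range(1, n + burn):
        eps[t] = eta[t] * np.sqrt(w + b * eps[t - 1] ** 2)
    return eps[burn:]                       # drop the burn-in
```

Such errors are white noise (mean zero, uncorrelated) but not iid: ε_t² is autocorrelated, which is exactly the gap between (E2) and an iid assumption.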

We consider a class of non-convex penalties where the penalty Jλ satisfies (P1)–(P3):

  • (P1) There exists a decreasing function ∇J_λ : [0, ∞) → [0, ∞) such that J_λ(x) = ∫_0^x ∇J_λ(t) dt for all x ≥ 0.

  • (P2) There exists a positive constant a such that ∇J_λ(x) = 0 for all x > aλ.

  • (P3) λ − x/a ≤ ∇J_λ(x) ≤ λ for all 0 ≤ x < aλ.

The penalty class has been studied in the high dimensional linear regression model (Kim and Kwon, 2012; Zhang and Zhang, 2012; Kim et al., 2016). The class includes the MC and capped ℓ1 penalties (Zhang and Zhang, 2012; Shen et al., 2013) as special examples, which are the lower and upper bounds of the class, respectively; the class also includes the SCAD penalty. (P1) implies that J_λ is a continuous, increasing and concave function defined on [0, ∞) with J_λ(0) = 0. If J_λ is differentiable, then ∇J_λ is merely the derivative function of J_λ. From conditions (P1)–(P3), we can see that J_λ is locally sub-differentiable at any point x_0 ∈ (−∞, −aλ) ∪ {0} ∪ (aλ, ∞) although it is not convex. By (P2), 0 is the unique local subgradient of J_λ at a point x_0 with |x_0| > aλ, which makes the non-zero elements of the penalized estimator unbiased with finite samples. From (P3), ∇J_λ(x) ≥ λ/2 for 0 ≤ x ≤ aλ/2, so the set of local subgradients of J_λ at the origin includes [−λ/2, λ/2], which makes the penalized estimator sparse. These properties of the subgradients play an important role in establishing the oracle properties.
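The conditions (P2) and (P3) are easy to verify numerically from the derivative functions ∇J_λ of the penalties just discussed. A small sketch, using the standard SCAD, MC, and capped-ℓ1 derivative formulas; the helper names are ours.

```python
import numpy as np

def grad_scad(x, lam, a=3.7):
    # SCAD: lam on [0, lam], (a*lam - x)/(a - 1) on (lam, a*lam], 0 beyond
    return np.where(x <= lam, lam, np.maximum(a * lam - x, 0.0) / (a - 1))

def grad_mcp(x, lam, a=3.0):
    # MC penalty: (lam - x/a)_+, the lower envelope of the class
    return np.maximum(lam - x / a, 0.0)

def grad_capped_l1(x, lam, a=3.0):
    # capped-l1: lam * 1(x < a*lam), the upper envelope of the class
    return lam * (x < a * lam)

lam, a = 0.5, 3.0
x = np.linspace(0.0, a * lam, 200, endpoint=False)
for grad in (grad_scad, grad_mcp, grad_capped_l1):
    g = grad(x, lam, a)
    assert np.all(g <= lam + 1e-12)                           # upper bound in (P3)
    assert np.all(g >= lam - x / a - 1e-12)                   # lower bound in (P3)
    assert grad(np.array([a * lam + 1.0]), lam, a)[0] == 0.0  # (P2)
```

The assertions confirm that all three derivative functions stay between the MC lower envelope λ − x/a and the capped-ℓ1 upper envelope λ on [0, aλ), and vanish beyond aλ.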

3.2. Oracle properties

Theorem 1

(Weak oracle property) Assume that (E1)–(E5) with ν ≥ 4 and (P1)–(P3) hold. Let q = |S_T| and κ = ‖Γ_p(S_T)^{−1}‖_∞, where Γ_p(S_T) is the q × q submatrix of Γ_p corresponding to S_T and ‖A‖_∞ is the maximum absolute row sum of a matrix A. If

lim_{n→∞} (1 + κ²)p²/n = 0,    lim_{n→∞} (log p + κ² log q)/(nλ²) = 0,    lim_{n→∞} λ/min_{j∈S_T} |β_j| = 0,

then the oracle least squares estimator (LSE)

β̂^o = argmin_{b∈Θ_o^p} Q(b; m)

is a local minimizer of L_λ^G(b; m) with probability tending to 1, where Q(b; m) is the unpenalized sum of squared residuals in L_λ^G(b; m) and Θ_o^p = {(x_1, . . . , x_p)^T ∈ ℝ^p : x_j = 0, j ∈ S_F∖S_T}.

Theorem 1 shows that the oracle LSE in (3.5) becomes one of the local minimizers of (2.6), which is often referred to as the weak oracle property (Fan and Li, 2001; Kim et al., 2016) in the linear regression model. Note that the result holds for any m_j that is a √n-consistent estimator of μ. For example, we may use the trimmed mean instead of the sample mean when there are some outliers.

The objective function (2.6) is non-convex, so we need a stronger result than the weak oracle property to avoid bad local minimizers (Zhang, 2010a; Kim and Kwon, 2012). The next theorem shows that the oracle LSE is the unique local minimizer with probability tending to one, so that it becomes the unique global minimizer of (2.6).

Theorem 2

(Strong oracle property) Assume that the assumptions of Theorem 1 hold. Let ρ = λ_min(Γ_p), where λ_min(A) is the smallest eigenvalue of a matrix A. If

lim_{n→∞} p²/(nρ²) = 0,    lim_{n→∞} (log p + κ² log q)/(nρ²λ²) = 0,    lim_{n→∞} λ/(ρ min_{j∈S_T} |β_j|) = 0,

then β̂^o is the unique local minimizer of L_λ^G(b; m) with probability tending to 1.

Remark 1

Note that κ and ρ are often assumed to be fixed positive constants. In this case, the results in Theorems 1 and 2 hold when

min_{j∈S_T} |β_j| ≫ λ ≫ √(log p/n)    and    n ≫ p²,

which are exactly the same as the results in the linear regression model. Here, a ≫ b implies b/a = o(1) as n → ∞.

Remark 2

In the linear regression model, ρ = λ_min(X^TX/n) determines the size of the possible minimum non-zero regression coefficient, where X is the design matrix, since we need min_j |β_j| ≫ λ ≫ √(log p/n) and min_j |β_j| ≫ λ/ρ ≫ √(log p/(nρ⁴)) for the weak and strong oracle properties, respectively. This implies that the smaller ρ is, the larger the minimum signal size is required to be. In the AR process, ρ = λ_min(Γ_p) and κ = ‖Γ_p(S_T)^{−1}‖_∞ play the same role, since we need min_j |β_j| ≫ λ ≫ √((log p + κ² log q)/n) and min_j |β_j| ≫ λ/ρ ≫ √((log p + κ² log q)/(nρ⁴)).

Let β̂^{o,B} and β̂^{o,C} be the oracle LSEs that correspond to m = (ȳ_0, ȳ_1, . . . , ȳ_p)^T and m = (ȳ, . . . , ȳ)^T, respectively. By the functional central limit theorem,

max_{0≤j≤p} |ȳ_j − μ| = O_P(1/√n)

and |ȳ − μ| = O_P(1/√n). Hence, from Theorem 2, the two penalized estimators are exactly the same as the corresponding oracle LSEs with probability tending to one, which is summarized in the next corollary.

Corollary 1

Assume that the assumptions of Theorem 2 hold. Then

lim_{n→∞} P(β̂^{λ,B} = β̂^{o,B}) = 1    and    lim_{n→∞} P(β̂^{λ,C} = β̂^{o,C}) = 1.
Remark 3

From Lemma 2 in the Appendix,

max_{j∈S_F} |β̂_j^{o,B} − β_j| = O_P(κ√(log q/n))    and    max_{j∈S_F} |β̂_j^{o,C} − β_j| = O_P(κ√(log q/n)),

which implies

max_{j∈S_F} |β̂_j^{λ,B} − β̂_j^{λ,C}| = O_P(κ√(log q/n)).

Hence the two oracle LSEs are asymptotically equivalent, so that centering the samples does not affect the regression parameter estimation.

3.3. Tuning parameter selection

The most practical and important issue for the penalized estimation is how to select the tuning parameters (Nardi and Rinaldo, 2011; Kwon et al., 2017). Some conventional approaches minimize validation or cross-validation errors. However, selecting tuning parameters based on prediction errors may produce over-fitted sub-models (Wang et al., 2007, 2009). We propose to use the GIC, which is easy to calculate without using extra independent samples, in order to select the tuning parameters. Given λ > 0, define the GIC as

GIC(λ) = log Q(β̂^λ; m) + α‖β̂^λ‖_0,

where ‖β̂^λ‖_0 is the number of non-zero entries of β̂^λ and α > 0 controls the penalty on model complexity. The next theorem proves that we can select the optimal tuning parameter by minimizing the GIC.
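The GIC search over a grid of tuning parameters can be sketched as follows. For brevity, the non-convex penalized estimator β̂^λ is approximated by hard-thresholding the full least squares fit at level λ; this stand-in, and the function name, are our assumptions for illustration only.

```python
import numpy as np

def gic_select(y, p, lambdas):
    """Pick lambda minimizing GIC(lambda) = log Q + alpha * ||beta||_0."""
    y = np.asarray(y)
    n = len(y)
    yc = y - y.mean()                                   # centered samples
    X = np.column_stack([yc[p - 1 - j:n - 1 - j] for j in range(p)])
    resp = yc[p:]
    beta_ls = np.linalg.lstsq(X, resp, rcond=None)[0]   # full least squares fit
    alpha = np.log(np.log(p)) * np.log(n) / n           # rate from Remark 4
    best = None
    for lam in lambdas:
        beta = np.where(np.abs(beta_ls) > lam, beta_ls, 0.0)  # hard threshold
        q = np.sum((resp - X @ beta) ** 2)              # Q(beta; m)
        gic = np.log(q) + alpha * np.count_nonzero(beta)
        if best is None or gic < best[0]:
            best = (gic, lam, beta)
    return best
```

In practice β̂^λ would come from a non-convex penalized solver (e.g., coordinate descent for the SCAD or MCP); only the GIC bookkeeping is the point here.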

Theorem 3

Assume that (E1)–(E5) with ν ≥ 4 and (P1)–(P3) hold. If

lim_{n→∞} (1 + κ² + ρ^{−2})p²/n = 0,    lim_{n→∞} (√log p + κ√log q + √(ρp))/(ρ√n min_{j∈S_T} |β_j|) = 0,
lim_{n→∞} pα = 0,    lim_{n→∞} pα/(ρ min_{j∈S_T} β_j²) = 0,    lim_{n→∞} (log p + κ² log q)/(ραn) = 0,

then there exists λ_0 such that

lim_{n→∞} P(inf_{λ∈Λ_+∪Λ_−} GIC(λ) > GIC(λ_0)) = 1,

where Λ_+ = {λ > 0 : S_λ ⊋ S_T}, Λ_− = {λ > 0 : S_λ ⊉ S_T} and S_λ = {j ∈ S_F : β̂_j^λ ≠ 0} for λ > 0.

Remark 4

When κ and ρ are positive constants, the result in Theorem 3 holds when

min_{j∈S_T} β_j² ≫ pα ≫ p log p/n    and    n ≫ p².

For example, α = log(log p) log n/n satisfies the condition as suggested in Kwon et al. (2017) and Na (2017) for the adaptive LASSO and GIC, respectively.

4. Numerical studies

We consider two examples to show that the theoretical results hold with finite samples:

Example 1

S_T = {1, . . . , q}, β_j^* = (c_0/j)I(j ∈ S_T), μ = 0 and ε_t are iid samples from the standard normal distribution, where c_0 = 0.9/(Σ_{j=1}^q 1/j).

Example 2

S_T = {j_1, . . . , j_q} is a random subset of S_F and β_j^* = Σ_{k=1}^q (c_0/k)I(j = j_k).

In each example, we set n = 100 × 2^k, k ∈ {0, 1, . . . , 7}, p = 10[n^{1/3}] and q = [n^{1/4}], where [x] is the nearest integer to x.

We consider two non-convex penalties for the centered samples, the SCAD and MCP, and compare their finite sample performance with some reference methods: four GICs (AIC, HQC, BIC, and GIC6) in Na (2017) and the adaptive LASSO (ALASSO) in Kwon et al. (2017). Tuning parameters are selected by the GIC in (3.8) with α = log(log p) log n/n for all the penalized estimators. We report four measures: sensitivity (SEN), specificity (SPC), selection accuracy (SA), and prediction error (PE) from independent test samples y_t^i, i ≤ n. The measures are defined as |Ŝ_T ∩ S_T|/|S_T|, |Ŝ_T^c ∩ S_T^c|/|S_T^c|, I(Ŝ_T = S_T), and Σ_{i=1}^n (y_t^i − ŷ_t^i)²/n, respectively, where Ŝ_T is the index set of non-zero parameters estimated by each method. We repeat each simulation 200 times and summarize the results in Tables 1, 2 and Figure 1, where all the standard errors are less than 0.03 and omitted.
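The data generating process of Example 1 and the selection measures can be sketched as follows; the burn-in length and the helper names are our choices, not the paper's.

```python
import numpy as np

def simulate_example1(n, p, q, seed=None, burn=500):
    """Generate the sparse AR(p) process of Example 1 with iid N(0,1) errors."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1)
    c0 = 0.9 / np.sum(1.0 / np.arange(1, q + 1))
    beta = np.where(j <= q, c0 / j, 0.0)       # beta_j* = (c0/j) I(j in S_T)
    eps = rng.normal(size=n + burn)
    y = np.zeros(n + burn)
    for t in range(p, n + burn):
        y[t] = beta @ y[t - 1::-1][:p] + eps[t]
    return y[burn:], beta

def selection_measures(est_set, true_set, p):
    """SEN, SPC, and SA for an estimated index set versus the true one."""
    full = set(range(1, p + 1))
    sen = len(est_set & true_set) / len(true_set)
    spc = len((full - est_set) & (full - true_set)) / len(full - true_set)
    return sen, spc, float(est_set == true_set)
```

Since Σ_j β_j^* = 0.9 < 1, the generated process is stationary, and the burn-in removes the influence of the zero initial values.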

The PEs are quite similar for all the methods, but the selection performances are significantly different. The AIC and HQC have better SEN but worse SPC than the others, which results in a very low SA. This shows that the AIC and HQC are not selection consistent (Na, 2017) but over-fit. The SEN, SPC, and SA increase to 1 as the sample size increases for all the penalized estimators as well as for the BIC and GIC6. The simulation results confirm that the theoretical properties provided hold for finite samples.

5. Concluding remarks

We present some asymptotic properties of the non-convex penalized estimators for the AR process, when the maximal order is large and minimal signal size is small. The results show that the non-convex penalized estimation can be used for parameter estimation and model identification simultaneously. This paper is intended to provide a theoretical starting point for future studies on other complex time series analysis.

A referee pointed out that assuming increasing p is unusual in practice because the AR order does not necessarily increase with the sample size. First of all, we completely agree that many, or almost all, AR processes may have fixed or finite model orders, so that p may not need to be assumed increasing or varying with the sample size. However, we think that the larger the sample size, the more parameters become statistically significant, which implies that a fixed choice of small p may prevent us from finding important lags. In this sense, the expression “p → ∞ as n → ∞” does not imply the existence of such an order-increasing model but rather that we should try an order as large as possible given the sample size. Besides p → ∞, the expression “min_j |β_j| → 0” can be understood in a similar way.


Acknowledgements

This work was supported by a Kyonggi University Research Grant 2015-099.


Appendix: Proofs

Let x_{ij} = y_{i+p−j} − m_j and z_{ij} = y_{i+p−j} − μ for i = 1, . . . , n − p and j = 1, . . . , p. Let X = (x_{ij}), Z = (z_{ij}), y = (y_{p+1}, . . . , y_n)^T, and ε = (ε_{p+1}, . . . , ε_n)^T. For j = 1, . . . , p, let X_j and Z_j be the jth column vectors of X and Z, respectively. For a set S = {i_1, . . . , i_k} ⊂ S_F with 1 ≤ i_1 < i_2 < · · · < i_k ≤ p, X_S denotes the (n − p) × k matrix with jth column vector X_{i_j}, and Z_S is defined similarly. Given a p-dimensional vector v = (v_1, . . . , v_p)^T and a set S, v_S denotes the k-dimensional subvector (v_{i_1}, . . . , v_{i_k})^T of v. For a matrix A, ‖A‖_2 and ‖A‖_∞ are the induced matrix norms using the Euclidean and maximum norms for vectors, respectively. And for a set S, |S| means the cardinality of S.

Lemma 1

Assume that (2.1), (3.1), (E1)–(E5) with ν ≥ 4, and (3.4) hold. If limn→∞p2/n = 0, then

max_{j∈S_T} |X_j^T ε| = O_P(√(n log q)),    max_{j∈S_F} |X_j^T ε| = O_P(√(n log p)),    max_{j∈S_F} |X_j^T u| = O_P(√n),    ‖X^TX/(n − p) − Γ_p‖_2 = O_P(p/√n),

where u is an (n − p)-dimensional vector with all entries 1.


First, we can show that there exists a positive constant M such that


for all non-empty subsets S ⊂ S_F, in a similar way to the proof of Lemma 1 in Kwon et al. (2017), because {z_{ij}} is a stationary predictable process and {ε_i} is a sequence of stationary martingale differences with respect to the filtration. Since X_j − Z_j = (μ − m_j)u for all j ∈ S_F and u^T ε = O_P(√n), (3.4) implies that


Therefore, for a non-empty subset S ⊂ S_F,


and hence (A.1) and (A.2) hold.

Next, note that {yt, t ∈ ℤ} is ν-strong stable under assumptions (E1)–(E5). Applying the functional central limit theorem to {yt, t ∈ ℤ} yields


and hence


by (3.4).

Lastly, we can extend Lemma 1 in Kwon et al. (2017) to the case of Z and obtain


due to the ν-strong stability of {yt, t ∈ ℤ}. Also, for any i,


Therefore, from (A.5)–(A.8) and (3.4), we can deduce that


Lemma 2

Assume that the assumptions of Lemma 1 hold and that limn→∞κ2p2/n = 0. Then,

max_{j∈S_F} |β̂_j^o − β_j| = O_P(κ√(log q/n)).

First, β̂_{S_T}^o can be written as follows


provided that X_{S_T}^T X_{S_T} is non-singular. Therefore,


If it holds that


then X_{S_T}^T X_{S_T}/(n − p) is non-singular and


Since limn→∞κ2p2/n = 0, the probability that (A.10) holds tends to 1 as n → ∞ by (A.4). Consequently,


According to (A.1), (A.3), and (3.4),

max_{j∈S_T} |X_j^T ε| + max_{j∈S_F} |X_j^T u| × max_{0≤j≤p} |m_j − μ| × (1 + Σ_{j∈S_T} |β_j|) = O_P(√(n log q)) + O_P(√q)

and this completes the proof.

Now, let us prove the main theorems in Section 3.

Proof of Theorem 1

Since β̂_{S_T}^o satisfies the normal equations X_{S_T}^T X_{S_T} β̂_{S_T}^o = X_{S_T}^T(y − m_0 u) and β̂_j^o = 0 for j ∈ S_F∖S_T, we have


Therefore, if min_{j∈S_T} |β̂_j^o| > aλ and max_{j∈S_F∖S_T} |X_j^T(y − m_0 u − Xβ̂^o)| ≤ nλ/2, then β̂^o becomes a local minimizer of L_λ^G(b; m) by the KKT conditions and (P1)–(P3).

From (3.3) and Lemma 2, we have


By using the results of Lemmas 1 and 2 and the boundedness of ||Γp||, we obtain that

max_{j∈S_F∖S_T} |X_j^T(y − m_0 u − Xβ̂^o)| ≤ max_{j∈S_F} |X_j^T X(β − β̂^o)| + max_{j∈S_F} |X_j^T ε| + max_{j∈S_F} |X_j^T{(Z − X)β + (μ − m_0)u}| ≤ ‖X^TX‖_∞ max_{j∈S_F} |β̂_j^o − β_j| + max_{j∈S_F} |X_j^T ε| + max_{j∈S_F} |X_j^T u| max_{0≤j≤p} |m_j − μ| (1 + Σ_{j∈S_T} |β_j|) = O_P(κ√(n log q)) + O_P(√(n log p)) + O_P(√q)

and hence (3.3) implies that




where is the set of all local minimizers of LλG(b;m).

Proof of Theorem 2

For given λ > 0 and c > 0, let be the set of b ∈ ℝp such that


Let ρ̂ be the smallest eigenvalue of X^TX/(n − p). By Theorem 1 of Kim and Kwon (2012), membership of this set with ρ̂ > 0 is a sufficient condition for β̂^o to be the unique local minimizer of L_λ^G(b; m). Also, since β̂^o is a local minimizer of L_λ^G(b; m) with probability tending to 1 by Theorem 1, it is enough to show


in order to obtain the result of Theorem 2.

Using the condition lim_{n→∞} p²/(nρ²) = 0, (A.4) and the symmetry of X^TX/(n − p) − Γ_p,


And if ‖X^TX/(n − p) − Γ_p‖_2 ≤ ρ/2, then ρ̂ ≥ ρ/2 > 0. Therefore, we have


Note that lim_{n→∞} P(min_{j∈S_T} |β_j| > 2 max_{j∈S_F} |β̂_j^o − β_j|) = 1 by (3.3) and Lemma 2. In this case, min_{j∈S_T} |β̂_j^o| > min_{j∈S_T} |β_j|/2 > 0 and hence {j ∈ S_F : β̂_j^o ≠ 0} = S_T. Since inf_{0<x<aλ}(ρx/2 + ∇J_λ(x)) ≥ min(λ, aρλ/2) and sup_{0<x<aλ}(x + 2∇J_λ(x)/ρ) ≤ aλ + 2λ/ρ by (P3), we have


by (3.6) and (A.11). Hence, (A.12) holds.

Lemma 3

For a given S ⊂ S_F, let β̃^S = argmin_{b∈Θ_S^p} Q(b; m) and define

GIC~(S) = log Q(β̃^S; m) + α|S|,

where Θ_S^p = {(x_1, . . . , x_p)^T ∈ ℝ^p : x_j = 0, j ∈ S_F∖S}. Under the assumptions of Theorem 3,


First, note that ρ is positive, ||Γp||2 is bounded, and


by (A.13). Given a set S ⊂ S_F, we can show that ‖Z_S^T ε‖_2 = O_P(√(n|S|)) by extending Lemma 1 of Na (2017). Since max_{j∈S_F} |X_j^T ε − Z_j^T ε| = max_{j∈S_F} |m_j − μ||u^T ε| = O_P(1),


From (A.6), we have


Therefore, combining (A.15)–(A.17) and (3.4) gives that


and consequently, we have

‖β̃^{S_F} − β‖_2 = O_P(√(p/(nρ²)))    and    Q(β̃^{S_F}; m)/(n − p) = σ_ε² + o_P(1).

For a while, we assume that min_{j∈S_T} |β̃_j^{S_F}|² ≥ min_{j∈S_T} β_j²/2, ρ̂ ≥ ρ/2, and Q(β̃^{S_F}; m) ≤ (n − p)σ_ε². Then for any S ⊉ S_T,

log Q(β̃^S; m) − log Q(β̃^{S_F}; m) ≥ min{(Q(β̃^S; m) − Q(β̃^{S_F}; m))/(2Q(β̃^{S_F}; m)), 1/2} ≥ min{ρ̂‖β̃^{S_F} − β̃^S‖_2²/(2σ_ε²), 1/2} ≥ min{ρ̂ min_{j∈S_T∖S} |β̃_j^{S_F}|²/(2σ_ε²), 1/2} ≥ min{ρ min_{j∈S_T} β_j²/(4σ_ε²), 1/2},

because log(1 + x) ≥ min(x/2, 1/2) for all x > 0. Therefore, we obtain that

lim_{n→∞} P(inf_{S⊉S_T} GIC~(S) > GIC~(S_F)) ≥ lim_{n→∞} P(min{ρ min_{j∈S_T} β_j²/(4σ_ε²), 1/2} > pα) = 1

by using (3.10), (3.11), (A.15), and (A.18).

Since log(1 + x) ≤ x for all x > −1, we have

log Q(β̃^{S_T}; m) − log Q(β̃^S; m) ≤ (Q(β̃^{S_T}; m) − Q(β̃^S; m))/Q(β̃^S; m) ≤ ‖(X^TX)^{−1}‖_2 ‖X_{S∖S_T}^T(I_{n−p} − X_{S_T}(X_{S_T}^T X_{S_T})^{−1} X_{S_T}^T)(y − m_0 u)‖_2²/Q(β̃^{S_T}; m)

for all sets S ⊋ S_T, where I_{n−p} is the (n − p)-dimensional identity matrix. Also,

sup_{S⊋S_T} ‖X_{S∖S_T}^T(I_{n−p} − X_{S_T}(X_{S_T}^T X_{S_T})^{−1} X_{S_T}^T)(y − m_0 u)‖_2/√|S∖S_T| ≤ sup_{S⊋S_T} max_{j∈S∖S_T} |X_j^T(I_{n−p} − X_{S_T}(X_{S_T}^T X_{S_T})^{−1} X_{S_T}^T)(y − m_0 u − X_{S_T}β_{S_T})| ≤ max_{j∈S_F} |X_j^T ε| + max_{j∈S_F} |X_j^T X_{S_T}(X_{S_T}^T X_{S_T})^{−1} X_{S_T}^T ε| + (max_{j∈S_F} |X_j^T u| + max_{j∈S_F} |X_j^T X_{S_T}(X_{S_T}^T X_{S_T})^{−1} X_{S_T}^T u|) max_{0≤j≤p} |m_j − μ| (1 + Σ_{j∈S_T} |β_j|) = O_P(√(n log p)) + O_P(κ√(n log q)) + O_P(√q κ).

Hence, (3.11), (A.15), and (A.18) imply that


Therefore, we obtain the result (A.14) by (A.19) and (A.20).

Proof of Theorem 3

By Theorems 1 and 2, there exists a sequence λ0 such that


Since GIC(λ) ≥ GIC~(S_λ) for all λ > 0 and GIC(λ_0) = GIC~(S_T) when β̂^{λ_0} = β̂^o and S_{λ_0} = S_T, we have

lim_{n→∞} P(inf_{λ∈Λ_+∪Λ_−} GIC(λ) > GIC(λ_0)) ≥ lim_{n→∞} P(inf_{λ∈Λ_+∪Λ_−} GIC(λ) > GIC(λ_0), β̂^{λ_0} = β̂^o, S_{λ_0} = S_T) ≥ lim_{n→∞} P(inf_{λ∈Λ_+∪Λ_−} GIC~(S_λ) > GIC~(S_T), β̂^{λ_0} = β̂^o, S_{λ_0} = S_T) ≥ lim_{n→∞} P(inf_{S≠S_T} GIC~(S) > GIC~(S_T)) − lim_{n→∞} P(β̂^{λ_0} ≠ β̂^o or S_{λ_0} ≠ S_T) = 1

by Lemma 3.

Fig. 1. Sensitivity, specificity, selection accuracy and prediction error in Examples 1 (upper) and 2 (lower). AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty.

Table 1

Simulation results from Example 1





AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty; SEN = sensitivity; SPC = specificity; SA = selection accuracy; PE = prediction error.

Table 2

Simulation results from Example 2





AIC = Akaike information criterion; HQC = Hannan-Quinn criterion; BIC = Bayesian information criterion; GIC = generalized information criterion; ALASSO = adaptive least absolute shrinkage and selection operator; SCAD = smoothly clipped absolute deviation; MCP = minimax concave penalty; SEN = sensitivity; SPC = specificity; SA = selection accuracy; PE = prediction error.

  1. Akaike, H (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics. 21, 243-247.
  2. Akaike, H 1973. Information theory and an extension of the maximum likelihood principle., Proceeding 2nd International Symposium on Information Theory, pp.267-281.
  3. Akaike, H (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika. 66, 237-242.
  4. Bollerslev, T (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics. 31, 307-327.
  5. Brockwell, PJ, and Davis, RA (2006). Time Series: Theory and Methods. New York: Springer
  6. Chen, C (1999). Subset selection of autoregressive time series models. Journal of Forecasting. 18, 505-516.
  7. Claeskens, G, Croux, C, and Van Kerckhoven, J (2007). Prediction focused model selection for autoregressive models. The Australian and New Zealand Journal of Statistics. 49, 359-379.
  8. Claeskens, G, and Hjort, NL (2003). The focussed information criterion. Journal of the American Statistical Association. 98, 900-916.
  9. Fan, J, and Li, R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 96, 1348-1360.
  10. Fan, J, and Peng, H (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics. 32, 928-961.
  11. Friedman, J, Hastie, T, Hofling, H, and Tibshirani, R (2007). Pathwise coordinate optimization. The Annals of Applied Statistics. 1, 302-332.
  12. Hannan, EJ (1980). The estimation of the order of an ARMA process. The Annals of Statistics. 8, 1071-1081.
  13. Hannan, EJ, and Quinn, BG (1979). The determination of the order of an autoregression. Journal of Royal Statistical Society. 41, 190-195.
  14. Huang, J, Horowitz, JL, and Ma, S (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics. 36, 587-613.
  15. Kim, Y, Choi, H, and Oh, H (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association. 103, 1656-1673.
  16. Kim, Y, Jeon, JJ, and Han, S (2016). A necessary condition for the strong oracle property. Scandinavian Journal of Statistics. 43, 610-624.
  17. Kim, Y, and Kwon, S (2012). Global optimality of nonconvex penalized estimators. Biometrika. 99, 315-325.
  18. Kim, Y, Kwon, S, and Choi, H (2012). Consistent model selection criteria on high dimensions. Journal of Machine Learning Research. 13, 1037-1057.
  19. Kwon, S, and Kim, Y (2012). Large sample properties of the SCAD-penalized maximum likelihood estimation on high dimensions. Statistica Sinica. 22, 629-653.
  20. Kwon, S, Lee, S, and Na, O (2017). Tuning parameter selection for the adaptive lasso in the autoregressive model. Journal of the Korean Statistical Society. 46, 285-297.
  21. Kwon, S, Oh, S, and Lee, Y (2016). The use of random-effect models for high-dimensional variable selection problems. Computational Statistics & Data Analysis. 103, 401-412.
  22. Lee, S, Kwon, S, and Kim, Y (2016). A modified local quadratic approximation algorithm for penalized optimization problems. Computational Statistics & Data Analysis. 94, 275-286.
  23. McClave, J (1975). Subset autoregression. Technometrics. 17, 213-220.
  24. McLeod, AI, and Zhang, Y (2006). Partial autocorrelation parametrization for subset autoregression. Journal of Time Series Analysis. 27, 599-612.
  25. Na, O (2017). Generalized information criterion for the ar model. Journal of the Korean Statistical Society. 46, 146-160.
  26. Nardi, Y, and Rinaldo, A (2011). Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis. 102, 528-549.
  27. Sang, H, and Sun, Y (2015). Simultaneous sparse model selection and coefficient estimation for heavy-tailed autoregressive processes. Statistics. 49, 187-208.
  28. Sarkar, A, and Kanjilal, PP (1995). On a method of identification of best subset model from full ar-model. Communications in Statistics-Theory and Methods. 24, 1551-1567.
  29. Schmidt, DF, and Makalic, E (2013). Estimation of stationary autoregressive models with the Bayesian LASSO. Journal of Time Series Analysis. 34, 517-531.
  30. Schwarz, G (1978). Estimating the dimension of a model. The Annals of Statistics. 6, 461-464.
  31. Shen, X, Pan, W, Zhu, Y, and Zhou, H (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics. 1, 1-26.
  32. Shibata, R (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika. 63, 117-126.
  33. Tibshirani, RJ (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 58, 267-288.
  34. Tsay, RS (1984). Order selection in nonstationary autoregressive models. The Annals of Statistics. 12, 1425-1433.
  35. Wang, H, Li, B, and Leng, C (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of Royal Statistical Society, Series B. 71, 671-683.
  36. Wang, H, Li, R, and Tsai, C (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 94, 553-568.
  37. Wu, WB (2005). Nonlinear system theory: another look at dependence. Proceedings of the National Academy of Sciences of the United States of America. 102, 14150-14154.
  38. Wu, WB (2011). Asymptotic theory for stationary processes. Statistics and Its Interface. 4, 207-226.
  39. Ye, F, and Zhang, CH (2010). Rate Minimaxity of the Lasso and Dantzig selector for the lq loss in lr balls. Journal of Machine Learning Research. 11, 3519-3540.
  40. Zhang, CH (2010a). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 38, 894-942.
  41. Zhang, CH, and Zhang, T (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 27, 576-593.
  42. Zhang, T (2010b). Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research. 11, 1081-1107.
  43. Zou, H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 101, 1418-1429.
  44. Zou, H, and Li, R (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 36, 1509-1533.