TEXT SIZE

CrossRef (0)
ADMM for least square problems with pairwise-difference penalties for coefficient grouping

Soohee Parka, Seung Jun Shin1,a

aDepartment of Statistics, Korea University, Republic of Korea
Correspondence to: 1Department of Statistics, Korea University, 145 Anam-Ro, Sungbuk-Gu, Seoul 02841, Korea. E-mail: sjshin@korea.ac.kr

This work is partially funded by the National Research Foundation of Korea (NRF) grants 2018R1D1A1B07043034 and 2019R1A4A1028134.
Received November 30, 2021; Revised January 3, 2022; Accepted January 20, 2022.
Abstract
In the era of bigdata, scalability is a crucial issue in learning models. Among many others, the Alternating Direction of Multipliers (ADMM, Boyd et al., 2011) algorithm has gained great popularity in solving large-scale problems eciently. In this article, we propose applying the ADMM algorithm to solve the least square problem penalized by the pairwise-di erence penalty, frequently used to identify group structures among coecients. ADMM algorithm enables us to solve the high-dimensional problem eciently in a unified fashion and thus allows us to employ several di erent types of penalty functions such as LASSO, Elastic Net, SCAD, and MCP for the penalized problem. Additionally, the ADMM algorithm naturally extends the algorithm to distributed computation and real-time updates, both desirable when dealing with large amounts of data.
Keywords : alternating direction of multipliers, grouping coefficients, real-time update, high-dimensional data
1. Introduction

Grouping coefficients in regression problems is an important but challenging task in many contemporary applications. In genomics, for example, grouping genes (i.e., variables) that play a similar role is often of interest in order to have a more precise understanding of how genes affect the human body.

The coefficients grouping can be regarded as the generalization of sparsity (Ke et al., 2015). There have been many studies for grouping coefficients through regularization. A canonical example is fused LASSO (Tibshirani et al., 2005) that penalizes the difference between adjacent coefficients. The fused LASSO requires the coefficients to be pre-ordered, but we may face many cases where we do not have prior information about ordering. Bondell and Reich (2008) applied pairwise L regulation, and Shen and Huang (2010) proposed regularization to the difference of the pairwise coefficients for data-driven grouping pursuit, both of which did not require the pre-ordering of the regression coefficients. Ke et al. (2015) proposed the CARDS algorithm based on a hybrid version of fused LASSO penalty and total variation penalty (Harchaoui and Leévy-Leduc, 2010) to detect homogeneous groups of regression coefficients.

This article focuses on the pairwise-difference penalty proposed by Shen and Huang (2010). For a given set of data y = (y1, y2, . . . , yn)T ∈ ℝn and X = (x1, x2, . . . , xn) ∈ ℝn×p, our goal is to solve the following problem with by the pairwise-difference penalty in order to to identify the grouping structure of regression coefficients.

minβ12y-Xβ22+i<jpλ(βi-βj),

where β = (β1, β2, . . . , βp)T and pλ(·) denotes a sparsity-pursing penalty function with a tuning parameter λ. There are various choices for the penalty function. We consider four popular penalties including LASSO (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), smoothly clipped absolute deviance (SCAD) (Fan and Li, 2001), and minimax concave penalty (MCP) (Zhang, 2010).

We should point out that (1.1) does not require preordering because it penalizes all pairwise distances between coefficients. However, due to the larger number of penalty terms, its computation may not always be straightforward. To tackle this, we propose to apply the Alternating Direction of Multipliers (ADMM) (Boyd et al., 2011) algorithm. The ADMM is an efficient algorithm, which can be summarized as the intersection of optimizing the primal problem and the dual problem based on the Lagrangian theory. In addition, the ADMM naturally allows us to extend to the distributed (or parallel) computation desirable for handling large-scale data. The ADMM algorithm has been applied in a wide variety of applications in penalized regressions. Wahlberg et al. (2012) proposed to solve the fused lasso problem with the ADMM. Zhu (2017) applied the augmented ADMM to solve the generalized lasso problems. Jeon et al. (2017) employed the ADMM to solve the grouping-coefficient-problem for the generalized linear model with non-convex penalty functions.

The paper is organized as follows. Section 2 provides a brief overview of the ADMM algorithm. In Section 3, we describe how the ADMM is applied to solve (1.1) with various penalty functions. The extension to distributed computation and real-time update for (1.1) for large data sets is described in Section 4. Section 5 illustrates the proposed algorithm to the simulated data and conclusions follow in Section 6.

2. Alternating direction method of multipliers

Consider the following linearly constrained optimization problem with a variable x ∈ ℝp, and two convex functions f, g:

minxf(x)+g(x).

Introducing an auxiliary variable z ∈ ℝp, (2.1) can be expressed as follows:

p*=minx,zf(x)+g(z)         subject to x=z.

The primal problem and its Lagrangian function of (2.2) is

L(x,z,a)=f(x)+g(z)+aT(x-z),

where a ∈ ℝp denote Lagrangian multipliers. By Lagrangian theory (Boyd et al., 2004), the dual problem of (2.1) is d*=maxad(a), where d(a) = infx,zL(x, z, a) is referred to as the dual function of (2.2). It can be readily known that d*p*. Under certain conditions, we have d* = p* (strong duality) under which one can use the dual problem to solve the primal problem (Boyd et al., 2004). The dual problem is often much simpler than the primal problem since the dual function is always convex even if the primal function is not. Support vector machine is a well-known example.

One natural way to solve the dual problem of (2.2) is the gradient ascent algorithm that iteratively updates the primal function and the dual parameter based on the gradient. Yet, it is known to be unstable (Boyd et al., 2011). As an alternative, the augmented Lagrangian introduces an additional L2 regularization term to stabilize the optimization as follows.

Lρ(x,z,a)=f(x)+g(z)+aT(x-z)+ρ2x-z22,

where ρ > 0 is a tuning parameter that controls the step size of the update based on the gradient. In addition, we equivalently rewrite (2.4) as the following rescaled form.

Lρ(x,z,u)=f(x)+g(z)+ρ2x-z+u22-ρ2u22,

where u = a/ρ is the scaled dual variable.

Note that (2.5) is additively separable. The ADMM algorithm iteratively updates x, z, u in an alternating way, which explains its name. Namely, the ADMM algorithm consists of the following three steps of iterations:

xt+1=arg minx         f(x)+(ρ2)x-zt+ut22,zt+1=arg minz         g(z)+(ρ2)xt+1-z+ut22,ut+1=ut+ρ(xt+1-zt+1).

We remark that both x and z can be efficiently updated using the proximal operator (Parikh and Boyd, 2014). The proximal operator of f (scaled by η) denoted with proxη f is defined by

proxηf(x)=arg minv[f(v)+12ηv-x22],

where η > 0 is a scale parameter. That is, proxf (x) can be interpreted as a minimizer of f near x. Employing the proximal operators we can respectively rewrite (2.6) and (2.7) as

xt+1=proxfρ(zt-ut)         and         zt+1=proxgρ(xt+1-ut).

This explains why ADMM is efficient when the complexity of proxf and proxg is much smaller than that of proxf+g.

Finally, we remark the convergence conditions for the ADMM algorithm. Boyd et al. (2011) showed that the ADMM algorithm converges under the following two assumptions:

A1. The functions f : ℝp → ℝ ∪ {+∞} and g : ℝm → ℝ ∪ {+∞} are closed, proper, convex.

A2. The unaugmented Lagrangian with ρ = 0 has a saddle point

The latter assumption A2 implies that the strong duality (Boyd et al., 2011).

During the iterations, the ADMM yields two kinds of residuals: the primal residual and the dual residual at iteration t + 1, which are respectively given by

rt+1=xt+1-zt+1         and         st+1=-ρ(zt+1-zt).

Both residuals converge to zero as the iterations proceed and thus the ADMM algorithm is terminated if rt+122 and st+122 are sufficiently small, i.e., rt+122<ɛ1 and ||st+1|| < ε2 for sufficiently small values of ε1 and ε2. In this article, we set

ɛ1=nɛabs+ɛrelmax{xk2,zk2}         and         ɛ2=nɛabs+ɛreluk2

with εabs = 10−4, εrel = 10−2 as suggested by Boyd et al., 2011.

3. Application of ADMM to the least square problem with the pairwise-difference penalty

In what follows, we apply the ADMM algorithm to solve (1.1) with four popular penalty functions including LASSO, elastic net, SCAD, and MCP. Toward this, (1.1) can be equivalently rewritten as follows

minβ12y-Xβ22+i<jpλ([Dβ]i),

where D ∈ ℝp(p−1)/2×p is a transformation matrix for the pairwise-difference such that Dβ = (β1β2, β1β3, . . . , βp−1βp)T ∈ ℝp(p−1)/2, and [v]i is the ith element of a vector v. To apply ADMM algorithm, we introduce the auxiliary variable z ∈ ℝp(p−1)/2 to write (3.1) as follows:

minβ,z12y-Xβ22+i=1p(p-1)/2pλ([z]i),         subject to Dβ=z.

The scaled version of augmented Lagrangian of (3.2) is given by

minβ,z12y-Xβ22+i=1p(p-1)/2pλ(zi)+ρ2Dβ-z+u22-ρ2u22.

The corresponding ADMM updating equations for β, z, and u are

βt+1=arg minβ12y-Xβ22+ρ2Dβ-zt+ut22,zt+1=arg minzi=1p(p-1)/2pλ([z]i)+ρ2Dβt+1-z+ut22,ut+1=ut+Dβt+1-zt+1.

Since (3.5) is straightforward, it is crucial to solve (3.4) and (3.5). First, we note (3.3) for β is a ridge-regression and has a closed-form solution as follows.

βt+1=arg minβ12y-Xβ22+ρ2Dβ-zt+ut22=(XTX+ρDTD)-1(XTy+ρDT(zt-ut)).

Next, (3.4) for z involves the penalty function pλ, and its solution differs according to the choice of penalty functions.

LASSO. For the LASSO penalty, i.e., pλ(θ) = λθ for θ ≥ 0, (3.4) becomes

zt+1=arg minzλz1+ρ2Dβt+1-z+ut22=Sλρ(Dβt+1+ut),

where S λ/ρ(a) = {S λ/ρ(ai) = sign(ai)[aiλ/ρ]+, i = 1, 2, . . . , dim(a)} denotes an element-wise soft-thresholding operator.

Elastic Net. The elastic net penalty is defined as pλ(θ) = λ{αθ + (1 − α)/2θ2} for a given α ∈ [0, 1]. Now, (3.4) with the elastic net is

zt+1=arg minzλ1z1+λ2z22+ρ2Dβt+1-z+ut22=Sλ1ρ(11-2λ2/ρ(Dβt+1+ut)),

where λ1 = αλ and λ2 = λ(1 − α)/2

pλ(θ)=λ{I(βλ)+[aλ-θ]+(a-1)λI(θ>λ)},         θ0

for a given a > 2. A popular choice of a is 3.7 (Fan and Li, 2001). The SCAD penalty is non-convex and (3.4) with the SCAD penalty is not as simple as the previous two convex penalties. There are several ways to solve the SCAD-penalized least square problem. In this article, we propose to apply the local linear approximation (LLA) (Zou and Li, 2008). The LLA algorithm employs the first order Taylor expansion to approximate the SCAD penalty as follows.

pλ(θ)pλ(θ0)+pλ(θ0)(θ-θ0),         for θθ0.

The LLA algorithm using (3.10) is known to be best convex approximation of the SCAD penalty function, and one-step update is enough to have a sensible estimator (Zou and Li, 2008). Employing (3.10) to approximate the SCAD penalty pλ(|zi|) at around zizit, (3.4) with SCAD penalty can be updated as follows.

zt+1=arg minzi=1p(p-1)/2pλ(zit)zi+ρ2Dβt+1-z+ut22=Swtρ(Dβt+1+ut),

where [Swt/ρ(Dβt+1+ut)]i=Swit/ρ([Dβt+1]i+uit) with [wt]i=wit=pλ(zit) for i = 1, 2, . . . , p(p−1)/2.

MCP. MCP is another popular non-convex penalty and defined through the derivative as follows:

pλ(θ)=[λ-θa]I(0θaλ),         θ0

LLA algorithm for (3.4) with MCP yields an identical solution (3.11), but with a differently defined wit=pλ(zit) from (3.12).

We remark that (3.4) for updating z can be expressed as a proximal operator of the penalty function P, where P(z)=i=1p(p-1)/2P([zi]).

zt+1=proxPρ(Dβt+1+ut).

Finally, Algorithm 1 provides a summary of the ADMMA algorithm to solve (1.1) with four popular penalty functions.

### 3.1. Grouping Coefficients

It is well-known that the ADMM algorithm does not produce an exact zero for the penalized estimator (Andrade et al., 2021). That is, β̂iβ̂j is not exactly zero for all i < j in our setting, where β̂i and β̂j denote the solution obtained by the proposed ADMM algorithm. However, the estimated auxiliary variable is sparse, and therefore one can perform a coefficient grouping based on the sparsity structure of (Chi and Lange, 2015).

Figure 1 illustrates the grouping results based on the sparsity of , with the LASSO penalty as λ varies, for the simulated data used in Section 5.1 (with n = 1000 and r = 1). The color indicates the true class. As the λ increases, the effect of the grouping penalty increases, and the number of resulting groups decreases. Note that the grouping result looks promising for a properly selected λ (= 2 in this example).

4. Extensions to large-scale data

This section describes two additional advantages of the use of the ADMM algorithm: distributed computation and real-time update, both of which are attractive features in the era of bigdata.

### 4.1. Consensus ADMM for Distributed Computing

For the distributed computation, we consider the global consensus version of (3.1). Suppose the data is divided into several, say K groups, i.e., y = (y1 : y2 : · · · : yK) and X = (X1 : X2 : · · · : XK). Then the global consensus problem of (3.1) is

minβk,z12k=1Kyk-Xkβk22+i=1p(p-1)/2pλ([z]i)subject to         Dβk-z=0         k=1,,K

where βk ∈ ℝp is the local parameters for (yk,Xk), k = 1, 2, . . . , K and z is a common global variable.

The ADMM algorithm is a natural choice to solve the global consensus problem (4.1). We call it consensus ADMM algorithm. The consensus ADMM for (4.1) consists of the following updating equations.

βkt+1=arg minβkyk-Xkβk22+ρ2Dβk-zt+ukt22,         k=1,,K,zt+1=arg minzi=1p(p-1)/2pλ([z]i)+Kρ2Dβ¯t+1-z+u¯t22,ukt+1=ukt+Dβkt+1-zt+1,         k=1,,K,

where uk denotes the scaled dual variable corresponding to (yk,Xy), k = 1, 2, . . . , K, and u¯=k=1Kuk/K and β¯=k=1Kβk/K.

The consensus ADMM for (4.1) is a canonical example of the optimization problems with distributed objective and constraint terms across multiple processors or computers. Each processor only has to deal with its own constraint optimization problem which can be separately updated at each iteration. However, the quadratic regularization term Dβk-z+uk22 forces the individual parameters βk converging to a common value z, which is the solution of the global problem.

The primal and dual residuals in the consensus ADMM are respectively given by

rt+1=(Dβ1t+1-zt+1,,DβKt+1-zt+1)         and         st+1=-ρ(zt+1-zt,,zt+1-zt),

and the algorithm is terminated if both rt22 and st22 are sufficiently small.

### 4.2. Real-time update for streamed data

Now, we assume stream data where we continuously collect data. For handling such stream data, it is desired to update the solution without using all data previously accumulated. This is called real-time update which becomes crucial technique to handle bigdata often collected in a continuous manner.

Suppose we have a set of old data . Let β̂O, zO, uO denote the solution of (3.2) for the old data denoted by . Now, we have a new set of data denoted by . The goal of real-time update is to compute the solution of (3.1) for the whole data, without saving whose sizes increases as more data accumulated.

We remark that ADMM algorithm naturally yields the real-time update for (3.1). This is because the ADMM algorithm updates loss and penalty terms separately. Note that we only require data for updating β, (3.3) which can be expressed as

βt+1=(XWTXW+ρDTD)-1(XWTyW+ρDT(zt-ut))=(XOTXO+XNTXN+ρDTD)-1(XOTyO+XNTyN+ρDT(zt-ut)).

We thus need to store XOTXO and XOTyO to update βt+1 for . We remark that the size of both XOTXO and XOTyO depends only on p, not the sample size of . That is, the proposed algorithm is scalable.

5. Simulation

In this section, we conduct a simulation study to investigate the performance of the proposed method.

### 5.1. Estimation Performance

We first compare the performance of the different penalty function of the PD penalized problems. First, we generate data from the linear regression model

yi=xiTβ+ɛi,         i=1,,n,

where xi are generated from the mean-zero p-variate normal distribution with a first-order autoregressive structure, i.e. Corr(xi j, xik) = 0.5|jk| (1 ≤ j, kp), and εi are from the standard normal distribution. We consider n = 200, 500, 1000 with p = 40. The true regression coefficients β = (β1, . . . , β40) consist of four groups according to their size, each of size 10. We set (group 1) βj = −2r for j = 1, 2, . . . , 10; (group 2) βj = −r for j = 11, 12, . . . , 20; βj = r for (group 3) j = 21, 21, . . . , 30; and (group 4) βj = 2r for j = 31, 32, . . . , 40 where r controls the signal size. We set r = 0.5, 1 for weak and strong signal cases, respectively. We select tuning parameter λ that maximizes Bayesian information criterion (BIC).

We measure the estimation performance of β̂ using mean squared error (MSE). For the grouping performance, we use the normalized mutual information (NMI) (Ke et al., 2015) that measures the distance between the estimated and true group structure (i.e., corresponding group labels). NMI is a standardized measure taking a value between 0 and 1, and NMI = 1 indicates that the two grouping results are identical. See Section 5 of Ke et al. (2015) for the details about NMI.

Figure 2 shows the comparison results when n = 200. The upper panels show the case when r = 0.5 and the lower panels show the case when r = 1. The unpenalized OLS estimator does not perform coefficient grouping while rest estimators with the pairwise difference penalty can identify the true cluster of coefficients. One can observe that the two non-convex penalties, SCAD and MCP, outperform all others in terms of coefficient grouping measured by NMI. The elastic net penalty performs better than the LASSO in this simulation, which is not surprising since the predictors are correlated. The unpenalized OLS is not bad in terms of estimation error, but the two non-convex penalties are still winners. Between the two, MCP seems slightly better than SCAD. The results for larger sample sizes n = 500 and 1000 are nearly identical and hence omitted to avoid redundancy.

### 5.2. Real-time update

In what follows, we illustrate the computational efficiency of the real-time update. Toward this, the data are generated in the same way as the previous simulation. To mimic the streamed-data scenario, we assume that the data arrives in batches. We generate 100 batches with 10000 observations in each batch. We then compare the computation times for the real-time ADMM introduced in Section 4.2 to the original ADMM algorithm applied to the whole data, . Figure 3 depicts the empirical computing time averaged over ten repetitions for the two methods. As more data arrive, the original algorithm takes longer to estimate parameters while the proposed real-time algorithm exhibits stable computing times regardless of the total data size. What matters in the real-time algorithm is the batch size, which is constant in this streamed-data setting. In conclusion, the computational efficiency of the real-time ADMM becomes substantial as more data is accumulated.

6. Conclusion

To solve penalized regression problems with the pairwise-difference penalty for coefficient grouping, we propose using the ADMM algorithm. The proposed ADMM algorithm not only makes optimization much easier by updating parameters in an alternative way, but it is also general enough to handle a wide range of penalty functions. Furthermore, it naturally leads to the extension of distributed computing and real-time updates, both desirable for handling large amounts of data.

Although we focus on the conventional regression problem with the squared loss problem in this article, we can extend it to other popular problems such as generalized linear model and quantile regression. The ADMM algorithm is still powerful as long as updating the regression coefficient without penalty is simple.

Figures
Fig. 1. Illustration of Coefficient-Grouping via () with the LASSO penalty for different values of .
Fig. 2. NMI and MSE values for selected models in experiments where n = 200 under independent 50 simulations.
Fig. 3. Comparison of computing times between the original versus real-time ADMM algorithm. The range of computing times over ten repetitions is represented by shaded areas.
TABLES

### Algorithm 1

 1: initialize β0 ← 0p, z0 ← 0p(p−1)/2, u0 ← 0p(p−1)/2 2: repeat 3: Update βt+1 from (3.6) 4: Update zt+1 from 5: – LASSO: (3.7) 6: – Elastic Net: (3.8) 7: – SCAD: (3.11) with wit based on (3.9). 8: – MCP: (3.11) with wit based on (3.12). 9: Update ut+1 from (3.5) 10: rt+1 = Dβt+1 − zt+1, st+1 = −ρDT (zt+1 − zt) 11: until‖rt+1‖22≤ɛ1,‖st+1‖22≤ɛ2 for sufficiently small ε1 and ε2 given, as suggested in (2.9).

References
1. Andrade D, Fukumizu K, and Okajima Y. (2021) Convex covariate clustering for classification. Pattern Recognition Letters, 151, 193-199.
2. Bondell HD and Reich BJ (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with oscar. Biometrics, 64 (1), 115-123.
3. Boyd S, Boyd SP, and Vandenberghe L (2004). Convex optimization, Cambridge university press.
4. Boyd S, Parikh N, and Chu E (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers, Now Publishers Inc..
5. Chi EC and Lange K (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24 (4), 994-1013.
6. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96 (456), 1348-1360.
7. Harchaoui Z and Lévy-Leduc C (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105 (492), 1480-1493.
8. Jeon J-J, Kwon S, and Choi H (2017). Homogeneity detection for the high-dimensional generalized linear model. Computational Statistics & Data Analysis, 114, 61-74.
9. Ke ZT, Fan J, and Wu Y (2015). Homogeneity pursuit. Journal of the American Statistical Association, 110 (509), 175-194.
10. Parikh N and Boyd S (2014). Proximal algorithms. Foundations and Trends in optimization, 1 (3), 127-239.
11. Shen X and Huang H-C (2010). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105 (490), 727-739.
12. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58 (1), 267-288.
13. Tibshirani R, Saunders M, Rosset S, Zhu J, and Knight K (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (1), 91-108.
14. Wahlberg B, Boyd S, Annergren M, and Wang Y (2012). An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45 (16), 83-88.
15. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38 (2), 894-942.
16. Zhu Y (2017). An augmented ADMM algorithm with application to the generalized lasso problem. Journal of Computational and Graphical Statistics, 26 (1), 195-204.
17. Zou H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B, 67 (2), 301-320.
18. Zou H and Li R (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of statistics, 36 (4), 1509-1533.