Comparison of algorithms for estimating regression splines using lasso penalty
Communications for Statistical Applications and Methods 2025;32:21-46
Published online January 31, 2025
© 2025 Korean Statistical Society.

Eun-Ji Lee^a, Jong-Beom Park^a, Dong-Young Lee^a, Jae-Hwan Jhong^1,a

aDepartment of Information Statistics, Chungbuk National University, Korea
Correspondence to: 1 Department of Information Statistics, Chungbuk National University, Chungdae-ro 1, Seowon-gu, Cheongju, Chungbuk 28644, Korea. E-mail: jjh25@chungbuk.ac.kr
Received July 3, 2024; Revised September 5, 2024; Accepted October 2, 2024.
 Abstract
The Lasso regression model is a penalty-based model that allows for variable selection from high-dimensional data. When Lasso regression is combined with a truncated power spline, the problem can be viewed as one of knot selection from a nonparametric regression perspective. Using three algorithms (the coordinate descent algorithm, quadratic programming, and the alternating direction method of multipliers), we compare their performance in fitting Lasso regression spline models. Additionally, we present a comparative experiment on the convergence of the coordinate descent algorithm according to the maximum number of iterations and the stopping threshold value in the glmnet function of R. We also compare the convergence rate of the alternating direction method of multipliers under various penalty parameter values with that of the coordinate descent algorithm. Through simulations and two real data analyses, we conduct numerical studies to verify the performance of the algorithms. By comparing their performance under various conditions, this study ultimately provides recommendations for the most suitable algorithm for different situations.
Keywords : alternating direction method of multipliers, coordinate descent algorithm, knot selection, lasso penalty, quadratic programming
1. Introduction

With the advent of the big data era, the amount and variety of data are increasing rapidly, making it difficult to estimate the true function. When trying to determine which variables influence a specific quantity of interest, and to what extent, one of the representative statistical methods is the Lasso (least absolute shrinkage and selection operator) regression model (Tibshirani, 1996). This model adds a penalty term to the residual sum of squares of a traditional regression model, facilitating variable selection. In a truncated power spline regression model, if an ℓ1-norm penalty is imposed on the parameters corresponding to the basis functions, the problem can be interpreted as one of knot selection. Given data, just as statistical techniques select the variables that influence the response from among all candidate variables, the spline regression model must select significant knots from among the knot candidates placed within the range of the predictor. Therefore, this paper focuses on knot selection rather than variable selection, adapting the Lasso method to the context of knot selection and enabling the identification of significant knots.

Much research has been conducted on spline regression models for knot selection (Osborne et al. (1998), Jhong et al. (2017), Jhong and Koo (2019), Oh and Jhong (2021), Lee and Jhong (2021), Lee and Jhong (2022), Lee et al. (2022)). In particular, Osborne et al. (1998), Lee and Jhong (2021), and Lee and Jhong (2022) applied a Lasso penalty and used a truncated power spline basis.

Among the various methods to address Lasso regression problems, the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), the coordinate descent algorithm (CDA) (Wright, 2015), and quadratic programming (QP) (Goldfarb and Idnani, 1983) are widely recognized for their efficiency. In this study, we apply the methods to build a Lasso regression model based on truncated power splines and identify which conditions and algorithms are the most efficient in terms of running time and suitability. The methods being compared include ADMM, CDA, and QP. The ADMM algorithm is particularly effective in handling large-scale Lasso regression problems. The CDA has proven to be very efficient for managing high-dimensional data. On the other hand, the QP approach, as detailed in the work of Osborne et al. (2000), leverages the structure of quadratic problems to provide precise solutions for Lasso regression. Despite the availability of these advanced methods, there remains a need for comprehensive comparisons to guide the selection of the most appropriate algorithm for specific situations. By evaluating their performance in terms of suitability and running time, we aim to provide recommendations for their use in different contexts. The results of this comparison will highlight the strengths and weaknesses of each algorithm and offer valuable insights for future research in Lasso regression modeling.

The ADMM algorithm and CDA are implemented directly in the statistical program R, while the QP approach is implemented using the solve.QP function in the quadprog package of Weingessel (2011). Additionally, the study utilizes the glmnet function from the R package glmnet, which also employs CDA, so that the four implementations can be compared across various simulation conditions under the same settings. We also examine the fit of truncated power spline regression with a Lasso penalty using two real datasets.

This paper is organized as follows: In Section 2, we present the Lasso regression splines model. The algorithms for comparison are introduced in Section 3, including the glmnet package, which uses the CDA method. In Sections 4 and 5, we validate the performance of our proposed model using simulations and real data. Finally, we summarize the conclusions of this study in Section 6. All numerical analyses were conducted using the statistical software R.

2. Lasso regression splines model

For i = 1, …, n, consider the multiple regression model

$$y_i = \sum_{j=1}^{p} \delta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \tag{2.1}$$

where the {ε_i}'s are independent and identically distributed. Here, y_i is the ith sample's response, x_{ij} is the ith sample's jth predictor, and δ_j is the parameter associated with the jth predictor. Note that, for convenience, we assume y_i is centered, so the estimate of the intercept term is 0 and it is omitted. When building a linear regression model using high-dimensional data with many variables, numerical instability is likely, making it difficult to find a unique solution for the regression coefficients. In such cases, the Lasso penalty is generally considered. The Lasso penalty can be used to build sparse models, which increases interpretability. The Lasso estimator can be expressed either through an explicit constraint or through the Lagrange multiplier method. Let $\hat{\delta}^{\mathrm{lasso}} = (\hat{\delta}_1^{\mathrm{lasso}}, \ldots, \hat{\delta}_p^{\mathrm{lasso}})$; then the former is

$$\hat{\delta}^{\mathrm{lasso}} = \underset{\delta \in \mathbb{R}^p}{\operatorname{argmin}} \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}\delta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\delta_j| \le t, \tag{2.2}$$

where t > 0 and the latter is

$$\hat{\delta}^{\mathrm{lasso}} = \underset{\delta \in \mathbb{R}^p}{\operatorname{argmin}} \left\{ \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}\delta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\delta_j| \right\}, \tag{2.3}$$

where λ > 0. As the tuning parameter t decreases or λ increases, the estimated regression coefficients shrink toward 0.

Meanwhile, given paired data $\{(x_i, y_i)\}_{i=1}^{n}$, we would like to estimate a function f representing the relationship between x_i and y_i. Consider the nonparametric regression model

$$y_i = f(x_i) + \varepsilon_i \qquad \text{for} \quad i = 1, \ldots, n,$$

where $x = (x_1, \ldots, x_n) \in [0, 1]^n$ are predictors, $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ are responses, and $\varepsilon_1, \ldots, \varepsilon_n$ are independent errors with mean zero and positive variance.

To estimate f, we use a truncated power spline basis with K interior knots, $t_0 = 0 < t_1 < \cdots < t_K < t_{K+1} = 1$:

$$f(x; \alpha, \beta) = \alpha_0 + \sum_{j=1}^{d} \alpha_j x^j + \sum_{k=1}^{K} \beta_k P_{d,k}(x), \tag{2.4}$$

where α = (α0, α1, …, αd) and β = (β1, …, βK). If d = 0,

$$P_{0,k}(x_i) = \begin{cases} 1, & \text{if } x_i \in [t_{k-1}, t_k) \\ 0, & \text{otherwise.} \end{cases}$$

For d > 0,

$$P_{d,k}(x_i) = (x_i - t_k)_+^{d}.$$

Note that $(a)_+$ denotes the positive part of a real number a, equal to a if a > 0 and 0 otherwise. The number of initial knots K for the splines is chosen first, and the knots are then placed at equal intervals within (0, 1).
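To make the basis concrete, the following R sketch constructs the polynomial matrix and the truncated power basis matrix on equally spaced interior knots; the helper names tp_basis and poly_mat are our own illustrative choices, not part of any package.

```r
# Sketch of the basis construction (helper names are illustrative).
# tp_basis returns the n x K matrix with entries P_{d,k}(x_i):
# the interval indicator basis when d = 0 and (x_i - t_k)_+^d when d > 0.
tp_basis <- function(x, K, d) {
  t_full <- seq(0, 1, length.out = K + 2)   # t_0 = 0 < t_1 < ... < t_K < t_{K+1} = 1
  sapply(1:K, function(k) {
    if (d == 0) {
      as.numeric(x >= t_full[k] & x < t_full[k + 1])   # indicator of [t_{k-1}, t_k)
    } else {
      pmax(x - t_full[k + 1], 0)^d                     # (x - t_k)_+^d
    }
  })
}

# Polynomial part X = [1, x, x^2, ..., x^d]
poly_mat <- function(x, d) outer(x, 0:d, `^`)

x <- seq(0, 1, length.out = 200)
P <- tp_basis(x, K = 30, d = 3)    # spline part
X <- poly_mat(x, d = 3)            # polynomial part
```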

Model (2.4) is a form of basis expansion using a truncated power spline basis when a univariate x is given. If these bases are considered as variables, model (2.4) is equivalent to a multiple regression model (2.1). Just as we can select significant variables by applying the Lasso penalty, we select “significant knot(s)” by applying the Lasso penalty in equation (2.5).

The empirical risk of (α, β) is defined as

$$R(\alpha, \beta) = \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - f(x_i; \alpha, \beta) \big)^2.$$

The objective function to be minimized is given by

$$R_{\lambda}(\alpha, \beta) = R(\alpha, \beta) + \lambda \|\beta\|_1 \tag{2.5}$$

for λ > 0, where ‖·‖₁ denotes the ℓ1-norm. Let

$$(\hat{\alpha}, \hat{\beta}) = \underset{\alpha, \beta}{\operatorname{argmin}}\, R_{\lambda}(\alpha, \beta), \tag{2.6}$$

and our goal is then to estimate the Lasso regression model f based on a truncated power spline:

$$\hat{f} = f(\,\cdot\,; \hat{\alpha}, \hat{\beta}).$$

In equation (2.3), if δ̂j = 0, the corresponding predictor xj is not considered in the model. In other words, it can be examined from a variable selection perspective. We can select knots by applying a Lasso penalty to the regression coefficients βk’s associated with each knot in equation (2.5). Similarly, in equation (2.6), if β̂k = 0, the corresponding interior knot is not selected. Therefore, just as the simplicity of the model is maintained in the multiple regression model (2.1), the smoothness of the function f, which is the target of estimation, can be adjusted in equation (2.4).

3. Method

In this study, we apply alternating direction method of multipliers (ADMM), coordinate descent algorithm (CDA), and quadratic programming (QP) to solve the problem (2.6).

3.1. ADMM

The ADMM, which is a more recent methodology than CDA and QP, is an algorithm intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. To apply ADMM, the optimization problem (2.6) can be expressed in vector-matrix form with an additional constraint as follows:

$$\min \left\{ \frac{1}{2n} \| y - X\alpha - P\beta \|_2^2 + \lambda \|\gamma\|_1 \right\} \quad \text{subject to} \quad \beta = \gamma,$$

where ‖·‖₂ is the ℓ2-norm, $\gamma = (\gamma_1, \ldots, \gamma_K) \in \mathbb{R}^K$,

$$X = \begin{bmatrix} \mathbf{1} & x & x^2 & \cdots & x^d \end{bmatrix} \in \mathbb{R}^{n \times (1+d)},$$

and

$$P = \begin{bmatrix} P_{d,1}(x) & \cdots & P_{d,K}(x) \end{bmatrix} \in \mathbb{R}^{n \times K}.$$

X is a matrix related to polynomial terms, and P is a matrix generated by a truncated power spline. Then the augmented Lagrangian is

$$L(\alpha, \beta, \gamma, u) = \frac{1}{2n} \| y - X\alpha - P\beta \|_2^2 + \lambda \|\gamma\|_1 + \frac{\xi}{2} \|\beta - \gamma + u\|_2^2,$$

where ξ > 0.

Then, ADMM steps are as follows. For m = 1, …, M,

$$\begin{aligned}
\tilde{\alpha}^{(m)} &\leftarrow (X^\top X)^{-1}\big(X^\top y - X^\top P \tilde{\beta}^{(m-1)}\big),\\
\tilde{\beta}^{(m)} &\leftarrow \Big(\tfrac{1}{n} P^\top P + \xi I\Big)^{-1}\Big(\tfrac{1}{n} P^\top\big(y - X\tilde{\alpha}^{(m)}\big) + \xi\big(\tilde{\gamma}^{(m-1)} - \tilde{u}^{(m-1)}\big)\Big),\\
\tilde{\gamma}^{(m)} &\leftarrow \mathrm{ST}\big(\tilde{\beta}^{(m)} + \tilde{u}^{(m-1)},\ \lambda/\xi\big),\\
\tilde{u}^{(m)} &\leftarrow \tilde{u}^{(m-1)} + \tilde{\beta}^{(m)} - \tilde{\gamma}^{(m)},
\end{aligned}$$

where M is the maximum number of iterations. Here, $\tilde{\alpha}^{(m)}$, $\tilde{\beta}^{(m)}$, $\tilde{\gamma}^{(m)}$, and $\tilde{u}^{(m)}$ denote the mth updated values of α, β, γ, and u. The vector $u \in \mathbb{R}^K$ is an additional parameter required in the updates.

Herein, we stop iteration when

$$\Big| R_{\lambda}\big(\tilde{\alpha}^{(m)}, \tilde{\beta}^{(m)}\big) - R_{\lambda}\big(\tilde{\alpha}^{(m-1)}, \tilde{\beta}^{(m-1)}\big) \Big| < \varepsilon,$$

where ε is a sufficiently small positive number. Additionally, ST(·) is a soft-thresholding operator (Friedman et al., 2010) and is defined as follows:

For λ > 0 and a ∈ ℝ,

$$\mathrm{ST}(a, \lambda) = \begin{cases} a + \lambda, & \text{if } a < -\lambda \\ 0, & \text{if } |a| \le \lambda \\ a - \lambda, & \text{if } a > \lambda. \end{cases}$$

Meanwhile, the convergence rate of the algorithm depends on the value of the penalty parameter ξ (Boyd et al., 2011; Xu et al., 2017; McCann and Wohlberg, 2024). If the penalty parameter value is too large, the proportion of minimizing the objective function will be reduced, and if it is too small, the constraints will not be well satisfied.

In our case, we set the following update steps.

$$\begin{aligned}
r^{(m)} &= \beta^{(m)} - \gamma^{(m)},\\
s^{(m)} &= \xi^{(m)}\big(\gamma^{(m)} - \gamma^{(m-1)}\big),\\
\xi^{(m+1)} &= \begin{cases} 2\xi^{(m)}, & \text{if } \|r^{(m)}\|_2 > 10\,\|s^{(m)}\|_2 \\ \xi^{(m)}/2, & \text{if } \|s^{(m)}\|_2 > 10\,\|r^{(m)}\|_2 \\ \xi^{(m)}, & \text{otherwise,} \end{cases}
\end{aligned}$$

where $r^{(m)}$ is the primal residual at the mth iteration and $s^{(m)}$ is the dual residual at the mth iteration. Note that $\tilde{\alpha}^{(0)}$, $\tilde{\beta}^{(0)}$, $\tilde{\gamma}^{(0)}$, and $\tilde{u}^{(0)}$ are zero vectors, which serve as the initial values of α, β, γ, and u, respectively. Also, the initial value $\xi^{(0)}$ is set to λ.
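As a minimal sketch of the updates above for a single λ, the following R code can be used; the function name admm_lasso_spline, the initial value ξ = λ, and the simple objective-based stopping rule follow Section 3.1, but this is an illustration under those assumptions rather than the authors' exact implementation.

```r
# Soft-thresholding operator ST(a, lambda)
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

# Illustrative ADMM for the Lasso regression spline at a fixed lambda.
admm_lasso_spline <- function(y, X, P, lambda, M = 1e4, eps = 1e-5) {
  n <- length(y); K <- ncol(P)
  beta <- gamma <- u <- rep(0, K)
  xi <- lambda                                  # initial penalty parameter xi^(0) = lambda
  XtX_inv <- solve(crossprod(X))
  obj_old <- Inf
  for (m in 1:M) {
    alpha <- XtX_inv %*% crossprod(X, y - P %*% beta)
    beta  <- as.vector(solve(crossprod(P) / n + xi * diag(K),
                             crossprod(P, y - X %*% alpha) / n + xi * (gamma - u)))
    gamma_old <- gamma
    gamma <- soft_threshold(beta + u, lambda / xi)
    u     <- u + beta - gamma
    # stop when the change in the objective R_lambda(alpha, beta) is below eps
    obj <- sum((y - X %*% alpha - P %*% beta)^2) / (2 * n) + lambda * sum(abs(beta))
    if (abs(obj_old - obj) < eps) break
    obj_old <- obj
    # residual-based update of xi (primal residual r, dual residual s)
    r <- beta - gamma
    s <- xi * (gamma - gamma_old)
    if (sqrt(sum(r^2)) > 10 * sqrt(sum(s^2)))      xi <- 2 * xi
    else if (sqrt(sum(s^2)) > 10 * sqrt(sum(r^2))) xi <- xi / 2
  }
  list(alpha = as.vector(alpha), beta = beta, gamma = gamma)
}
```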

3.2. CDA

The CDA optimizes the objective function with respect to a single coefficient at a time, iteratively cycling through all coefficients until convergence is reached.

For j = 0, 1, …, d and k = 1, 2, …, K, let α̃j and β̃k denote the initial/updated values. Specifically,

$$\tilde{\alpha} = (\tilde{\alpha}_0, \tilde{\alpha}_1, \ldots, \tilde{\alpha}_d) \in \mathbb{R}^{1+d}, \qquad \tilde{\beta} = (\tilde{\beta}_1, \ldots, \tilde{\beta}_K) \in \mathbb{R}^{K}.$$

For notational convenience, we define

$$\tilde{\theta} = (\tilde{\alpha}, \tilde{\beta}) = (\tilde{\theta}_1, \ldots, \tilde{\theta}_\ell, \ldots, \tilde{\theta}_{1+d+K}) \in \mathbb{R}^{1+d+K}.$$

Thus, the coordinate descent update is written in the following form:

$$\tilde{\theta}_\ell \leftarrow \underset{\theta}{\operatorname{argmin}}\, R_{\lambda}\big(\tilde{\theta}_1, \ldots, \tilde{\theta}_{\ell-1}, \theta, \tilde{\theta}_{\ell+1}, \ldots, \tilde{\theta}_{1+d+K}\big) \qquad \text{for} \quad \ell = 1, \ldots, 1+d+K.$$

Finally, α and β are updated as follows.

$$\begin{aligned}
\alpha_j &\leftarrow \frac{\frac{1}{n}\sum_{i=1}^{n} r_{i,\alpha}^{(j)}\, x_i^{j}}{\frac{1}{n}\sum_{i=1}^{n} \big[x_i^{j}\big]^2} \qquad \text{for } j = 0, 1, \ldots, d,\\[4pt]
\beta_k &\leftarrow \frac{\mathrm{ST}\Big(\frac{1}{n}\sum_{i=1}^{n} r_{i,\beta}^{(k)}\, P_{d,k}(x_i),\; \lambda\Big)}{\frac{1}{n}\sum_{i=1}^{n} \big[P_{d,k}(x_i)\big]^2} \qquad \text{for } k = 1, 2, \ldots, K.
\end{aligned}$$

It is noteworthy that $x_i^0 = 1$. Here, $r_{i,\alpha}^{(j)}$ and $r_{i,\beta}^{(k)}$ are the partial residuals for α and β, respectively, and

$$\begin{aligned}
r_{i,\alpha}^{(j)} &= y_i - \sum_{\ell \neq j} \tilde{\alpha}_\ell\, x_i^{\ell} - \sum_{k=1}^{K} \tilde{\beta}_k\, P_{d,k}(x_i),\\
r_{i,\beta}^{(k)} &= y_i - \sum_{j=0}^{d} \tilde{\alpha}_j\, x_i^{j} - \sum_{\ell \neq k} \tilde{\beta}_\ell\, P_{d,\ell}(x_i).
\end{aligned}$$

If the difference between the objective function values of the current iteration and the previous iteration is less than ε, the iteration stops.
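A minimal R sketch of these coordinate-wise updates for a fixed λ is given below; it reuses the soft_threshold function from the ADMM sketch above, and the function name cda_lasso_spline is ours. It cycles through the coordinates and stops when the decrease of the objective falls below ε. This is not the glmnet implementation.

```r
# Illustrative coordinate descent for the Lasso regression spline.
cda_lasso_spline <- function(y, X, P, lambda, M = 1e4, eps = 1e-5) {
  n <- length(y); p1 <- ncol(X); K <- ncol(P)
  alpha <- rep(0, p1); beta <- rep(0, K)
  fitted <- rep(0, n)
  obj_old <- Inf
  for (m in 1:M) {
    for (j in 1:p1) {                                      # update alpha_0, ..., alpha_d
      r_j <- y - fitted + alpha[j] * X[, j]                # partial residual for alpha_j
      alpha_new <- mean(r_j * X[, j]) / mean(X[, j]^2)
      fitted <- fitted + (alpha_new - alpha[j]) * X[, j]
      alpha[j] <- alpha_new
    }
    for (k in 1:K) {                                       # update beta_1, ..., beta_K
      r_k <- y - fitted + beta[k] * P[, k]                 # partial residual for beta_k
      beta_new <- soft_threshold(mean(r_k * P[, k]), lambda) / mean(P[, k]^2)
      fitted <- fitted + (beta_new - beta[k]) * P[, k]
      beta[k] <- beta_new
    }
    obj <- sum((y - fitted)^2) / (2 * n) + lambda * sum(abs(beta))
    if (abs(obj_old - obj) < eps) break
    obj_old <- obj
  }
  list(alpha = alpha, beta = beta)
}
```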

3.3. QP

The QP is one of the methods for solving mathematical optimization problems involving quadratic functions. Unlike the ADMM algorithm and CDA, it has the characteristic of obtaining a solution in a single step without iterating the algorithm.

For c ∈ ℝ,

$$c_+ = \begin{cases} c, & \text{if } c \ge 0 \\ 0, & \text{if } c < 0, \end{cases} \qquad \text{and} \qquad c_- = \begin{cases} -c, & \text{if } c \le 0 \\ 0, & \text{if } c > 0. \end{cases}$$

Then, we can write $c = c_+ - c_-$ and $|c| = c_+ + c_-$. By reparameterizing $\beta_k$,

$$\beta_k = (\beta_k)_+ - (\beta_k)_-, \qquad |\beta_k| = (\beta_k)_+ + (\beta_k)_-.$$

Therefore, the optimization problem is equivalent to minimizing

$$\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \sum_{j=0}^{d}\alpha_j x_i^{j} - \sum_{k=1}^{K}\big\{(\beta_k)_+ - (\beta_k)_-\big\}\, P_{d,k}(x_i)\right)^2$$

subject to

$$\sum_{k=1}^{K}\big\{(\beta_k)_+ + (\beta_k)_-\big\} \le t,$$

in addition to 2 × K non-negativity constraints

$$(\beta_k)_+ \ge 0, \qquad (\beta_k)_- \ge 0 \qquad \text{for} \quad k = 1, \ldots, K.$$

Then,

$$Z = \begin{bmatrix} X & P & -P \end{bmatrix} \in \mathbb{R}^{n \times (1+d+2K)},$$

and

$$b = \begin{bmatrix} \alpha \\ \beta^{+} \\ \beta^{-} \end{bmatrix} \in \mathbb{R}^{1+d+2K},$$

where $\beta^+ = \{(\beta_k)_+\}$ and $\beta^- = \{(\beta_k)_-\}$. Thus,

$$R(\theta) = \frac{1}{2n}(y - Zb)^\top (y - Zb) = \frac{1}{2n} y^\top y - \frac{1}{n} y^\top Z b + \frac{1}{2n} b^\top (Z^\top Z) b.$$

Therefore, the minimization problem is equivalent to

$$\min_{b} \left( -\frac{1}{n} y^\top Z b + \frac{1}{2n} b^\top (Z^\top Z) b \right)$$

subject to

$$\begin{bmatrix} 0_{(1+d)\times 1} & 0_{(1+d)\times K} & 0_{(1+d)\times K} \\ -\mathbf{1}_{K\times 1} & I_{K\times K} & 0_{K\times K} \\ -\mathbf{1}_{K\times 1} & 0_{K\times K} & I_{K\times K} \end{bmatrix}^{\!\top} \begin{bmatrix} \alpha \\ \beta^{+} \\ \beta^{-} \end{bmatrix} \ge \begin{bmatrix} -t \\ 0_{K\times 1} \\ 0_{K\times 1} \end{bmatrix}.$$

Note that in the QP method, unlike the ADMM algorithm and CDA, when estimating the parameters α and β, the optimization problem (2.6) is transformed into a problem that takes constraints into account, as shown in (2.2). Specifically, the constraints of the optimization problem (3.1) are as follows:

$$\begin{bmatrix} -\sum_{k=1}^{K}(\beta_k)_+ - \sum_{k=1}^{K}(\beta_k)_- \\ \beta^{+} \\ \beta^{-} \end{bmatrix} \ge \begin{bmatrix} -t \\ 0_{K\times 1} \\ 0_{K\times 1} \end{bmatrix}.$$

The first constraint corresponds to the Lasso penalty, and the second and third constraints correspond to non-negativity.

To solve the optimization problem (3.1), the solve.QP function of the quadprog package is used in the statistical program R. Dmat, dvec, bvec, and Amat are required as arguments for this function. It solves the optimization problem $\min_x\big(-\text{dvec}^\top x + \tfrac{1}{2} x^\top \text{Dmat}\, x\big)$ subject to the constraints $\text{Amat}^\top x \ge \text{bvec}$. Therefore, according to these arguments, the optimization problem (3.1) can be expressed as follows:

$$\text{Dmat} \leftarrow \frac{1}{n}\big(Z^\top Z\big), \qquad \text{dvec} \leftarrow \frac{1}{n} y^\top Z, \qquad \text{bvec} \leftarrow \begin{bmatrix} -t \\ 0_{2K\times 1} \end{bmatrix},$$

and

$$\text{Amat} \leftarrow \begin{bmatrix} 0_{(1+d)\times 1} & 0_{(1+d)\times K} & 0_{(1+d)\times K} \\ -\mathbf{1}_{K\times 1} & I_{K\times K} & 0_{K\times K} \\ -\mathbf{1}_{K\times 1} & 0_{K\times K} & I_{K\times K} \end{bmatrix}.$$

Finally, the solution of b is obtained as

$$\hat{b} = \begin{bmatrix} \hat{\alpha} \\ \hat{\beta}^{+} \\ \hat{\beta}^{-} \end{bmatrix},$$

and the estimator is expressed as

$$\hat{\beta}_k = (\hat{\beta}_k)_+ - (\hat{\beta}_k)_- \qquad \text{for} \quad k = 1, \ldots, K.$$
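The following R sketch assembles these arguments and calls solve.QP for a given bound t; the function name qp_lasso_spline is ours, and the small ridge added to Dmat is an implementation choice we make here because solve.QP requires a strictly positive definite matrix, whereas ZᵀZ is rank deficient by construction (P and −P are linearly dependent).

```r
library(quadprog)

# Illustrative QP fit of the Lasso regression spline for a given bound t.
qp_lasso_spline <- function(y, X, P, t) {
  n <- length(y); p1 <- ncol(X); K <- ncol(P)
  Z <- cbind(X, P, -P)                                     # Z = [X, P, -P]
  Dmat <- crossprod(Z) / n
  Dmat <- Dmat + 1e-8 * diag(ncol(Z))                      # small ridge for positive definiteness
  dvec <- as.vector(crossprod(Z, y)) / n
  # t(Amat) %*% b >= bvec encodes the lasso bound and the 2K non-negativity constraints
  Amat <- rbind(matrix(0, p1, 1 + 2 * K),
                cbind(-1, diag(K), matrix(0, K, K)),
                cbind(-1, matrix(0, K, K), diag(K)))
  bvec <- c(-t, rep(0, 2 * K))
  sol <- solve.QP(Dmat, dvec, Amat, bvec)$solution
  alpha <- sol[1:p1]
  beta  <- sol[(p1 + 1):(p1 + K)] - sol[(p1 + K + 1):(p1 + 2 * K)]   # beta = beta+ - beta-
  list(alpha = alpha, beta = beta)
}
```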

3.4. Algorithm details

We implement the algorithms using the R program. To efficiently select the tuning parameter, we first find λmax, the smallest λ for which all β̂_k are estimated to be 0. This is because β̂_k shrinks to 0 as λ increases, and any λ larger than λmax likewise gives β̂_k = 0 for all k. In our case, the formula for calculating λmax is as follows (O'Brien, 2016):

$$\lambda_{\max} = \max_{k} \frac{1}{n} \big| \big\langle P_{d,k}(x),\, y \big\rangle \big|,$$

where $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$. Thereafter, we generate the λ candidates. After setting λratio, a sufficiently small number, and nλ, the number of λ candidates, the interval from log(λmax) to log(λmax × λratio) is split into nλ equally spaced values. Next, we transform the candidates back to the exponential scale and replace the smallest candidate value with 0. Once the λ candidates have been obtained, we estimate the nλ models using the ADMM algorithm and CDA.
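A minimal sketch of this λ-grid construction in R is shown below; the default values n_lambda = 100 and lambda_ratio = 1e-4 are illustrative assumptions of ours, not values prescribed by the text.

```r
# Illustrative lambda grid following Section 3.4.
lambda_grid <- function(y, P, n_lambda = 100, lambda_ratio = 1e-4) {
  n <- length(y)
  lambda_max <- max(abs(crossprod(P, y))) / n        # smallest lambda with all beta_k = 0
  lambdas <- exp(seq(log(lambda_max), log(lambda_max * lambda_ratio),
                     length.out = n_lambda))
  lambdas[n_lambda] <- 0                             # replace the smallest candidate with 0
  lambdas
}
```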

Here, we specify the maximum number of iterations M and repeat until the convergence criterion is satisfied. After convergence or M iterations, the best model is selected among the nλ models generated. In statistical data analysis, there are established criteria for what is "best". Some representative information criteria are, for example, the Akaike information criterion (AIC) (Bozdogan, 1987), the Bayesian information criterion (BIC) (Schwarz, 1978), and cross-validation (CV). Compared with the BIC, the AIC and CV tend to select a model with relatively large variance and small bias. In general, as the sample size n increases, BIC yields a sparser model than AIC (Anderson and Burnham, 2004). In addition, BIC can be more effective in selecting the optimal model as the sample size increases, because it helps prevent overfitting better than AIC. Because the purpose of our study is to select a concise model among models that vary with the tuning parameter, we use BIC, which is defined as follows:

$$\mathrm{BIC} = n \log\big(R(\hat{\theta})\big) + p_{\mathrm{act}} \log(n),$$

where $p_{\mathrm{act}}$ is the number of non-zero β̂_k's. We select the best model as the one that minimizes the BIC. Note that Stone (1977) showed that asymptotically, minimizing the AIC is equivalent to minimizing the CV value.
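A small R helper computing this criterion might look as follows; the tolerance used to decide which coefficients count as non-zero is an assumption of ours.

```r
# BIC of a fitted model: n * log(empirical risk) + p_act * log(n).
bic_value <- function(y, fitted, beta, tol = 1e-8) {
  n <- length(y)
  risk <- sum((y - fitted)^2) / (2 * n)      # empirical risk R(theta-hat)
  p_act <- sum(abs(beta) > tol)              # number of non-zero beta_k's
  n * log(risk) + p_act * log(n)
}
```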

Meanwhile, in QP, t is the tuning parameter. Contrary to λ, as t increases, β̂_k approaches the least squares estimate (LSE). Therefore, tmax is obtained analogously to λmax:

$$t_{\max} = \sum_{k=1}^{K} \big| \hat{\beta}_k^{\mathrm{LSE}} \big|.$$

Similarly, we determine the number of t candidates, n_t, and a sufficiently small number tratio. The interval from log(tmax) to log(tmax × tratio) is then divided into n_t equally spaced values, which are converted back to the exponential scale.

4. Simulations

In this section, we analyze the performance using simulated examples based on model (2.1) to compare the four algorithms shown in Section 3. For convenience, we generate the predictor as a sequence of n equally spaced values and the K knots at regular intervals in the range [0, 1]. ε_i is generated from N(0, σ²) for i = 1, …, n, and σ² is set differently for each example.

The stopping threshold value ε is utilized to determine the convergence criterion of the algorithm. We set this to 10^-5 to ensure that the algorithm converges sufficiently. M is employed to establish the maximum number of iterations for the algorithm. By setting a limit on the number of iterations, we prevent the algorithm from running indefinitely in cases where it does not converge. The λratio parameter determines the range of λ values by setting the ratio between the maximum and minimum λ values. This adjustment ensures that the regularization parameter is neither excessively large nor too small. The nλ parameter specifies the number of λ values to be used.

We employ 100 λ values in total to explore a wide range of regularization strengths. We generated models for the various λ values and evaluated their performance based on BIC to select the optimal model. In addition, the simulation is repeated 100 times for each example.

To implement the four algorithms, the statistical program R is used. ADMM and CDA are coded directly, and glmnet refers to the glmnet function of the glmnet package. In R, glmnet is a package (Friedman et al., 2010; Simon et al., 2013; Friedman et al., 2021; Tay et al., 2023) that fits generalized linear and similar models via penalized maximum likelihood, and the glmnet function estimates coefficients using CDA. However, since its stopping rule differs from the one considered in our ADMM and CDA implementations, we include the glmnet function as an additional point of comparison. Finally, to implement the algorithm using QP, parameters are estimated using the solve.QP function of the quadprog package.
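For reference, a hedged sketch of how the glmnet function can be applied to the spline design is shown below, reusing the tp_basis and poly_mat helpers from Section 2. The penalty.factor argument is used here so that only the knot coefficients β are penalized; the simulated data and the specific thresh and maxit values are illustrative choices, not the authors' exact settings.

```r
library(glmnet)

set.seed(1)
n <- 400; K <- 40; d <- 3
x <- seq(0, 1, length.out = n)
y <- cos(2 * pi * x^2) + rnorm(n, sd = 0.2)                 # Example 3-type data

design <- cbind(poly_mat(x, d)[, -1], tp_basis(x, K, d))    # drop the constant column; glmnet adds its own intercept
pf <- c(rep(0, d), rep(1, K))                               # leave the polynomial part unpenalized
fit <- glmnet(design, y, alpha = 1, penalty.factor = pf,
              standardize = FALSE, thresh = 1e-7, maxit = 1e5)
coef_path <- coef(fit)   # coefficients along the lambda path; one model is then chosen by BIC
```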

Four examples are given as follows:

  • Example 1. Constant function

    $$f(x) = \sum_{j} h_j\, K(x - x_j), \qquad K(c) = \big(1 + \operatorname{sgn}(c)\big)/2 \ \text{ for } c \in \mathbb{R}, \qquad \operatorname{sgn}(c) = \begin{cases} 1, & \text{if } c > 0 \\ 0, & \text{if } c = 0 \\ -1, & \text{if } c < 0, \end{cases}$$
    where $x_j = (0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81)$ and $h_j = (4, -5, 3, -4, 5, -4.2, 2.1, 4.3, -3.1, 5.1, -4.2)$.

  • Example 2. Linear function

    $$f(x) = \begin{cases} -3x + 3.0, & \text{if } 0.0 \le x < 0.3 \\ 3x + 1.2, & \text{if } 0.3 \le x < 0.5 \\ -4x + 4.7, & \text{if } 0.5 \le x < 0.8 \\ x + 0.7, & \text{if } 0.8 \le x \le 1.0 \end{cases}$$

  • Example 3. Simple cubic function

    $$f(x) = \cos\big(2\pi x^2\big)$$

  • Example 4. Complex cubic function

    $$f(x) = \big(x(1-x)\big)^{1/2} \sin\big(2\pi(1+\tau)/(x+\tau)\big), \qquad \tau = 0.05$$

Figure 1 represents the true functions of Examples 1–4. The top-left plot is for Example 1, which has a constant shape. The top-right plot is for Example 2, which has a continuous piecewise linear shape. The bottom-left plot is for Example 3, which has a simple cubic shape using a trigonometric function. The bottom-right plot is for Example 4, which has a complex cubic shape. Example 1 and Example 4 utilize functions of Donoho and Johnstone (1994).

To compare the performance of the models, we consider the mean squared error (MSE), the mean absolute error (MAE), and the maximum deviation (MXDV) criterion as loss functions. These criteria measure the discrepancy between the true function f and each estimated function f̂. They are expressed as follows.

$$\mathrm{MSE}(\hat{f}) = \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - \hat{f}(x_i)\big)^2, \qquad \mathrm{MAE}(\hat{f}) = \frac{1}{n}\sum_{i=1}^{n}\big|f(x_i) - \hat{f}(x_i)\big|, \qquad \text{and} \qquad \mathrm{MXDV}(\hat{f}) = \max_{1 \le i \le n}\big|f(x_i) - \hat{f}(x_i)\big|.$$
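These criteria are straightforward to compute on the design points; a short R sketch (function names ours) is:

```r
# Discrepancy between the true function f and an estimate f_hat on the design points.
mse  <- function(f_true, f_hat) mean((f_true - f_hat)^2)
mae  <- function(f_true, f_hat) mean(abs(f_true - f_hat))
mxdv <- function(f_true, f_hat) max(abs(f_true - f_hat))
```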

In addition, we use running time to compare the performance of the models. Comparing running times can help in selecting the most optimal algorithm for each set of different conditions.

In Example 1, n is set to 200 and 400, K to 30 and 40, σ to 0.01 and 0.2, ε to 10^-5, and M to 10,000. For glmnet, ε is set to 10^-7 and M to 10^5 by default. Figure 2 shows that the fits of all four algorithms are almost identical to the true f, and Table 1 shows that the values of MSE, MAE, and MXDV are similar. Although the fits are similar, there is a difference in running time, which increases in the order glmnet < QP < ADMM << CDA. In particular, in the case of CDA, it can be seen that it exceeds 10,000 seconds. Note that our CDA is an R user-defined function implementing the coordinate descent updates derived above, whereas glmnet gives easy access to Fortran-based code that is highly efficient for numerical calculations in R. That is why glmnet is at least about 86 times faster than CDA even though it uses the same algorithm. Additionally, the glmnet function uses a different stopping rule than our ADMM and CDA implementations. For these reasons, it is meaningful to compare whether the two implementations are equivalent in terms of fit and convergence rate even though they use the same algorithm.

In Example 2, n is set to 200 and 400, K to 20 and 40, σ to 0.05 and 0.15, ε to 10^-5, and M to 10,000. Looking at Figure 3, the four algorithms appear to fit well, and they seem to estimate the true function better than in Example 1. It is confirmed that the performance indicators of MSE, MAE, and MXDV are all good and similar, as seen in Table 2. As in Example 1, the running time is long in the order of glmnet < QP < ADMM << CDA.

In Example 3, n is set to 200 and 400, K to 20 and 40, σ to 0.05 and 0.2, ε to 10^-5, and M to 10,000. Looking at Figure 4, unlike the previous examples, the fitting performance of QP clearly differs. In Table 3, the MSE value of QP is about ten times larger than the MSE values of the other algorithms, and the MAE and MXDV of QP are also about five times larger than those of the other algorithms. In Figure 4, it can also be observed that glmnet does not fit well when x ∈ [0.9, 1.0]. Therefore, the MSE, MAE, and MXDV values of glmnet are slightly higher compared to those of ADMM and CDA. In terms of running time, similar to the previous examples, CDA takes the longest, followed by ADMM, glmnet, and QP.

In Example 4, n is set to 200 and 400, K to 20 and 40, σ to 0.05 and 0.4, ε to 10^-5, and M to 10,000. In Figure 5, distinct differences begin to emerge. When x ∈ [0, 0.1], none of the four algorithms performs perfectly. However, when x ∈ [0.1, 0.5], ADMM shows the best fitting result among the four. In Table 4, the values of MSE, MAE, and MXDV of ADMM are the smallest among the four algorithms. Both CDA and glmnet perform similarly when x ∈ [0.1, 0.5], with their MSE, MAE, and MXDV values also being at similar levels. On the other hand, QP generally does not perform well, with its MSE, MAE, and MXDV values being the highest among the four algorithms. Moreover, compared with the other algorithms, ADMM does not show a large increase in execution time when moving to the complex cubic function.

Meanwhile, in Example 4, ADMM appears to estimate the true function well, whereas the fitting results of CDA, glmnet, and QP are far from the true function. This is because, despite using the same basis, ADMM can control the convergence rate with its penalty parameter. In this regard, Figure 6 shows a natural characteristic of using truncated power splines. Figure 6 plots the basis functions created when the degree is 0, 1, 2, and 3, respectively. As the degree increases, the overlap between the basis functions increases at large values of the support. For this reason, the basis functions are correlated in the process of estimating the solution, which causes convergence to take a long time or causes problems in estimating the solution.

Note that Figure 7 shows the glmnet fitting results according to the two factors: M and ε for coordinate descent. Figure 7 shows that as M increases and ε decreases, the fitted value gradually converges between x values of 0.4 and 0.6. This observation suggests that achieving convergence in glmnet may require more iterations compared to ADMM under certain conditions.

Conversely, when the penalty parameter is adjusted, ADMM is an attractive algorithm that converges faster than CDA. Additionally, for a more objective comparison between ADMM and CDA, we compare the change in MSE per iteration over 100 repetitions. In this experiment, the same data as in Example 4 are used, with n = 400, σ = 0.4, and K = 40. Meanwhile, M = 500 (m = 1, …, 500) and ε = 0 are set to force the algorithm to run without stopping. Under these settings, in addition to the ADMM with the ξ update method that we used and CDA, we also compare the results when ξ in ADMM is fixed at a specific constant value (ξ = 10^-4, 10^-6, 10^-8).

In Figure 8, for each repetition r = 1, …, 100, the x-axis represents the mth iteration and the y-axis represents the corresponding MSE. The plots were drawn at the same specific λ value for the four ADMM variants and CDA, and the average over the 100 lines per method is also plotted. First, comparing ADMM (ξ update) and CDA, both algorithms show a fast convergence pattern. In particular, ADMM (ξ update) is clearly faster than CDA; when the stopping rule is applied, it actually exits the iterative algorithm in only 2 to 3 iterations, whereas CDA shows a monotonically decreasing pattern. Next, to understand the importance of the choice of the penalty parameter, we compare the ξ update with the cases where ξ is fixed at a specific value. As the ξ value decreases, the algorithm overfits in the first iteration and then gradually converges. Although ADMM (ξ = 10^-4) and ADMM (ξ = 10^-6) are non-monotone, ADMM (ξ = 10^-4) appears to converge at around the 200th iteration, and ADMM (ξ = 10^-6) is also expected to converge after a few more iterations. However, ADMM (ξ = 10^-8) appears to need many more iterations to converge. In this example, ADMM (ξ update) shows the best performance. In summary, the convergence rate of ADMM varies greatly depending on ξ, which suggests that the choice of ξ has a strong impact on convergence.

5. Real data analysis

In this section, based on the simulation results, we analyze two real data sets. The first is penny, and the second is WWWusage. Both data sets are R built-in data sets.

5.1. Penny data

We utilize the penny dataset included in the locfit package (Loader and Liaw, 2024). The penny dataset contains 90 observations with two variables: year and thickness. The year variable is a discrete variable representing the year when the coin was minted, while the thickness variable is a continuous variable representing the thickness of the coin in millimeters. In Figure 9, the x-axis corresponds to the predictor variable, year, and the y-axis corresponds to the response variable, thickness. For the fitting, K = 30, M = 10^4, and ε = 10^-5 are set for all methods except glmnet, for which M is set to 10^5.

The penny's thickness is not strongly affected by time and shows little variation. As shown in Figure 9, the plot in the top-left has a general trend of increasing thickness from the mid-1940s to the early 1970s. After that, the thickness decreased to around 53 (mm), then gradually increased again. The degree of the function for algorithm estimation is zero in fitting the penny data. Using a degree of zero is motivated by the simplicity of the dataset: it is more efficient for this simple data structure, minimizing overfitting and enhancing model robustness.

The four plots below in Figure 9 show that, except for glmnet, the algorithms perform at a similarly moderate level. From the mid-1940s to the early 1970s, they fit the changes in the two large data clusters with a simple shape. From the mid-1970s to the 1980s, they do not capture the sharp decline in the data, instead fitting only the average decrease based on the changes in the data clusters. However, glmnet fits the general trend of increasing thickness from the mid-1940s to the early 1970s well. Additionally, from the mid-1970s to the 1980s, it also fits the gradual increase following the sharp decline well.

5.2. WWWusage data

We utilize the WWWusage dataset, which is included in the base R package. WWWusage is a time series dataset that tracks the number of users connected to the Internet over a specific period. The WWWusage dataset contains a single variable, usage, which represents the number of users connected to the Internet. The dataset consists of 100 observations, recorded at one-minute intervals. In Figure 10, the x-axis corresponds to the predictor variable, minute and the y-axis corresponds to the response variable, number of users.

As shown in Figure 10, the plot in the top-left has a gradual upward trend until the mid-30th minute. Afterward, the trend decreases until around the 40th minute. This is followed by a sharp increase until the mid-50th minute. Subsequently, a decreasing trend is observed until around the 70th minute. From that point onward, there is a consistent upward trend until the final observation. Overall, compared to the penny dataset, the curve of the WWWusage dataset is much smoother. In the proposed algorithm, min-max normalization is performed on the 'minute' variable to avoid errors in numerical calculations, ensuring the predictor variable is within the range [0, 1]. The degree of the function for algorithm estimation is three in fitting the WWWusage data. Using a degree of three is driven by the complexity and variability within the dataset: higher-degree polynomials allow the model to better adapt to the fluctuations and trends. For the fitting, K = 20, ε = 10^-5, and M = 10^4 are set, except that M of glmnet is 10^6. In the four plots below in Figure 10, ADMM provides a good overall fit, closely following the data across almost all intervals. Both CDA and glmnet capture only the general increasing trend before the sharp decline in the late 50-minute range. However, after 70 minutes, both glmnet and CDA provide good fits for the increasing trend. QP does not fit the rapid fluctuations from the beginning to the early 50-minute range. It follows the increasing trend only up to around the 40th minute, fails to capture the sharp increase, and shows a decreasing trend until the 70-minute mark. After that, it follows only the general shape of the increasing trend.

6. Conclusion

We assumed a nonparametric regression equation to estimate the constant, linear, and cubic functions. To solve the problem (2.6), we adopt various methods: ADMM, CDA, and QP. After generating 100 λ’s and fitting the parameters with each algorithm, the best-fit regression model was selected based on the BIC. This process was repeated 100 times, and their performance indicators, such as average MSE, MAE, MXDV, and running time were compared.

For the simpler forms, the constant and linear functions, the fitting performance of the four algorithms was found to be similar. In terms of running time, glmnet performed the best. For the simple cubic function, differences in fitting performance were observed, with glmnet and QP not fitting as well as ADMM or CDA. However, in terms of running time, CDA took approximately 3,000 times longer than ADMM. For the complex cubic function, as the number of knots increased, ADMM performed relatively well in terms of both fitting and running time compared to the other algorithms.

Through the simulation, we found that the CDA algorithm has a slow convergence rate in the case of data with complex nonlinear shapes. It was determined that this was due to an issue of overlapping support that occurred when using a truncated power spline as the basis function. In conclusion, we suggest a direction for selecting appropriate algorithms under various conditions. We recommend using glmnet for simpler forms and ADMM for more complex forms.

Acknowledgement
The research of Eun-Ji Lee was supported by Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education (RS-2024-00406939). The research of Jae-Hwan Jhong was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government (MSIT) (RS-2024-00342014, RS-2024-00440787).
Figures
Fig. 1. These plots are examples of the constant function (top-left), linear function (top-right), simple cubic function (bottom-left), and complex cubic function (bottom-right).
Fig. 2. Plots for Example 1 when n = 400, σ = 0.20, K = 40, ε = 10^-5, and M = 10,000. Gray dots, data points; black line, true function; black points, fitted values from the final iteration.
Fig. 3. Plots for Example 2 when n = 400, σ = 0.15, K = 40, ε = 10^-5, and M = 10,000. Gray dots, data points; black line, true function; black points, fitted values from the final iteration.
Fig. 4. Plots for Example 3 when n = 400, σ = 0.20, K = 40, ε = 10^-5, and M = 10,000. Gray dots, data points; black line, true function; black points, fitted values from the final iteration.
Fig. 5. Plot for Example 4 when n = 400, σ = 0.40, K = 40, ε = 10^-5, and M = 10,000. Gray dots, data points; black line, true function; black points, fitted values from the final iteration.
Fig. 6. This figure shows the truncated power spline basis according to degree. The top-left is a constant spline, the top-right is a linear spline, the bottom-left is a quadratic spline, and the bottom-right is a cubic spline.
Fig. 7. This figure shows the fitting results according to M and ε (thresh in the plot), which determine convergence in the glmnet function.
Fig. 8. This plot shows the results of the four ADMM variants and CDA according to the ξ values. The x-axis represents the iteration, and the y-axis represents the corresponding MSE.
Fig. 9. The plot in the top-left represents the penny data. The x-axis corresponds to the predictor variable, year, i.e., each year from 1945 to 1989. The y-axis corresponds to the response variable, thickness, with units in mm. There are 90 gray data points. The four plots below represent each algorithm fitting the penny data. Gray dots, data points; black lines, fitted lines.
Fig. 10. The plot in the top left represents the WWWusage data. The x-axis corresponds to the predictor variable, minute, i.e., each minute from 1 to 100. The y-axis corresponds to the response variable, number of users. There are 100 gray data points. The four plots below represent each algorithm fitting the WWWusage data. Gray dots, data points; black lines, fitted lines.
TABLES

Table 1

The average of each criterion for over 100 runs of ADMM, CDA, glmnet and QP in Example 1 with the standard error in parentheses and a sample size n = 200, 400 and σ = 0.01, 0.2

| n | σ | Evaluation | ADMM | CDA | glmnet | QP |
|---|---|---|---|---|---|---|
| 200 | 0.01 | MSE | 1.646×10^-1 (5.713×10^-5) | 1.647×10^-1 (1.685×10^-4) | 1.656×10^-1 (1.464×10^-3) | 1.684×10^-1 (9.674×10^-5) |
| 200 | 0.01 | MAE | 1.849×10^-1 (1.311×10^-3) | 1.871×10^-1 (2.171×10^-3) | 1.930×10^-1 (7.610×10^-3) | 2.079×10^-1 (6.363×10^-4) |
| 200 | 0.01 | MXDV | 1.731×10^-1 (4.305×10^-3) | 1.730×10^-1 (4.242×10^-3) | 1.730×10^-1 (4.179×10^-3) | 1.730×10^-1 (4.239×10^-3) |
| 200 | 0.01 | Running time (s) | 41.013 | 430.218 | 5.313 | 6.920 |
| 200 | 0.20 | MSE | 1.742×10^-1 (4.029×10^-3) | 1.757×10^-1 (4.074×10^-3) | 1.734×10^-1 (4.106×10^-3) | 1.778×10^-1 (5.727×10^-3) |
| 200 | 0.20 | MAE | 2.272×10^-1 (9.954×10^-3) | 2.296×10^-1 (1.025×10^-2) | 2.265×10^-1 (1.055×10^-2) | 2.327×10^-1 (1.248×10^-2) |
| 200 | 0.20 | MXDV | 1.762×10^-1 (7.193×10^-2) | 1.759×10^-1 (7.201×10^-2) | 1.758×10^-1 (7.231×10^-2) | 1.759×10^-1 (7.195×10^-2) |
| 200 | 0.20 | Running time (s) | 44.286 | 453.275 | 5.507 | 6.890 |
| 400 | 0.01 | MSE | 1.859×10^-1 (2.590×10^-6) | 1.862×10^-1 (4.502×10^-4) | 1.865×10^-1 (7.865×10^-4) | 1.885×10^-1 (6.532×10^-5) |
| 400 | 0.01 | MAE | 1.799×10^-1 (3.255×10^-4) | 1.854×10^-1 (4.151×10^-3) | 1.870×10^-1 (6.257×10^-3) | 1.885×10^-1 (4.331×10^-4) |
| 400 | 0.01 | MXDV | 1.983 (3.326×10^-3) | 1.985 (3.153×10^-3) | 1.986 (3.164×10^-3) | 1.984 (3.150×10^-3) |
| 400 | 0.01 | Running time (s) | 62.925 | 759.264 | 6.004 | 15.769 |
| 400 | 0.20 | MSE | 1.927×10^-1 (3.831×10^-3) | 1.964×10^-1 (5.690×10^-3) | 1.931×10^-1 (4.278×10^-3) | 1.974×10^-1 (6.808×10^-3) |
| 400 | 0.20 | MAE | 2.167×10^-1 (9.614×10^-3) | 2.240×10^-1 (1.204×10^-2) | 2.183×10^-1 (9.627×10^-3) | 2.259×10^-1 (1.367×10^-3) |
| 400 | 0.20 | MXDV | 1.994 (6.398×10^-2) | 1.996 (6.306×10^-2) | 1.998 (6.287×10^-2) | 1.995 (6.308×10^-2) |
| 400 | 0.20 | Running time (s) | 67.794 | 1561.989 | 4.550 | 16.543 |

The unit of running time is seconds.


Table 2

The average of each criterion for over 100 runs of ADMM, CDA, glmnet and QP in Example 2 with the standard error in parentheses and a sample size n = 200, 400 and σ = 0.05, 0.15

| n | σ | Evaluation | ADMM | CDA | glmnet | QP |
|---|---|---|---|---|---|---|
| 200 | 0.05 | MSE | 3.095×10^-4 (6.852×10^-5) | 2.752×10^-4 (7.852×10^-5) | 3.228×10^-4 (9.008×10^-5) | 4.074×10^-4 (1.127×10^-4) |
| 200 | 0.05 | MAE | 1.301×10^-2 (1.734×10^-3) | 1.145×10^-2 (2.119×10^-3) | 1.290×10^-1 (2.144×10^-3) | 1.323×10^-2 (2.227×10^-3) |
| 200 | 0.05 | MXDV | 6.460×10^-2 (9.687×10^-3) | 6.887×10^-2 (9.897×10^-3) | 6.843×10^-2 (1.109×10^-2) | 8.219×10^-2 (1.038×10^-2) |
| 200 | 0.05 | Running time (s) | 18.122 | 6060.121 | 5.838 | 3.109 |
| 200 | 0.15 | MSE | 1.273×10^-3 (5.050×10^-4) | 1.421×10^-3 (5.390×10^-4) | 1.699×10^-3 (6.636×10^-4) | 1.657×10^-3 (7.597×10^-4) |
| 200 | 0.15 | MAE | 2.752×10^-2 (6.168×10^-3) | 2.878×10^-2 (6.303×10^-3) | 3.147×10^-2 (6.773×10^-3) | 3.041×10^-2 (7.557×10^-3) |
| 200 | 0.15 | MXDV | 9.988×10^-2 (2.181×10^-2) | 1.087×10^-1 (2.391×10^-2) | 1.211×10^-1 (3.105×10^-2) | 1.202×10^-1 (2.699×10^-2) |
| 200 | 0.15 | Running time (s) | 18.704 | 10118.288 | 7.377 | 3.132 |
| 400 | 0.05 | MSE | 1.233×10^-4 (4.070×10^-5) | 1.017×10^-4 (4.643×10^-5) | 1.884×10^-4 (6.421×10^-5) | 2.346×10^-4 (9.956×10^-5) |
| 400 | 0.05 | MAE | 8.155×10^-3 (1.582×10^-3) | 6.983×10^-3 (1.774×10^-3) | 9.644×10^-3 (1.540×10^-3) | 9.976×10^-3 (2.532×10^-3) |
| 400 | 0.05 | MXDV | 4.456×10^-2 (7.139×10^-3) | 4.544×10^-2 (7.926×10^-3) | 5.304×10^-2 (8.794×10^-3) | 6.531×10^-2 (8.542×10^-3) |
| 400 | 0.05 | Running time (s) | 38.511 | 11994.427 | 6.743 | 16.322 |
| 400 | 0.15 | MSE | 4.548×10^-4 (2.593×10^-4) | 7.079×10^-4 (3.453×10^-4) | 5.426×10^-4 (4.121×10^-4) | 7.847×10^-4 (5.972×10^-4) |
| 400 | 0.15 | MAE | 1.563×10^-2 (3.966×10^-3) | 1.984×10^-2 (4.903×10^-3) | 1.627×10^-2 (4.887×10^-3) | 2.031×10^-2 (7.240×10^-3) |
| 400 | 0.15 | MXDV | 7.232×10^-2 (2.003×10^-2) | 7.989×10^-2 (2.058×10^-2) | 1.663×10^-1 (3.422×10^-2) | 7.988×10^-2 (2.245×10^-2) |
| 400 | 0.15 | Running time (s) | 41.176 | 21368.337 | 9.491 | 17.210 |

The unit of running time is seconds.


Table 3

The average of each criterion for over 100 runs of ADMM, CDA, glmnet and QP in Example 3 with the standard error in parentheses and a sample size n = 200, 400 and σ = 0.05, 0.2

| n | σ | Evaluation | ADMM | CDA | glmnet | QP |
|---|---|---|---|---|---|---|
| 200 | 0.05 | MSE | 1.468×10^-4 (8.651×10^-5) | 2.017×10^-4 (9.837×10^-5) | 5.490×10^-4 (1.415×10^-4) | 1.940×10^-2 (1.799×10^-4) |
| 200 | 0.05 | MAE | 9.333×10^-3 (2.686×10^-3) | 1.040×10^-2 (2.536×10^-3) | 1.808×10^-2 (2.457×10^-3) | 1.124×10^-1 (1.426×10^-3) |
| 200 | 0.05 | MXDV | 3.470×10^-2 (1.474×10^-2) | 5.451×10^-2 (2.453×10^-2) | 8.501×10^-2 (1.658×10^-2) | 5.505×10^-1 (1.396×10^-2) |
| 200 | 0.05 | Running time (s) | 6.363 | 17692.403 | 3.575 | 1.904 |
| 200 | 0.20 | MSE | 2.954×10^-3 (5.149×10^-3) | 2.042×10^-3 (1.178×10^-3) | 2.943×10^-3 (1.274×10^-3) | 2.302×10^-2 (2.676×10^-3) |
| 200 | 0.20 | MAE | 3.760×10^-2 (2.362×10^-2) | 3.371×10^-2 (9.784×10^-3) | 4.254×10^-2 (9.692×10^-3) | 1.278×10^-1 (1.110×10^-2) |
| 200 | 0.20 | MXDV | 1.478×10^-1 (8.882×10^-2) | 1.481×10^-1 (7.437×10^-2) | 1.711×10^-1 (6.591×10^-2) | 4.770×10^-1 (7.405×10^-2) |
| 200 | 0.20 | Running time (s) | 6.540 | 19046.376 | 3.600 | 1.797 |
| 400 | 0.05 | MSE | 6.297×10^-5 (2.580×10^-5) | 1.053×10^-4 (5.354×10^-5) | 5.704×10^-4 (1.400×10^-4) | 1.157×10^-2 (8.007×10^-4) |
| 400 | 0.05 | MAE | 6.125×10^-3 (1.359×10^-3) | 7.558×10^-3 (1.818×10^-3) | 1.801×10^-2 (2.415×10^-3) | 8.339×10^-2 (3.288×10^-3) |
| 400 | 0.05 | MXDV | 2.519×10^-2 (9.013×10^-3) | 4.318×10^-2 (1.837×10^-2) | 9.274×10^-2 (1.468×10^-2) | 4.827×10^-1 (1.733×10^-2) |
| 400 | 0.05 | Running time (s) | 14.471 | 45454.877 | 4.388 | 10.318 |
| 400 | 0.20 | MSE | 1.209×10^-3 (2.583×10^-3) | 1.151×10^-3 (6.202×10^-4) | 2.058×10^-3 (7.298×10^-4) | 2.030×10^-2 (1.225×10^-3) |
| 400 | 0.20 | MAE | 2.480×10^-2 (1.308×10^-2) | 2.539×10^-2 (7.034×10^-3) | 3.579×10^-2 (7.061×10^-3) | 1.181×10^-2 (6.343×10^-3) |
| 400 | 0.20 | MXDV | 1.049×10^-1 (6.113×10^-2) | 1.246×10^-1 (5.668×10^-2) | 1.657×10^-1 (6.375×10^-2) | 5.281×10^-1 (5.102×10^-2) |
| 400 | 0.20 | Running time (s) | 14.287 | 47062.536 | 4.441 | 9.643 |

The unit of running time is seconds.


Table 4

The average of each criterion for over 100 runs of ADMM, CDA, glmnet and QP in Example 4 with the standard error in parentheses and a sample size n = 200, 400 and σ = 0.05, 0.4

| n | σ | Evaluation | ADMM | CDA | glmnet | QP |
|---|---|---|---|---|---|---|
| 200 | 0.05 | MSE | 1.408×10^-1 (1.605×10^-4) | 4.572×10^-1 (1.452×10^-3) | 4.694×10^-1 (9.089×10^-3) | 7.916×10^-1 (1.308×10^-3) |
| 200 | 0.05 | MAE | 1.863×10^-1 (1.559×10^-3) | 4.431×10^-1 (8.955×10^-4) | 4.606×10^-1 (6.040×10^-3) | 7.200×10^-1 (1.402×10^-3) |
| 200 | 0.05 | MXDV | 1.588 (1.481×10^-2) | 1.997 (9.172×10^-3) | 1.957 (2.232×10^-2) | 1.958 (5.708×10^-3) |
| 200 | 0.05 | Running time (s) | 17.109 | 17562.529 | 49.381 | 2.590 |
| 200 | 0.40 | MSE | 1.582×10^-1 (5.170×10^-3) | 4.919×10^-1 (3.887×10^-2) | 4.905×10^-1 (1.671×10^-2) | 8.071×10^-1 (2.075×10^-2) |
| 200 | 0.40 | MAE | 2.480×10^-1 (1.200×10^-2) | 4.900×10^-1 (4.009×10^-2) | 4.9046×10^-1 (1.632×10^-2) | 7.313×10^-1 (1.658×10^-2) |
| 200 | 0.40 | MXDV | 1.594 (9.123×10^-2) | 1.979 (7.078×10^-2) | 1.960 (7.410×10^-2) | 1.941 (4.811×10^-2) |
| 200 | 0.40 | Running time (s) | 17.135 | 17706.054 | 40.187 | 2.564 |
| 400 | 0.05 | MSE | 8.010×10^-2 (6.071×10^-5) | 4.584×10^-1 (3.130×10^-3) | 4.795×10^-1 (2.340×10^-2) | 7.870×10^-1 (7.067×10^-3) |
| 400 | 0.05 | MAE | 1.184×10^-1 (1.367×10^-3) | 4.435×10^-1 (2.097×10^-3) | 4.765×10^-1 (1.610×10^-2) | 7.209×10^-1 (6.422×10^-3) |
| 400 | 0.05 | MXDV | 1.456 (1.650×10^-2) | 2.006 (1.194×10^-2) | 1.931 (4.342×10^-2) | 1.961 (1.052×10^-2) |
| 400 | 0.05 | Running time (s) | 34.909 | 42723.804 | 65.845 | 11.769 |
| 400 | 0.40 | MSE | 9.635×10^-2 (3.849×10^-3) | 5.616×10^-1 (5.499×10^-2) | 5.141×10^-1 (2.463×10^-2) | 7.875×10^-1 (6.964×10^-3) |
| 400 | 0.40 | MAE | 1.879×10^-1 (1.123×10^-2) | 5.531×10^-1 (5.936×10^-2) | 5.127×10^-1 (2.102×10^-2) | 7.211×10^-1 (8.248×10^-3) |
| 400 | 0.40 | MXDV | 1.533 (9.103×10^-2) | 1.966 (9.155×10^-2) | 1.940 (6.290×10^-2) | 1.964 (3.708×10^-2) |
| 400 | 0.40 | Running time (s) | 34.829 | 42749.774 | 64.719 | 11.504 |

The unit of running time is seconds.


References
  1. Anderson D and Burnham K (2004). Model Selection and Multi-Model Inference (2nd ed), New York, Springer-Verlag.
  2. Boyd S, Parikh N, Chu E, Peleato B, and Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3, 1-122.
  3. Bozdogan H (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
  4. Donoho DL and Johnstone IM (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425-455.
  5. Friedman J, Hastie T, and Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1-22.
  6. Friedman J, Hastie T, Tibshirani R, et al. (2021). Package 'glmnet'.
  7. Goldfarb D and Idnani A (1983). A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27, 1-33.
  8. Jhong JH and Koo JY (2019). Simultaneous estimation of quantile regression functions using B-splines and total variation penalty. Computational Statistics & Data Analysis, 133, 228-244.
  9. Jhong JH, Koo JY, and Lee SW (2017). Penalized B-spline estimator for regression functions using total variation penalty. Journal of Statistical Planning and Inference, 184, 77-93.
  10. Lee EJ, Byambaa A, and Jhong JH (2022). A data analysis of natural gas prices using penalized median regression splines. Journal of the Korean Data & Information Science Society, 33, 1125-1140.
  11. Lee EJ and Jhong JH (2021). Change point detection using penalized multidegree splines. Axioms, 10, 331.
  12. Lee EJ and Jhong JH (2022). Mixed degree regression function estimation using weighted lasso penalty. Journal of the Korean Data & Information Science Society, 33, 517-531.
  13. Loader C and Liaw MA (2024). Package 'locfit'.
  14. McCann MT and Wohlberg B (2024). Robust and simple ADMM penalty parameter selection. IEEE Open Journal of Signal Processing, 5, 402-420.
  15. O'Brien CM (2016). Statistical learning with sparsity: The lasso and generalizations.
  16. Oh JK and Jhong JH (2021). Pliable regression spline estimator using auxiliary variables. Communications for Statistical Applications and Methods, 28, 537-551.
  17. Osborne MR, Presnell B, and Turlach BA (1998). Knot selection for regression splines via the lasso. Computing Science and Statistics, 44-49.
  18. Osborne MR, Presnell B, and Turlach BA (2000). On the lasso and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.
  19. Schwarz G (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
  20. Simon N, Friedman J, and Hastie T (2013). A blockwise descent algorithm for group-penalized multiresponse and multinomial regression.
  21. Stone M (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society: Series B (Methodological), 39, 44-47.
  22. Tay JK, Narasimhan B, and Hastie T (2023). Elastic net regularization paths for all generalized linear models. Journal of Statistical Software, 106, 1-31.
  23. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
  24. Weingessel A (2011). quadprog: Functions to solve quadratic programming problems. R package version 1.5-4.
  25. Wright SJ (2015). Coordinate descent algorithms. Mathematical Programming, 151, 3-34.
  26. Xu Z, Figueiredo M, and Goldstein T (2017). Adaptive ADMM with spectral penalty parameter selection. In Artificial Intelligence and Statistics, 718-727.