As a subclass of non-negative data, semicontinuous data comprises a continuous distribution of positive values and a probability mass function of a nonnegligible number of zero values. We encounter semicontinuous data widely in various applications. Health care expenditures and medical costs are well-known applications (Duan
The presence of two heterogeneous distributions of semicontinuous data makes ordinary least squares (OLS) estimation biased and inefficient. The two-part model (TPM) has gained popularity as an alternative to OLS. The TPM separately models the binary response (either zero or positive) and the mean response given that it is positive (Duan
Penalized (or regularized) regressions are useful for high-dimensional data in which only a few predictors may contribute to modeling the response. Penalized regressions simultaneously perform variable selection and parameter estimation. Popular penalized regression methods based on soft thresholding include the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), elastic net (ENET) (Zou and Hastie, 2005), adaptive LASSO (Zou, 2006), group LASSO (Yuan and Lin, 2006), Dantzig selector (Candes and Tao, 2007), smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), and minimax concave penalty (MCP) (Zhang, 2010). Since recent variations of penalized regressions (Hao
The TPM offers several intriguing characteristics. First, the two models for the binary outcome and the continuous positive response utilize the same predictor variables. However, the significance of individual predictors may differ between the two response models. Second, modeling the binary response uses a larger number of observations than modeling the continuous positive response. In other words, not only the entire sample size but also the proportion of positive values matters when fitting the TPM. The conditional linear model for the continuous positive response may suffer from a smaller sample size. Last, the marginal mean of the response is a function of the probability that the response is positive and the conditional mean of the positive responses. These characteristics deserve thorough investigation while performing variable selection and prediction in the TPM.
The simulation study probes the effects of various statistical scenarios in controlled settings. Our empirical study considers the prediction of crime incidents. In recent years, there have been significant advances in predictive modeling of crime incidents because of its societal implications and importance. Crime prediction has employed a broad spectrum of machine learning algorithms and data sources. These prediction algorithms encompass nonparametric regression, support vector machines, and deep neural networks, and the crime data include spatiotemporal data, social media data, and community-based data (Kang and Kang, 2017 and references therein). Our predictive modeling incorporates penalized regression using community-based data. The regression approach has the advantage of identifying and interpreting the predictors associated with crime incidents.
The remaining sections are organized as follows. Section 2 briefly introduces the two-part model and its mean squared error. Section 3 describes the regularized regression-based variable selection and estimation methods. Section 4 describes the design and results of the simulation study using these methods. Section 5 presents the empirical data analysis using the community-based crime data. Last, Section 6 discusses the findings and related issues from both the simulation study and the empirical analysis, concludes the study, and suggests further research areas.
Let
Let
A conventional two-part model of
where
The logistic regression in (
where the error terms
The likelihood function of the two-part model in (
TPM. The log-likelihood function of TPM, denoted
The expected value of the response variable,
Plugging (
As mentioned in the introduction, the expected value in (
where
which will be used to evaluate prediction performance in the following sections.
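As a minimal sketch of how the marginal mean of the two-part model and its mean squared error could be computed, the snippet below multiplies the probability of a positive response by the conditional mean of the positive part. The names `beta` and `gamma`, the helper functions, and the use of an untransformed (identity-scale) linear model for the positive part are assumptions made for illustration; a log-scale model would need an additional retransformation step that is not shown here.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(x, coef):
    """Inner product of a predictor vector and a coefficient vector."""
    return sum(xi * ci for xi, ci in zip(x, coef))


def tpm_mean(x, beta, gamma):
    """Marginal mean: P(Y > 0 | x) times E[Y | Y > 0, x].

    beta  : hypothetical logistic-regression coefficients (binary part)
    gamma : hypothetical linear-regression coefficients (positive part)
    """
    prob_positive = sigmoid(dot(x, beta))
    conditional_mean = dot(x, gamma)  # assumes an identity-scale model
    return prob_positive * conditional_mean


def mean_squared_error(y, y_hat):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


# Usage: the first element of x plays the role of an intercept.
x = [1.0, 0.0]
prediction = tpm_mean(x, beta=[0.0, 1.0], gamma=[2.0, 3.0])
```

With a zero linear predictor in the binary part, the probability of a positive response is 0.5, so the prediction is half of the conditional mean.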
In this section, we will discuss how we select and estimate and , each of which are subvectors of
For the purpose of variable selection and prediction via penalized regression, we consider a penalized log-likelihood of TPM, denoted
where the nonnegative penalty functions are defined as
in which the tuning parameter
where the value of
where
Although the LASSO regression has substantial advantages, it yields biased estimates under certain conditions (Zou, 2006). The solution of the bridge penalty is only continuous if
for
for
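For reference, the LASSO, SCAD, and MCP penalties can be written as short functions of a single coefficient. This is a sketch based on the standard forms in Tibshirani (1996), Fan and Li (2001), and Zhang (2010); the argument names `lam`, `a`, and `gamma` and the default values are illustrative assumptions and may differ from the notation and settings used in this paper.

```python
def lasso_penalty(t, lam):
    """L1 penalty: grows linearly in |t| without bound."""
    return lam * abs(t)


def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (requires a > 2): linear near zero, quadratic in the
    middle range, and constant beyond a * lam."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0


def mcp_penalty(t, lam, gamma=3.0):
    """MCP penalty (requires gamma > 1): concave up to gamma * lam,
    constant afterwards."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return gamma * lam * lam / 2.0


# Usage: for a large coefficient the nonconvex penalties level off,
# whereas the LASSO penalty keeps growing.
large = 10.0
values = (lasso_penalty(large, 1.0), scad_penalty(large, 1.0), mcp_penalty(large, 1.0))
```

The flat tails of SCAD and MCP are what reduce the shrinkage applied to large coefficients relative to the LASSO.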
The original LASSO solution employed quadratic programming in Tibshirani (1996). Efron
The regularized regression method simultaneously achieves variable selection and parameter estimation, which are encompassed in the oracle property that an oracle estimator must hold asymptotic consistency in both variable selection and parameter estimation. Fan and Li (2001) and Zou (2006) posited that a good selection and estimation procedure should hold these two oracle properties. If
Numerical optimization is a serious challenge in regularized regression. Tibshirani (1996) used quadratic programming (QP) with a convex constraint, a special case of convex optimization, to find the LASSO solution. The LARS algorithm is another technique that finds the piecewise linear solution path; it is a homotopy method in the sense that the path is constructed sequentially. Meanwhile, SCAD and MCP pose a more serious challenge because of the nonconvexity of their penalties. The local linear approximation (LLA) algorithm using the LARS algorithm was proposed to find the solution for SCAD and MCP (Zou and Li, 2008). Regardless of the convexity or nonconvexity of the penalties, the CDA method was proposed as a fast and computationally efficient optimization method. The CDA method optimizes the objective function with respect to a single parameter while holding all other parameters fixed, and iteratively cycles through all parameters until convergence. The computational efficiency of CDA is
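The cyclic single-parameter update described above can be sketched for the simplest case, a LASSO-penalized linear regression. The function names, the objective scaling (1/(2n) times the residual sum of squares plus `lam` times the L1 norm), and the stopping rule are assumptions for illustration only; this is not the implementation used in the study, and the logistic part and the nonconvex penalties would need modified updates.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, max_iter=200, tol=1e-10):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 one coordinate at a time.

    X is a list of rows and y a list of responses (no intercept is fitted).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta for the starting value beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the partial residual (beta_j excluded).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_value = soft_threshold(rho, lam) / col_sq[j]
            delta = new_value - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta  # keep the residual in sync
                beta[j] = new_value
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Usage on a tiny orthogonal design: the weak second predictor is set to zero.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
beta_hat = lasso_coordinate_descent(X, y, lam=0.2)
```

Each coordinate update has a closed form, which is why the method needs no matrix inversion and scales to a large number of predictors.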
The goal of parameter estimation is to achieve an (asymptotically) unbiased and consistent estimator. However, the goal of variable selection is more complex. The possible goals of variable selection include sensitivity, specificity, predictability, and selection consistency (Dziak
Our simulation study aimed to investigate the performance of selected methods described in the previous section under various situations. The selected methods included LASSO, ENET, SCAD, and MCP. If
The design of the simulation study for the two-part model considered the following issues. First, we considered two different sample sizes,
The first parameter space comprises a few strong coefficients, a setting used in the seminal LASSO paper (Tibshirani, 1996) and many other studies (Fan and Li, 2001; Zou and Hastie, 2005). The second parameter space was introduced to investigate how the selected penalized regression methods perform for diverse coefficient values.
Third, the covariance structure among predictors is one of the most important factors affecting the variable selection performance. We considered the independent and AR(1) covariance structures as follows:
Independence: ∑
AR(1) correlation:
Our simulation study used that
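The two covariance structures can be illustrated with a short sketch. It assumes the usual AR(1) form in which the covariance between predictors i and j is `rho` raised to the power |i - j|; the value `rho = 0.5` and all function names are placeholders chosen for illustration, not the settings of the simulation study.

```python
import math
import random


def independent_cov(p):
    """Identity covariance matrix: predictors are mutually uncorrelated."""
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def ar1_cov(p, rho):
    """AR(1) covariance matrix with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def draw_ar1_predictors(p, rho, rng):
    """Draw one predictor vector with unit variances and AR(1) correlation
    using the recursion x_j = rho * x_{j-1} + sqrt(1 - rho^2) * e_j."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(1, p):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


# Usage with an illustrative (assumed) correlation parameter.
sigma = ar1_cov(3, 0.5)
row = draw_ar1_predictors(5, 0.5, random.Random(0))
```

Under the AR(1) structure, neighboring predictors are the most strongly correlated and the correlation decays geometrically with the distance between indices.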
We focus on the two types of performance evaluation: Variable selection and prediction performances. These performance measures are closely related to the oracle properties. The variable selection performance is measured by
where is either or described in section 2, and for the estimated parameter space
The prediction performance of the GLM and the LM is measured separately: accuracy, sensitivity, and specificity for the GLM, and the mean squared prediction error (MSPE) for the LM. The GLM metrics assess the classification between zero and positive values. The mean absolute deviation serves as an alternative to the MSPE. All metrics are evaluated on the test data set. A lower MSPE indicates better prediction performance of the LM, whereas values of accuracy, sensitivity, and specificity closer to one indicate better classification performance of the GLM.
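The evaluation metrics can be sketched as follows. The variable-selection rates use the common definitions (the share of truly nonzero coefficients that are selected, and the share of truly zero coefficients that are excluded), which is an assumption about the exact formulas used here; all function names are hypothetical.

```python
def selection_rates(true_coef, est_coef):
    """Variable-selection sensitivity and specificity from coefficient vectors."""
    pairs = list(zip(true_coef, est_coef))
    n_nonzero = sum(1 for t, _ in pairs if t != 0)
    n_zero = len(pairs) - n_nonzero
    true_pos = sum(1 for t, e in pairs if t != 0 and e != 0)
    true_neg = sum(1 for t, e in pairs if t == 0 and e == 0)
    sensitivity = true_pos / n_nonzero if n_nonzero else float("nan")
    specificity = true_neg / n_zero if n_zero else float("nan")
    return sensitivity, specificity


def classification_metrics(is_positive, predicted_positive):
    """Accuracy, sensitivity, and specificity for zero-versus-positive labels."""
    pairs = list(zip(is_positive, predicted_positive))
    tp = sum(1 for a, b in pairs if a and b)
    tn = sum(1 for a, b in pairs if not a and not b)
    fp = sum(1 for a, b in pairs if not a and b)
    fn = sum(1 for a, b in pairs if a and not b)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def mspe(y, y_hat):
    """Mean squared prediction error on held-out data."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mean_absolute_deviation(y, y_hat):
    """Mean absolute prediction error, an alternative to the MSPE."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


# Usage on toy inputs.
sel = selection_rates([3.0, 1.5, 0.0, 0.0, 2.0], [2.9, 0.0, 0.0, 0.1, 1.8])
cls = classification_metrics([False, False, True, True], [False, True, True, True])
```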
Tables 1
As evident in Table 1, all the methods achieved better
For the parameter space P2, which contains coefficients generated from the uniform distribution between 0.5 and 2, the selected methods overall performed worse in both variable selection and prediction, as can be seen in Tables 3 and 4. Unlike the results of P1, P2 demonstrated that the LASSO and ENET methods outperform the SCAD and MCP methods in terms of
Our simulation results showed that penalized TPMs achieve better performance in both variable selection and prediction in high-dimensional data, regardless of covariance structures and sample sizes. The selected nonconvex methods such as SCAD and MCP exhibited better performance in
Our empirical study for the TPM was conducted using community-based crime data. Crime data can be collected in many ways. One common way is to collect crime data for each community, such as a city or town. Each community possesses idiosyncratic characteristics with respect to demographics and socioeconomic status. Redmond and Baveja (2002) compiled a comprehensive community-based crime data set, which is available at the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime). The data set contains 2,215 observations (communities), 124 predictors, and 18 response variables (9 different types of crimes with the original frequency and the frequency per 100,000 inhabitants) from multiple original data sources: socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS), and crime data from the 1995 FBI Uniform Crime Report (UCR) Statistics.
The Census data include age-, race-, income-, family-, and house-related variables. The UCR data contained the original counts and the counts per 100,000 inhabitants for murder, rape, robbery, assault, burglary, larceny, auto theft, and arson. The LEMAS data collected policing-related data from state and local law enforcement agencies, including all those that employ 100 or more sworn police officers and a nationally representative sample of smaller agencies. Many communities with a small number of sworn officers had missing values for the predictors from the LEMAS data.
Among the 9 types of crime, murder, arson, and rape showed semicontinuous distributions: zero values accounted for more than 10% of the observations, and the distributions were right-skewed, with the mean much greater than the median. This empirical study focused on the murder incidents per 100,000 inhabitants, whose semicontinuous property is demonstrated in Figure 1. The right panel of Figure 1 illustrates the zero values and the bell-shaped curve of the log-transformed positive values, which were modeled via the TPM. After removing the 23 LEMAS variables and any communities with a significant amount of missing values, our final analytical data set consisted of one dependent variable and 101 predictors in 1,901 communities.
To check multicollinearity among the 101 predictors, we first identified perfect linear dependence (and hence, multicollinearity) between OwnOccQrange and OwnOccLowQuart/OwnOccHiQuart as well as between RentQrange and RentLowQ/RentHighQ, using the alias function on an lm fit in R. We therefore removed two variables, OwnOccQrange and RentQrange. We then examined multicollinearity among the remaining 99 predictors using the variance inflation factor (VIF). The VIF analysis showed that 83 of the 99 predictors had a squared VIF value greater than 2, which indicates that the community-based crime data presented severe multicollinearity among predictors. In summary, the community-based crime data were characterized by right-skewed, semicontinuous responses and a fairly large number of predictors with multicollinearity. We identified a set of predictors via the various variable selection methods and evaluated their prediction performance as described in the previous sections. We split the whole data set into training and test data sets with a 1 : 1 ratio, which resulted in slightly different sample sizes for the linear regression in the training and test data sets.
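The VIF check can be sketched in plain Python, using the fact that the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictor correlation matrix. This is an illustrative stand-in for the R workflow mentioned above, not a reproduction of it; the function names are hypothetical, and a perfectly collinear set of predictors (such as a range variable together with its two quartiles) is reported as a singular matrix.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of predictor columns.

    Assumes no column is constant (a constant column has zero variance).
    """
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect collinearity among predictors")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Usage: two uncorrelated predictors have VIF equal to one.
vif_uncorrelated = variance_inflation_factors([[1.0, -1.0, 1.0, -1.0],
                                               [1.0, 1.0, -1.0, -1.0]])
```

For two predictors with correlation r, both VIF values equal 1 / (1 - r^2), so the values grow without bound as the predictors approach perfect collinearity.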
Table 5 reports the MSPE and the number of predictors selected by each method. This result should be interpreted cautiously because a different split into training and test data may lead to a different result. Overall, the folded concave penalty methods, SCAD and MCP, outperformed the other methods with respect to the MSPE. The MSPE of MCP is about 3% lower than that of ENET. The AIC and ENET methods tend to select a larger number of predictors, and BIC and MCP tend to select a smaller number of predictors, which is consistent with the simulation results reported in the previous section.
Figure 2 presents two Venn diagrams of the predictors selected by the four penalized methods in the GLM and the LM in Table 5, to look closely at the patterns among the selected predictors. The LASSO predictors were mostly selected by the ENET method, and the MCP predictors were mostly selected by SCAD. The LASSO and ENET selected more predictors than SCAD and MCP in both models. In each of the GLM and the LM, the four penalized methods had seven predictors in common, although the compositions differed between the two models. The variables commonly selected by the four methods in the logistic regression were population, blackPerCap, PersPerOwnOccHous, PersPerRentOccHous, MedNumBR, LandArea, and racePctWhite. The variables commonly selected in the linear regression were PersPerFam, NumKidsBornNeverMar, NumImmig, PersPerOwnOccHous, PopDens, racePctWhite, and PctWorkMomYoungKids. Two predictors, PersPerOwnOccHous (mean persons per owner occupied household) and racePctWhite (percentage of population that is Caucasian), were common to both models. For further information, refer to the dictionary of these predictors at the UCI Machine Learning Repository mentioned above.
In this study, we investigated penalized regression-based variable selection methods for two-part models. We conducted simulation studies under diverse statistical assumptions and an empirical study using community-based crime data. Our analytical results demonstrated that penalized TPMs achieve better performance in both variable selection and prediction in high-dimensional data, regardless of covariance structures and sample sizes. Moreover, the LASSO-type methods such as LASSO and ENET outperformed the nonconvex methods such as SCAD and MCP in mean squared error. In simulation studies, for a small number of predictors, for example,
Variable selection in the TPM is affected by several unique features in addition to conventional matters, such as high dimensionality, multicollinearity, covariance structure, and sparsity. These TPM-specific features include the same pool of predictors for both models in (
The simulation study also showed that the best variable selection may not be associated with the lowest prediction error. The trade-off between consistent variable selection and efficient prediction is well addressed in Ng (2013) and references therein. The convex penalty methods and the folded concave penalty methods behave differently in sensitivity and specificity. When the coefficients are quite different from zero as in Table 1, both types of penalty methods performed well in sensitivity. On the other hand, when coefficients were generated from Uniform[0.25, 2] as in Tables 3 and 4, the two types of methods exhibited similar sensitivities, or the convex methods performed marginally better. The folded concave methods outperformed the convex penalty methods in specificity regardless of parameter spaces and covariance structures. This result arises partly because the convex penalty methods are inclined to include more predictors with non-zero coefficient estimates, as explained in Breheny and Huang (2011).
Our current study can be extended in several directions. First, as we only considered the selected number of methods, it is worthwhile to consider recently developed variable selection methods especially for high dimensional data such as interaction selection (Hao
Table 1. Variable selection of parameter space P1 in the two-part model for two covariance structures
Covariance | n | p | Method | Logistic regression: sensitivity | Logistic regression: specificity | Linear regression: sensitivity | Linear regression: specificity |
---|---|---|---|---|---|---|---|
Independent | 500 | 20 | AIC | 1(0) | 0.832(0.007) | 0.838(0.016) | 0.803(0.008) |
Independent | 500 | 20 | BIC | 1(0) | 0.984(0.002) | 0.983(0.006) | 0.979(0.002) |
Independent | 500 | 20 | LASSO | 1(0) | 0.774(0.021) | 1(0) | 0.590(0.014) |
Independent | 500 | 20 | ENET | 1(0) | 0.720(0.021) | 1(0) | 0.443(0.013) |
Independent | 500 | 20 | SCAD | 1(0) | 0.938(0.007) | 1(0) | 0.914(0.010) |
Independent | 500 | 20 | MCP | 1(0) | 0.964(0.006) | 1(0) | 0.957(0.007) |
Independent | 500 | 100 | LASSO | 1(0) | 0.930(0.007) | 1(0) | 0.828(0.006) |
Independent | 500 | 100 | ENET | 1(0) | 0.896(0.008) | 1(0) | 0.743(0.007) |
Independent | 500 | 100 | SCAD | 1(0) | 0.975(0.003) | 1(0) | 0.961(0.003) |
Independent | 500 | 100 | MCP | 1(0) | 0.989(0.002) | 1(0) | 0.982(0.002) |
Independent | 500 | 1000 | LASSO | 1(0) | 0.992(0.001) | 1(0) | 0.977(0.001) |
Independent | 500 | 1000 | ENET | 1(0) | 0.990(0.001) | 1(0) | 0.968(0.001) |
Independent | 500 | 1000 | SCAD | 1(0) | 0.994(0.000) | 1(0) | 0.990(0.001) |
Independent | 500 | 1000 | MCP | 1(0) | 0.998(0.000) | 1(0) | 0.998(0.000) |
Independent | 1000 | 20 | AIC | 1(0) | 0.834(0.006) | 0.840(0.015) | 0.812(0.008) |
Independent | 1000 | 20 | BIC | 1(0) | 0.989(0.002) | 0.993(0.003) | 0.985(0.002) |
Independent | 1000 | 20 | LASSO | 1(0) | 0.806(0.019) | 1(0) | 0.587(0.014) |
Independent | 1000 | 20 | ENET | 1(0) | 0.732(0.020) | 1(0) | 0.428(0.013) |
Independent | 1000 | 20 | SCAD | 1(0) | 0.952(0.007) | 1(0) | 0.949(0.008) |
Independent | 1000 | 20 | MCP | 1(0) | 0.954(0.007) | 1(0) | 0.960(0.007) |
Independent | 1000 | 100 | LASSO | 1(0) | 0.914(0.009) | 1(0) | 0.849(0.006) |
Independent | 1000 | 100 | ENET | 1(0) | 0.888(0.010) | 1(0) | 0.766(0.006) |
Independent | 1000 | 100 | SCAD | 1(0) | 0.981(0.003) | 1(0) | 0.976(0.003) |
Independent | 1000 | 100 | MCP | 1(0) | 0.990(0.002) | 1(0) | 0.991(0.002) |
Independent | 1000 | 1000 | LASSO | 1(0) | 0.993(0.001) | 1(0) | 0.983(0.001) |
Independent | 1000 | 1000 | ENET | 1(0) | 0.988(0.001) | 1(0) | 0.972(0.001) |
Independent | 1000 | 1000 | SCAD | 1(0) | 0.997(0.000) | 1(0) | 0.996(0.001) |
Independent | 1000 | 1000 | MCP | 1(0) | 0.999(0.000) | 1(0) | 0.999(0.000) |
AR(1) | 500 | 20 | AIC | 1(0) | 0.810(0.007) | 0.817(0.016) | 0.801(0.008) |
AR(1) | 500 | 20 | BIC | 1(0) | 0.987(0.002) | 0.978(0.006) | 0.979(0.003) |
AR(1) | 500 | 20 | LASSO | 1(0) | 0.814(0.017) | 1(0) | 0.669(0.013) |
AR(1) | 500 | 20 | ENET | 1(0) | 0.694(0.021) | 1(0) | 0.538(0.013) |
AR(1) | 500 | 20 | SCAD | 1(0) | 0.921(0.007) | 1(0) | 0.916(0.008) |
AR(1) | 500 | 20 | MCP | 1(0) | 0.970(0.005) | 1(0) | 0.953(0.007) |
AR(1) | 500 | 100 | LASSO | 1(0) | 0.938(0.008) | 1(0) | 0.868(0.005) |
AR(1) | 500 | 100 | ENET | 1(0) | 0.910(0.008) | 1(0) | 0.804(0.006) |
AR(1) | 500 | 100 | SCAD | 1(0) | 0.957(0.003) | 1(0) | 0.955(0.003) |
AR(1) | 500 | 100 | MCP | 1(0) | 0.984(0.002) | 1(0) | 0.985(0.002) |
AR(1) | 500 | 1000 | LASSO | 1(0) | 0.992(0.001) | 1(0) | 0.985(0.001) |
AR(1) | 500 | 1000 | ENET | 1(0) | 0.992(0.001) | 1(0) | 0.975(0.001) |
AR(1) | 500 | 1000 | SCAD | 1(0) | 0.992(0.000) | 1(0) | 0.991(0.001) |
AR(1) | 500 | 1000 | MCP | 1(0) | 0.998(0.000) | 1(0) | 0.998(0.000) |
AR(1) | 1000 | 20 | AIC | 1(0) | 0.820(0.007) | 0.832(0.015) | 0.807(0.008) |
AR(1) | 1000 | 20 | BIC | 1(0) | 0.989(0.002) | 0.988(0.005) | 0.986(0.002) |
AR(1) | 1000 | 20 | LASSO | 1(0) | 0.797(0.019) | 1(0) | 0.657(0.013) |
AR(1) | 1000 | 20 | ENET | 1(0) | 0.672(0.020) | 1(0) | 0.523(0.013) |
AR(1) | 1000 | 20 | SCAD | 1(0) | 0.945(0.007) | 1(0) | 0.943(0.009) |
AR(1) | 1000 | 20 | MCP | 1(0) | 0.969(0.005) | 1(0) | 0.967(0.007) |
AR(1) | 1000 | 100 | LASSO | 1(0) | 0.923(0.008) | 1(0) | 0.872(0.006) |
AR(1) | 1000 | 100 | ENET | 1(0) | 0.896(0.008) | 1(0) | 0.811(0.006) |
AR(1) | 1000 | 100 | SCAD | 1(0) | 0.970(0.003) | 1(0) | 0.975(0.003) |
AR(1) | 1000 | 100 | MCP | 1(0) | 0.987(0.002) | 1(0) | 0.991(0.002) |
AR(1) | 1000 | 1000 | LASSO | 1(0) | 0.995(0.001) | 1(0) | 0.984(0.001) |
AR(1) | 1000 | 1000 | ENET | 1(0) | 0.993(0.001) | 1(0) | 0.976(0.001) |
AR(1) | 1000 | 1000 | SCAD | 1(0) | 0.995(0.000) | 1(0) | 0.995(0.000) |
AR(1) | 1000 | 1000 | MCP | 1(0) | 0.998(0.000) | 1(0) | 0.999(0.000) |
The table reports the simulation mean (standard error) based on 200 iterations.
Table 2. Prediction performance of parameter space P1 for two covariance structures
Covariance | n | p | Method | MSE (positive) | MSE (all) | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|---|---|---|---|
Independent | 500 | 20 | AIC | 23.08(0.21) | 23.83(0.22) | 0.874(0.001) | 0.852(0.002) | 0.889(0.001) |
Independent | 500 | 20 | BIC | 22.68(0.21) | 23.24(0.21) | 0.877(0.001) | 0.857(0.002) | 0.892(0.001) |
Independent | 500 | 20 | LASSO | 23.17(0.20) | 24.12(0.19) | 0.876(0.001) | 0.865(0.002) | 0.884(0.001) |
Independent | 500 | 20 | ENET | 23.21(0.20) | 24.16(0.19) | 0.874(0.001) | 0.871(0.002) | 0.877(0.002) |
Independent | 500 | 20 | SCAD | 24.04(0.19) | 25.24(0.19) | 0.878(0.001) | 0.858(0.002) | 0.893(0.001) |
Independent | 500 | 20 | MCP | 25.34(0.18) | 25.23(0.18) | 0.879(0.001) | 0.859(0.002) | 0.892(0.001) |
Independent | 500 | 100 | LASSO | 22.91(0.2) | 23.62(0.20) | 0.862(0.001) | 0.861(0.002) | 0.864(0.002) |
Independent | 500 | 100 | ENET | 22.87(0.21) | 23.56(0.20) | 0.859(0.001) | 0.876(0.002) | 0.851(0.002) |
Independent | 500 | 100 | SCAD | 23.77(0.18) | 24.83(0.17) | 0.867(0.001) | 0.845(0.002) | 0.881(0.001) |
Independent | 500 | 100 | MCP | 24.84(0.16) | 24.85(0.17) | 0.867(0.001) | 0.845(0.002) | 0.881(0.001) |
Independent | 500 | 1000 | LASSO | 23.55(0.22) | 24.16(0.21) | 0.860(0.001) | 0.865(0.002) | 0.859(0.002) |
Independent | 500 | 1000 | ENET | 23.52(0.22) | 24.08(0.22) | 0.852(0.001) | 0.885(0.002) | 0.837(0.002) |
Independent | 500 | 1000 | SCAD | 24.90(0.19) | 26.04(0.18) | 0.867(0.001) | 0.846(0.002) | 0.882(0.001) |
Independent | 500 | 1000 | MCP | 26.14(0.17) | 25.96(0.18) | 0.868(0.001) | 0.846(0.002) | 0.882(0.001) |
Independent | 1000 | 20 | AIC | 25.62(0.26) | 26.18(0.25) | 0.875(0.001) | 0.853(0.001) | 0.891(0.001) |
Independent | 1000 | 20 | BIC | 25.58(0.26) | 26.11(0.26) | 0.878(0.001) | 0.854(0.001) | 0.894(0.001) |
Independent | 1000 | 20 | LASSO | 27.19(0.23) | 28.40(0.23) | 0.877(0.001) | 0.863(0.001) | 0.886(0.001) |
Independent | 1000 | 20 | ENET | 27.08(0.23) | 28.30(0.23) | 0.876(0.001) | 0.867(0.001) | 0.882(0.001) |
Independent | 1000 | 20 | SCAD | 28.39(0.24) | 29.76(0.24) | 0.878(0.001) | 0.854(0.001) | 0.894(0.001) |
Independent | 1000 | 20 | MCP | 29.99(0.24) | 29.76(0.24) | 0.878(0.001) | 0.854(0.001) | 0.894(0.001) |
Independent | 1000 | 100 | LASSO | 25.70(0.20) | 26.59(0.19) | 0.868(0.001) | 0.858(0.001) | 0.874(0.001) |
Independent | 1000 | 100 | ENET | 25.61(0.21) | 26.44(0.20) | 0.867(0.001) | 0.868(0.002) | 0.867(0.001) |
Independent | 1000 | 100 | SCAD | 26.96(0.17) | 28.21(0.16) | 0.870(0.001) | 0.850(0.001) | 0.883(0.001) |
Independent | 1000 | 100 | MCP | 28.50(0.15) | 28.22(0.16) | 0.870(0.001) | 0.850(0.001) | 0.883(0.001) |
Independent | 1000 | 1000 | LASSO | 25.03(0.22) | 25.69(0.21) | 0.873(0.001) | 0.871(0.002) | 0.874(0.001) |
Independent | 1000 | 1000 | ENET | 24.97(0.22) | 25.54(0.22) | 0.868(0.001) | 0.885(0.002) | 0.860(0.001) |
Independent | 1000 | 1000 | SCAD | 26.56(0.17) | 27.72(0.16) | 0.875(0.001) | 0.856(0.002) | 0.888(0.001) |
Independent | 1000 | 1000 | MCP | 27.85(0.15) | 27.78(0.16) | 0.875(0.001) | 0.856(0.002) | 0.888(0.001) |
AR(1) | 500 | 20 | AIC | 26.66(0.23) | 27.54(0.24) | 0.884(0.001) | 0.866(0.002) | 0.898(0.001) |
AR(1) | 500 | 20 | BIC | 26.02(0.22) | 26.62(0.22) | 0.889(0.001) | 0.872(0.002) | 0.902(0.001) |
AR(1) | 500 | 20 | LASSO | 26.66(0.19) | 27.72(0.18) | 0.887(0.001) | 0.876(0.002) | 0.896(0.001) |
AR(1) | 500 | 20 | ENET | 26.67(0.19) | 27.72(0.18) | 0.886(0.001) | 0.876(0.002) | 0.893(0.002) |
AR(1) | 500 | 20 | SCAD | 27.42(0.18) | 28.64(0.17) | 0.889(0.001) | 0.873(0.002) | 0.902(0.001) |
AR(1) | 500 | 20 | MCP | 28.74(0.16) | 28.62(0.17) | 0.889(0.001) | 0.873(0.002) | 0.902(0.001) |
AR(1) | 500 | 100 | LASSO | 26.98(0.21) | 27.82(0.20) | 0.876(0.001) | 0.875(0.002) | 0.877(0.002) |
AR(1) | 500 | 100 | ENET | 26.86(0.22) | 27.67(0.20) | 0.873(0.001) | 0.885(0.002) | 0.867(0.002) |
AR(1) | 500 | 100 | SCAD | 28.09(0.19) | 29.30(0.18) | 0.880(0.001) | 0.867(0.002) | 0.889(0.001) |
AR(1) | 500 | 100 | MCP | 29.30(0.16) | 29.31(0.18) | 0.880(0.001) | 0.867(0.002) | 0.889(0.001) |
AR(1) | 500 | 1000 | LASSO | 27.19(0.22) | 27.88(0.21) | 0.880(0.001) | 0.876(0.002) | 0.884(0.002) |
AR(1) | 500 | 1000 | ENET | 27.12(0.22) | 27.74(0.22) | 0.875(0.001) | 0.888(0.002) | 0.868(0.002) |
AR(1) | 500 | 1000 | SCAD | 28.97(0.17) | 30.29(0.16) | 0.880(0.001) | 0.862(0.002) | 0.894(0.001) |
AR(1) | 500 | 1000 | MCP | 30.21(0.15) | 30.21(0.16) | 0.882(0.001) | 0.864(0.002) | 0.896(0.001) |
AR(1) | 1000 | 20 | AIC | 28.73(0.22) | 29.36(0.22) | 0.886(0.001) | 0.867(0.001) | 0.900(0.001) |
AR(1) | 1000 | 20 | BIC | 28.64(0.22) | 29.20(0.22) | 0.888(0.001) | 0.870(0.001) | 0.902(0.001) |
AR(1) | 1000 | 20 | LASSO | 30.20(0.19) | 31.43(0.19) | 0.886(0.001) | 0.873(0.001) | 0.896(0.001) |
AR(1) | 1000 | 20 | ENET | 30.16(0.20) | 31.36(0.19) | 0.886(0.001) | 0.875(0.001) | 0.894(0.001) |
AR(1) | 1000 | 20 | SCAD | 31.16(0.19) | 32.56(0.19) | 0.888(0.001) | 0.870(0.001) | 0.902(0.001) |
AR(1) | 1000 | 20 | MCP | 32.61(0.17) | 32.56(0.19) | 0.888(0.001) | 0.869(0.001) | 0.902(0.001) |
AR(1) | 1000 | 100 | LASSO | 30.09(0.21) | 31.24(0.20) | 0.883(0.001) | 0.871(0.001) | 0.892(0.001) |
AR(1) | 1000 | 100 | ENET | 29.90(0.21) | 30.98(0.20) | 0.881(0.001) | 0.877(0.001) | 0.885(0.001) |
AR(1) | 1000 | 100 | SCAD | 31.77(0.18) | 33.24(0.18) | 0.884(0.001) | 0.865(0.001) | 0.898(0.001) |
AR(1) | 1000 | 100 | MCP | 33.43(0.17) | 33.25(0.18) | 0.884(0.001) | 0.866(0.001) | 0.898(0.001) |
AR(1) | 1000 | 1000 | LASSO | 29.50(0.26) | 30.36(0.25) | 0.888(0.001) | 0.880(0.001) | 0.894(0.001) |
AR(1) | 1000 | 1000 | ENET | 29.32(0.26) | 30.07(0.25) | 0.885(0.001) | 0.888(0.001) | 0.884(0.001) |
AR(1) | 1000 | 1000 | SCAD | 31.15(0.24) | 32.48(0.23) | 0.889(0.001) | 0.874(0.001) | 0.901(0.001) |
AR(1) | 1000 | 1000 | MCP | 32.66(0.23) | 32.51(0.23) | 0.889(0.001) | 0.874(0.001) | 0.901(0.001) |
The table reports the simulation mean (standard error) based on 200 iterations.
Table 3. Variable selection of parameter space P2 in the two-part model for two covariance structures
Covariance | n | p | Method | Logistic regression: sensitivity | Logistic regression: specificity | Linear regression: sensitivity | Linear regression: specificity |
---|---|---|---|---|---|---|---|
Independent | 500 | 20 | AIC | 0.944(0.004) | 0.815(0.012) | 0.823(0.008) | 0.809(0.013) |
Independent | 500 | 20 | BIC | 0.847(0.004) | 0.986(0.004) | 0.982(0.002) | 0.984(0.004) |
Independent | 500 | 20 | LASSO | 0.976(0.003) | 0.426(0.024) | 0.994(0.001) | 0.129(0.012) |
Independent | 500 | 20 | ENET | 0.981(0.003) | 0.388(0.024) | 0.994(0.001) | 0.098(0.011) |
Independent | 500 | 20 | SCAD | 0.959(0.004) | 0.663(0.021) | 0.965(0.003) | 0.485(0.022) |
Independent | 500 | 20 | MCP | 0.941(0.005) | 0.758(0.020) | 0.954(0.004) | 0.549(0.026) |
Independent | 500 | 100 | LASSO | 0.928(0.005) | 0.801(0.011) | 0.941(0.004) | 0.579(0.007) |
Independent | 500 | 100 | ENET | 0.930(0.005) | 0.772(0.011) | 0.946(0.003) | 0.514(0.007) |
Independent | 500 | 100 | SCAD | 0.906(0.004) | 0.909(0.003) | 0.913(0.004) | 0.851(0.004) |
Independent | 500 | 100 | MCP | 0.876(0.005) | 0.961(0.002) | 0.879(0.004) | 0.933(0.003) |
Independent | 500 | 1000 | LASSO | 0.814(0.005) | 0.982(0.001) | 0.720(0.009) | 0.952(0.002) |
Independent | 500 | 1000 | ENET | 0.816(0.005) | 0.978(0.001) | 0.691(0.010) | 0.951(0.002) |
Independent | 500 | 1000 | SCAD | 0.850(0.004) | 0.981(0.000) | 0.843(0.005) | 0.968(0.001) |
Independent | 500 | 1000 | MCP | 0.815(0.004) | 0.995(0.000) | 0.805(0.004) | 0.990(0.000) |
Independent | 1000 | 20 | AIC | 0.977(0.002) | 0.817(0.012) | 0.832(0.007) | 0.842(0.012) |
Independent | 1000 | 20 | BIC | 0.916(0.003) | 0.991(0.003) | 0.982(0.002) | 0.986(0.004) |
Independent | 1000 | 20 | LASSO | 0.988(0.002) | 0.499(0.025) | 0.999(0.001) | 0.116(0.012) |
Independent | 1000 | 20 | ENET | 0.986(0.002) | 0.479(0.025) | 0.999(0.001) | 0.083(0.010) |
Independent | 1000 | 20 | SCAD | 0.991(0.002) | 0.595(0.022) | 0.986(0.002) | 0.503(0.022) |
Independent | 1000 | 20 | MCP | 0.980(0.003) | 0.676(0.024) | 0.976(0.003) | 0.605(0.025) |
Independent | 1000 | 100 | LASSO | 0.966(0.004) | 0.824(0.011) | 0.985(0.002) | 0.589(0.007) |
Independent | 1000 | 100 | ENET | 0.967(0.003) | 0.810(0.012) | 0.987(0.002) | 0.509(0.007) |
Independent | 1000 | 100 | SCAD | 0.959(0.003) | 0.901(0.004) | 0.957(0.003) | 0.875(0.005) |
Independent | 1000 | 100 | MCP | 0.941(0.004) | 0.955(0.003) | 0.930(0.004) | 0.943(0.003) |
Independent | 1000 | 1000 | LASSO | 0.881(0.005) | 0.983(0.001) | 0.872(0.004) | 0.945(0.001) |
Independent | 1000 | 1000 | ENET | 0.863(0.005) | 0.982(0.001) | 0.864(0.005) | 0.932(0.001) |
Independent | 1000 | 1000 | SCAD | 0.917(0.004) | 0.984(0.001) | 0.909(0.004) | 0.977(0.001) |
Independent | 1000 | 1000 | MCP | 0.893(0.004) | 0.995(0.000) | 0.874(0.004) | 0.994(0.000) |
AR(1) | 500 | 20 | AIC | 0.876(0.004) | 0.797(0.015) | 0.800(0.008) | 0.785(0.014) |
AR(1) | 500 | 20 | BIC | 0.789(0.003) | 0.976(0.006) | 0.972(0.004) | 0.971(0.006) |
AR(1) | 500 | 20 | LASSO | 0.952(0.004) | 0.652(0.025) | 0.979(0.003) | 0.282(0.017) |
AR(1) | 500 | 20 | ENET | 0.980(0.003) | 0.617(0.027) | 0.982(0.002) | 0.231(0.016) |
AR(1) | 500 | 20 | SCAD | 0.899(0.005) | 0.688(0.020) | 0.931(0.004) | 0.549(0.022) |
AR(1) | 500 | 20 | MCP | 0.863(0.006) | 0.776(0.020) | 0.919(0.005) | 0.588(0.024) |
AR(1) | 500 | 100 | LASSO | 0.936(0.004) | 0.906(0.009) | 0.970(0.003) | 0.746(0.007) |
AR(1) | 500 | 100 | ENET | 0.977(0.002) | 0.907(0.008) | 0.978(0.002) | 0.704(0.007) |
AR(1) | 500 | 100 | SCAD | 0.85(0.004) | 0.902(0.003) | 0.875(0.005) | 0.863(0.004) |
AR(1) | 500 | 100 | MCP | 0.794(0.004) | 0.957(0.002) | 0.826(0.005) | 0.936(0.003) |
AR(1) | 500 | 1000 | LASSO | 0.905(0.005) | 0.989(0.001) | 0.929(0.004) | 0.964(0.001) |
AR(1) | 500 | 1000 | ENET | 0.950(0.004) | 0.989(0.001) | 0.950(0.004) | 0.956(0.001) |
AR(1) | 500 | 1000 | SCAD | 0.755(0.005) | 0.985(0.000) | 0.803(0.004) | 0.970(0.001) |
AR(1) | 500 | 1000 | MCP | 0.669(0.005) | 0.995(0.000) | 0.747(0.004) | 0.991(0.000) |
AR(1) | 1000 | 20 | AIC | 0.934(0.004) | 0.817(0.013) | 0.816(0.008) | 0.800(0.014) |
AR(1) | 1000 | 20 | BIC | 0.843(0.003) | 0.982(0.005) | 0.991(0.002) | 0.988(0.004) |
AR(1) | 1000 | 20 | LASSO | 0.986(0.002) | 0.639(0.025) | 0.992(0.002) | 0.302(0.018) |
AR(1) | 1000 | 20 | ENET | 0.996(0.001) | 0.589(0.026) | 0.994(0.001) | 0.264(0.017) |
AR(1) | 1000 | 20 | SCAD | 0.941(0.004) | 0.646(0.021) | 0.953(0.004) | 0.593(0.023) |
AR(1) | 1000 | 20 | MCP | 0.924(0.005) | 0.724(0.021) | 0.939(0.005) | 0.638(0.025) |
AR(1) | 1000 | 100 | LASSO | 0.973(0.003) | 0.885(0.009) | 0.989(0.002) | 0.745(0.006) |
AR(1) | 1000 | 100 | ENET | 0.995(0.001) | 0.870(0.010) | 0.992(0.002) | 0.691(0.006) |
AR(1) | 1000 | 100 | SCAD | 0.900(0.004) | 0.906(0.003) | 0.908(0.004) | 0.902(0.004) |
AR(1) | 1000 | 100 | MCP | 0.852(0.004) | 0.959(0.002) | 0.867(0.004) | 0.949(0.003) |
AR(1) | 1000 | 1000 | LASSO | 0.959(0.003) | 0.990(0.001) | 0.970(0.003) | 0.966(0.001) |
AR(1) | 1000 | 1000 | ENET | 0.985(0.002) | 0.992(0.001) | 0.984(0.002) | 0.956(0.001) |
AR(1) | 1000 | 1000 | SCAD | 0.841(0.004) | 0.980(0.001) | 0.866(0.004) | 0.978(0.001) |
AR(1) | 1000 | 1000 | MCP | 0.786(0.004) | 0.994(0.000) | 0.815(0.004) | 0.994(0.000) |
The table reports the simulation mean (standard error) based on 200 iterations.
Table 4. Prediction performance of parameter space P2 for two covariance structures
Covariance | n | p | Method | MSE (positive) | MSE (all) | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|---|---|---|---|
Independent | 500 | 20 | AIC | 30.26(0.31) | 31.04(0.31) | 0.890(0.001) | 0.878(0.002) | 0.899(0.001) |
Independent | 500 | 20 | BIC | 29.89(0.32) | 30.47(0.32) | 0.888(0.001) | 0.875(0.002) | 0.897(0.001) |
Independent | 500 | 20 | LASSO | 32.79(0.26) | 34.10(0.25) | 0.888(0.001) | 0.880(0.002) | 0.894(0.001) |
Independent | 500 | 20 | ENET | 32.79(0.26) | 34.14(0.25) | 0.889(0.001) | 0.885(0.002) | 0.892(0.001) |
Independent | 500 | 20 | SCAD | 33.41(0.26) | 34.74(0.26) | 0.889(0.001) | 0.876(0.002) | 0.899(0.001) |
Independent | 500 | 20 | MCP | 35.00(0.25) | 34.77(0.25) | 0.889(0.001) | 0.876(0.002) | 0.899(0.001) |
Independent | 500 | 100 | LASSO | 30.81(0.30) | 31.90(0.29) | 0.878(0.001) | 0.874(0.002) | 0.881(0.002) |
Independent | 500 | 100 | ENET | 30.74(0.30) | 31.82(0.29) | 0.875(0.001) | 0.879(0.002) | 0.873(0.002) |
Independent | 500 | 100 | SCAD | 33.80(0.30) | 35.29(0.30) | 0.883(0.001) | 0.868(0.002) | 0.895(0.001) |
Independent | 500 | 100 | MCP | 35.51(0.30) | 35.20(0.31) | 0.884(0.001) | 0.870(0.002) | 0.895(0.001) |
Independent | 500 | 1000 | LASSO | 30.74(0.26) | 31.44(0.26) | 0.844(0.001) | 0.864(0.002) | 0.834(0.002) |
Independent | 500 | 1000 | ENET | 30.58(0.27) | 31.22(0.26) | 0.836(0.001) | 0.878(0.003) | 0.815(0.002) |
Independent | 500 | 1000 | SCAD | 33.66(0.23) | 35.14(0.23) | 0.862(0.001) | 0.847(0.002) | 0.873(0.002) |
Independent | 500 | 1000 | MCP | 34.81(0.21) | 34.81(0.22) | 0.864(0.001) | 0.850(0.002) | 0.875(0.002) |
Independent | 1000 | 20 | AIC | 29.62(0.24) | 30.22(0.24) | 0.897(0.001) | 0.882(0.001) | 0.908(0.001) |
Independent | 1000 | 20 | BIC | 29.59(0.24) | 30.15(0.24) | 0.896(0.001) | 0.881(0.001) | 0.908(0.001) |
Independent | 1000 | 20 | LASSO | 31.62(0.21) | 32.97(0.20) | 0.897(0.001) | 0.886(0.001) | 0.904(0.001) |
Independent | 1000 | 20 | ENET | 31.62(0.21) | 32.97(0.20) | 0.896(0.001) | 0.889(0.001) | 0.902(0.001) |
Independent | 1000 | 20 | SCAD | 31.95(0.20) | 33.33(0.20) | 0.897(0.001) | 0.882(0.001) | 0.908(0.001) |
Independent | 1000 | 20 | MCP | 33.50(0.19) | 33.30(0.20) | 0.897(0.001) | 0.882(0.001) | 0.908(0.001) |
Independent | 1000 | 100 | LASSO | 30.05(0.23) | 30.95(0.22) | 0.886(0.001) | 0.875(0.001) | 0.894(0.001) |
Independent | 1000 | 100 | ENET | 30.09(0.23) | 30.99(0.22) | 0.885(0.001) | 0.880(0.001) | 0.889(0.001) |
Independent | 1000 | 100 | SCAD | 30.84(0.21) | 32.04(0.19) | 0.890(0.001) | 0.875(0.001) | 0.902(0.001) |
Independent | 1000 | 100 | MCP | 31.97(0.18) | 32.07(0.19) | 0.890(0.001) | 0.875(0.001) | 0.902(0.001) |
Independent | 1000 | 1000 | LASSO | 28.77(0.20) | 29.40(0.19) | 0.872(0.001) | 0.873(0.001) | 0.872(0.001) |
Independent | 1000 | 1000 | ENET | 28.78(0.20) | 29.38(0.20) | 0.868(0.001) | 0.883(0.001) | 0.859(0.001) |
Independent | 1000 | 1000 | SCAD | 30.08(0.15) | 31.35(0.14) | 0.881(0.001) | 0.864(0.001) | 0.894(0.001) |
Independent | 1000 | 1000 | MCP | 31.34(0.13) | 31.24(0.14) | 0.882(0.001) | 0.865(0.001) | 0.895(0.001) |
AR(1) | 500 | 20 | AIC | 45.94(0.43) | 47.15(0.44) | 0.914(0.001) | 0.907(0.001) | 0.920(0.001) |
AR(1) | 500 | 20 | BIC | 43.84(0.35) | 44.65(0.36) | 0.912(0.001) | 0.906(0.001) | 0.918(0.001) |
AR(1) | 500 | 20 | LASSO | 46.37(0.31) | 47.97(0.31) | 0.914(0.001) | 0.907(0.001) | 0.921(0.001) |
AR(1) | 500 | 20 | ENET | 46.22(0.31) | 47.92(0.31) | 0.915(0.001) | 0.910(0.001) | 0.921(0.001) |
AR(1) | 500 | 20 | SCAD | 47.20(0.31) | 48.92(0.31) | 0.913(0.001) | 0.906(0.001) | 0.919(0.001) |
AR(1) | 500 | 20 | MCP | 49.08(0.30) | 48.93(0.31) | 0.913(0.001) | 0.907(0.001) | 0.919(0.001) |
AR(1) | 500 | 100 | LASSO | 46.56(0.44) | 47.86(0.42) | 0.911(0.001) | 0.910(0.002) | 0.912(0.002) |
AR(1) | 500 | 100 | ENET | 46.42(0.43) | 47.76(0.42) | 0.911(0.001) | 0.913(0.002) | 0.911(0.002) |
AR(1) | 500 | 100 | SCAD | 49.63(0.49) | 51.37(0.49) | 0.906(0.001) | 0.900(0.002) | 0.911(0.002) |
AR(1) | 500 | 100 | MCP | 51.56(0.50) | 51.36(0.50) | 0.907(0.001) | 0.900(0.002) | 0.912(0.002) |
AR(1) | 500 | 1000 | LASSO | 42.57(0.27) | 43.49(0.26) | 0.904(0.001) | 0.912(0.002) | 0.899(0.002) |
AR(1) | 500 | 1000 | ENET | 42.40(0.28) | 43.27(0.26) | 0.906(0.001) | 0.920(0.002) | 0.896(0.002) |
AR(1) | 500 | 1000 | SCAD | 47.06(0.25) | 48.68(0.25) | 0.890(0.001) | 0.886(0.002) | 0.894(0.002) |
AR(1) | 500 | 1000 | MCP | 48.28(0.24) | 48.19(0.24) | 0.892(0.001) | 0.888(0.002) | 0.896(0.002) |
AR(1) | 1000 | 20 | AIC | 45.49(0.39) | 46.30(0.39) | 0.922(0.001) | 0.913(0.001) | 0.930(0.001) |
AR(1) | 1000 | 20 | BIC | 45.16(0.39) | 45.76(0.39) | 0.921(0.001) | 0.911(0.001) | 0.930(0.001) |
AR(1) | 1000 | 20 | LASSO | 49.20(0.41) | 50.71(0.41) | 0.922(0.001) | 0.916(0.001) | 0.927(0.001) |
AR(1) | 1000 | 20 | ENET | 49.18(0.40) | 50.7(0.41) | 0.922(0.001) | 0.918(0.001) | 0.925(0.001) |
AR(1) | 1000 | 20 | SCAD | 49.79(0.41) | 51.36(0.41) | 0.922(0.001) | 0.912(0.001) | 0.930(0.001) |
AR(1) | 1000 | 20 | MCP | 51.33(0.4) | 51.37(0.41) | 0.922(0.001) | 0.912(0.001) | 0.930(0.001) |
AR(1) | 1000 | 100 | LASSO | 44.13(0.24) | 45.43(0.22) | 0.927(0.001) | 0.924(0.001) | 0.930(0.001) |
AR(1) | 1000 | 100 | ENET | 44.07(0.24) | 45.32(0.23) | 0.927(0.001) | 0.927(0.001) | 0.927(0.001) |
AR(1) | 1000 | 100 | SCAD | 46.24(0.21) | 47.82(0.20) | 0.926(0.001) | 0.917(0.001) | 0.933(0.001) |
AR(1) | 1000 | 100 | MCP | 47.91(0.19) | 47.84(0.19) | 0.926(0.001) | 0.917(0.001) | 0.933(0.001) |
AR(1) | 1000 | 1000 | LASSO | 45.54(0.29) | 46.65(0.28) | 0.914(0.001) | 0.909(0.001) | 0.919(0.001) |
AR(1) | 1000 | 1000 | ENET | 45.42(0.30) | 46.49(0.28) | 0.914(0.001) | 0.912(0.001) | 0.916(0.001) |
AR(1) | 1000 | 1000 | SCAD | 48.74(0.24) | 50.41(0.23) | 0.910(0.001) | 0.900(0.001) | 0.918(0.001) |
AR(1) | 1000 | 1000 | MCP | 50.57(0.22) | 50.33(0.23) | 0.912(0.001) | 0.902(0.001) | 0.920(0.001) |
The table reports the simulation mean (standard error) based on 200 iterations.
Table 5. MSPE and the number of selected predictors
Model | Sample size (train) | Sample size (test) | AIC | BIC | LASSO | ENET | SCAD | MCP |
---|---|---|---|---|---|---|---|---|
GLM | 941 | 941 | 30 | 10 | 23 | 44 | 9 | 7 |
LM | 492 | 530 | 38 | 4 | 30 | 34 | 25 | 22 |
MSPE | - | - | 77.48 | 61.15 | 45.28 | 44.74 | 43.26 | 43.37 |
* LASSO, least absolute shrinkage and selection operator; ENET, elastic net; MCP, minimax concave penalty; SCAD, smoothly clipped absolute deviation; MSPE, mean squared prediction error. The integer values under the selection methods (AIC through MCP) denote the number of predictors selected by each method.