Classification models based on receiver operating characteristic (ROC) curve analysis have been extended from the univariate to the multivariate setup by linearly combining multiple available markers. One such classification model is multivariate ROC (MROC) curve analysis. In practice, however, not all markers contribute to classification, and non-contributing markers may mask the contribution of the others when classifying individuals/objects. This paper addresses the issue by developing an algorithm that identifies the markers that are significant, true contributors. The proposed variable selection framework is supported by real datasets and a simulation study, and it is shown to provide insight into the significance of each individual marker in producing a classifier rule/linear combination that classifies well.
Practical problems in binary classification have led to statistical methodologies with a strong mathematical base. Individuals/objects can be classified using a single marker or a set of markers, and over the past seven decades many researchers have developed methodologies for both the univariate and the multivariate setup. The present paper is confined to one graphical tool that aids in allocating individuals to one of two known populations/groups, namely receiver operating characteristic (ROC) curve analysis. Many ideas have been proposed for ROC curve methodology under the univariate setup, among them Bamber (1975), Metz (1978), Hanley and McNeil (1982, 1983), Faraggi and Reiser (2002), Zhang (2006), Vishnu Vardhan and Sarma (2010), Balaswamy
Let
where
where
is the cut point obtained at optimal ‘
AUC lies between 0 and 1 and a test is said to be
Once the linear combination of markers is obtained, its validity can be checked using the concept of PLC proposed by Sameera and Vishnu Vardhan (2016). PLC is based on an
Here, ‘
In the usual variable selection algorithms, a single variable is first entered into the model and tested for significance. The same logic cannot be applied in the proposed stepwise algorithm because it is based on the concept of Precision, which depends on correlation measures. Hence, it is necessary to begin with a pair of markers (variables) instead of a single marker. The following steps detail how the algorithm is executed.
List out the
Compute the
In order to select the next marker that can be included into the model compute partial
Now, conducting a backward elimination step will help to remove any insignificant marker that is added to the model. To achieve this, compute the partial
Repeat steps 3 and 4 until no marker can be added to or removed from the model.
Figure 1 details the logical process of the proposed stepwise algorithm.
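The control flow of the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: the names `stepwise_select`, `pair_pvalue`, and `partial_pvalue` are hypothetical, the significance tests are passed in as callbacks because the test statistics are not reproduced here, and the 0.05 threshold, the rule of never shrinking below two markers, and the iteration cap are assumptions. The p-values in the usage part are invented toy numbers, not results from any dataset.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def stepwise_select(
    markers: Sequence[int],
    pair_pvalue: Callable[[Tuple[int, int]], float],
    partial_pvalue: Callable[[int, List[int]], float],
    alpha: float = 0.05,   # assumed significance level
    max_iter: int = 100,   # assumed guard against add/remove cycling
) -> List[int]:
    """Pair-initialised stepwise selection (illustrative sketch only)."""
    # Start from a pair: list every pair of markers, keep the most significant.
    pairs = list(combinations(markers, 2))
    best_pair = min(pairs, key=pair_pvalue)
    if pair_pvalue(best_pair) >= alpha:
        return []  # assumption: stop if no pair is significant
    model = list(best_pair)

    for _ in range(max_iter):
        changed = False
        # Forward step: test the best remaining marker for inclusion.
        candidates = [m for m in markers if m not in model]
        if candidates:
            best = min(candidates, key=lambda m: partial_pvalue(m, model + [m]))
            if partial_pvalue(best, model + [best]) < alpha:
                model.append(best)
                changed = True
        # Backward step: remove markers that are no longer significant.
        for m in list(model):
            # assumption: never shrink the model below two markers
            if len(model) > 2 and partial_pvalue(m, model) >= alpha:
                model.remove(m)
                changed = True
        # Stop when nothing was added or removed in this pass.
        if not changed:
            break
    return model


# Toy p-values, invented for illustration (NOT taken from any dataset).
toy_p = {2: 0.005, 3: 0.40, 4: 0.50, 5: 0.90, 6: 0.0001, 7: 0.001, 8: 0.0002}
markers = list(toy_p)
n_pairs = len(list(combinations(markers, 2)))  # 7C2 = 21 candidate pairs
selected = stepwise_select(
    markers,
    pair_pvalue=lambda pair: max(toy_p[pair[0]], toy_p[pair[1]]),
    partial_pvalue=lambda m, model: toy_p[m],
)
print(n_pairs, selected)  # 21 [6, 8, 7, 2]
```

Passing the tests in as callbacks keeps the selection loop separate from the choice of statistic, so the same loop can be reused with whatever partial test is appropriate for the model.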
Using the algorithm given in Figure 1, a subset of markers can be identified and linearly combined to arrive at a classification rule for classification of new individuals/objects. For illustration purposes in real datasets, the
The functionality of the proposed stepwise algorithm is demonstrated using three datasets by Statlog Heart data (Michie
A neonatal dataset is used to illustrate the iterative steps of the proposed stepwise algorithm. R code was developed to perform the stepwise algorithm, where the first column of the data frame holds the status of the individual and the subsequent columns contain the markers in the following order: 2 = ear, 3 = sitenum, 4 = currage, 5 = gender, 6 = DPOAE 65 at 2kHz (
Iteration 1: There are 7 markers in the study, so 7C2 = 21 pairwise combinations are listed. On computing the
Iteration 2: The process of forward selection is executed for the variables 2, 3, 4, 5, and 7. At this stage, the corresponding
Iteration 3: Backward elimination is executed in this iteration to examine if the included markers 7, 6, and 8 are significant enough to remain in the model. The partial
Iteration 4: The process of forward selection is repeated on markers 2, 3, 4, and 5, of which marker 2 is included in the model because of its significant partial
Iteration 5: This step verifies the significance of the included markers; all are found to be significant and are retained in the model.
Iteration 6: In this step, the significance of markers 3, 4, and 5 is tested for possible inclusion in the model. As none of these markers is significant, the iterative procedure terminates, and the final model contains markers 2, 7, 6, and 8, i.e., ear, DPOAE 65 at 2kHz (y1), TEOAE 80 at 2kHz (y2), and ABR (
In addition to the above described iterative procedure for the Neonatal dataset, the model’s significance is tested using PLC. The full model is observed to provide an accuracy of 68.39% but the linear combination obtained in this case is found to have an insignificant
The accuracy of this stepwise model is 92.59%, which, compared with the accuracy of the full model, leads to the conclusion that the stepwise model is sufficient for classification. In some situations, the linear combination is significant for the full model, yet the significance of the individual markers still requires investigation. The datasets from the Vertebral Column data are used to illustrate this scenario. For the Spondylolisthesis and Disk Hernia datasets, the linear combinations of the full model are significant, with accuracies of 94.77% and 89.61% respectively. However, 3 markers (LLA, PR, and GS) from Spondylolisthesis and 2 markers (PR and GS) from Disk Hernia are found to have insignificant partial
Table 1 provides the coefficients and partial
Simulation studies illustrate the algorithm proposed for variable selection and examine the sensitivity of
Table 3
In simulation 1, the full model with 7 markers is observed to have an insignificant
The present paper provides a variable selection procedure for a multivariate extension of the ROC curve technique known as MROC curve analysis. The idea is to identify a subset of markers that, when combined, provides true accuracy and a valid classifier rule/linear combination. The proposed stepwise methodology is supported by real datasets and a simulation study. Two cases are discussed: one in which the obtained linear combination is insignificant and a subset must be identified that provides a significant linear combination (Neonatal dataset and Heart dataset), and the other in which the linear combination is significant but contains insignificant markers that influence its performance (Spondylolisthesis and Disk Hernia datasets). The sensitivity of
The author, Sameera G, would like to acknowledge the Department of Science and Technology for supporting her research through a fellowship under the DST-INSPIRE programme (IF130958).
Coefficients and partial
Dataset | Full model | Stepwise model | ||||||
---|---|---|---|---|---|---|---|---|
Variables | Coefficients | Partial | sig. | Variables | Coefficients | Partial | sig. |
Neonatal ( |
Ear | −0.0963 | 0.4878 | 0.818 | Ear | −0.1100 | 4.3488 | 0.005 |
sitenum | 0.2264 | 5.6181 | 0.000 | 0.0258 | 267.4902 | 0.000 | ||
currage | 0.0295 | 5.6410 | 0.000 | 0.0405 | 268.1827 | 0.000 | ||
gender | −0.0350 | 0.2889 | 0.942 | 0.1574 | 7.0262 | 0.000 | ||
0.0410 | 29.7824 | 0.000 | ||||||
0.0210 | 29.8079 | 0.000 | ||||||
0.1817 | 1.3841 | 0.217 | ||||||
Model |
Model |
|||||||
Heart ( |
Age | −0.0220 | 0.2128 | 0.998 | nMBV | 1.2523 | 2.2158 | 0.042 |
rBP | 0.0187 | 0.0906 | 1.000 | CPT | 0.8904 | 3.2811 | 0.004 | |
SC | 0.0051 | 0.0712 | 1.000 | EIA | 1.3849 | 4.2655 | 0.000 | |
MHR | −0.0252 | 0.2792 | 0.992 | SPSTs | 0.5759 | 9.8091 | 0.000 | |
Oldpeak | 0.4243 | 0.3525 | 0.978 | Oldpeak | 0.5194 | 11.2808 | 0.000 | |
nMBV | 1.2687 | 0.1359 | 1.000 | Sex | 1.0877 | 3.1066 | 0.006 | |
thal | 0.5344 | 0.2190 | 0.997 | Thal | 0.5123 | 7.0360 | 0.000 | |
Sex | 1.3227 | 0.1355 | 1.000 | |||||
CPT | 0.8294 | 0.1260 | 1.000 | |||||
FBS | −0.7239 | 0.0349 | 1.000 | |||||
rECGr | 0.3584 | 0.0397 | 1.000 | |||||
EIA | 1.0912 | 0.1590 | 0.999 | |||||
SPSTs | 0.3983 | 0.3353 | 0.982 | |||||
Model |
Model |
|||||||
Spondylolisthesis ( |
PI | 5.3239 | 21972280 | 0.000 | PT | −12.3646 | 26133190 | 0.000 |
PT | −5.3321 | 8470798 | 0.000 | SS | −12.3594 | 37534480 | 0.000 | |
LLA | 0.0903 | 1.9174 | 0.092 | PI | 12.3776 | 67786610 | 0.000 | |
SS | −5.3588 | 12166540 | 0.000 | LLA | 0.0740 | 5.8392 | 0.000 | |
PR | −0.1123 | 0.5763 | 0.718 | GS | 0.1282 | 4.5012 | 0.002 | |
GS | 0.1006 | 1.5598 | 0.172 | |||||
Model |
Model |
|||||||
Disk Hernia ( |
PI | 14.8544 | 52420430 | 0.000 | LLA | −0.0321 | 18.3480 | 0.000 |
PT | −14.7483 | 19189290 | 0.000 | PT | −14.7275 | 24337630 | 0.000 | |
LLA | −0.0368 | 15.2383 | 0.000 | PI | 14.8343 | 66484970 | 0.000 | |
SS | −15.0232 | 35695070 | 0.000 | SS | −15.0035 | 45272420 | 0.000 | |
PR | −0.1483 | 2.1034 | 0.068 | PR | −0.1484 | 2.6678 | 0.034 | |
GS | 0.0356 | 0.8572 | 0.511 | |||||
Model |
Model |
Measures of MROC curve and optimal cutpoint - Real datasets
Dataset | Model | Opt | (1 − | AUC |
---|---|---|---|---|
Neonatal | Full | 0.6012 | (0.3673, 0.6327) | 0.6839 |
Neonatal | Stepwise | −1.2829 | (0.3838, 0.6162) | 0.6607 |
Heart | Full | 7.2733 | (0.1377, 0.8623) | 0.9366 |
Heart | Stepwise | 8.6494 | (0.1508, 0.8492) | 0.9259 |
Spondylolisthesis | Full | −9.1092 | (0.1057, 0.8943) | 0.9477 |
Spondylolisthesis | Stepwise | 6.0845 | (0.1259, 0.8741) | 0.9192 |
Disk Hernia | Full | −23.2065 | (0.1845, 0.8155) | 0.8961 |
Disk Hernia | Stepwise | −23.1334 | (0.1859, 0.8141) | 0.8949 |
MROC = multivariate receiver operating characteristic; AUC = area under the curve.
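The kinds of quantities reported in this table (a cut point on the combined score, the pair of false and true positive rates at that cut point, and the AUC) can be illustrated with a small Python sketch. It swaps in the nonparametric Mann-Whitney estimate of the AUC rather than the MROC model's own estimator, assumes that higher scores indicate the positive group, and uses invented toy scores, not study data. The function names `empirical_auc` and `rates_at_cutpoint` are hypothetical.

```python
from typing import Sequence, Tuple


def empirical_auc(healthy: Sequence[float], diseased: Sequence[float]) -> float:
    """Mann-Whitney estimate of P(diseased score > healthy score); ties count 1/2."""
    wins = sum(
        1.0 if d > h else 0.5 if d == h else 0.0
        for d in diseased
        for h in healthy
    )
    return wins / (len(healthy) * len(diseased))


def rates_at_cutpoint(
    healthy: Sequence[float], diseased: Sequence[float], c: float
) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) when score > c is called positive."""
    fpr = sum(h > c for h in healthy) / len(healthy)
    tpr = sum(d > c for d in diseased) / len(diseased)
    return fpr, tpr


# Toy combined scores, invented for illustration only.
healthy_scores = [0.1, 0.4, 0.35, 0.8]
diseased_scores = [0.9, 0.6, 0.5, 0.3]

auc = empirical_auc(healthy_scores, diseased_scores)
fpr, tpr = rates_at_cutpoint(healthy_scores, diseased_scores, c=0.45)
print(auc, fpr, tpr)  # 0.6875 0.25 0.75
```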
Coefficients and partial
Dataset | Simulation 1 ( | Simulation 2 ( | ||||||
---|---|---|---|---|---|---|---|---|
Variables | Coefficients | Partial | sig. | Variables | Coefficients | Partial | sig. |
Full Model | 0.2665 | 0.2134 | 0.973 | −0.1008 | 0.2621 | 0.954 | ||
0.0881 | 2.1755 | 0.043 | 0.2519 | 0.7317 | 0.624 | |||
0.0400 | 0.1913 | 0.979 | 0.0550 | 0.6900 | 0.658 | |||
−0.0148 | 2.2509 | 0.036 | 0.1015 | 0.1338 | 0.992 | |||
0.0253 | 12.8950 | 0.000 | 0.0299 | 6.2444 | 0.000 | |||
0.0338 | 12.9578 | 0.000 | 0.0152 | 6.5322 | 0.000 | |||
0.1607 | 0.7327 | 0.623 | 0.1947 | 0.7925 | 0.576 | |||
Model |
Model |
|||||||
Stepwise Model 1 ( |
0.0269 | 306.0346 | 0.000 | −0.0855 | 2.9122 | 0.033 | ||
0.0306 | 302.3586 | 0.000 | 0.0217 | 79.2230 | 0.000 | |||
0.1456 | 9.7466 | 0.000 | 0.0281 | 76.0753 | 0.000 | |||
0.1696 | 7.4475 | 0.000 | ||||||
Model |
Model |
|||||||
Stepwise Model 2 ( |
0.0269 | 306.0346 | 0.000 | 0.0273 | 171.8816 | 0.000 | ||
0.0306 | 302.3586 | 0.000 | 0.0226 | 174.2524 | 0.000 | |||
0.1456 | 9.7466 | 0.000 | 0.1696 | 17.4699 | 0.000 | |||
Model |
Model |
Measures of MROC curve and optimal cutpoint - simulation datasets
Dataset | Model | Opt c | (1 − | AUC |
---|---|---|---|---|
Simulation 1 | Full | −1.0277 | (0.3918, 0.6081) | 0.6486 |
Simulation 1 | Stepwise 1 | −1.0277 | (0.3918, 0.6081) | 0.6486 |
Simulation 1 | Stepwise 2 | −1.0277 | (0.3918, 0.6081) | 0.6486 |
Simulation 2 | Full | −1.1141 | (0.4007, 0.5992) | 0.6373 |
Simulation 2 | Stepwise 1 | −1.1784 | (0.4006, 0.5993) | 0.6372 |
Simulation 2 | Stepwise 2 | −1.0570 | (0.4010, 0.5989) | 0.6367 |
MROC = multivariate receiver operating characteristic; AUC = area under the curve.