Assume that the set
Classification results can be expressed by a confusion matrix. The true positive (TP) and true negative (TN) in the confusion matrix represent the numbers of correctly classified diseased and non-diseased populations. The true positive rate (TPR, sensitivity) and true negative rate (TNR) have the following relationship: for any threshold (cut-off point)
In contrast, the false positive (FP) and false negative (FN) are the numbers of non-diseased subjects predicted as diseased and of diseased subjects predicted as non-diseased, respectively. The false positive rate (FPR, 1 – specificity) and false negative rate (FNR) have the following relationship:
The total number of disease groups,
The ‘sensitivity’ and ‘specificity’ are the basic accuracy measures of diagnosis. Whereas the ‘sensitivity’ is the probability that a patient with an actual disease will be positive (
When a test result is positive or negative, what is often needed is the probability that the actual health condition is disease or non-disease given that result, rather than how accurate the diagnosis is. Two other measures address this need: the positive predictive value (PPV) and negative predictive value (NPV) suggested by Raslich
The neutral zone was proposed to ensure the retesting of diseases with high diagnostic costs. Both the PPV and NPV were also derived when the neutral zone is present (Daniel and Steven, 2016).
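The relation between the predictive values and the two basic accuracy measures can be sketched with Bayes' theorem. The snippet below is only an illustration: the function name `predictive_values` and the numeric inputs are assumptions introduced here, not quantities taken from this paper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence (Bayes' theorem)."""
    # Probability of a positive and of a negative test result.
    p_pos = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    p_neg = (1.0 - sensitivity) * prevalence + specificity * (1.0 - prevalence)
    ppv = sensitivity * prevalence / p_pos          # P(disease | positive)
    npv = specificity * (1.0 - prevalence) / p_neg  # P(non-disease | negative)
    return ppv, npv


# Hypothetical inputs, chosen only for illustration.
ppv, npv = predictive_values(sensitivity=0.90, specificity=0.80, prevalence=0.10)
print(f"PPV = {ppv:.4f}, NPV = {npv:.4f}")
```

With a low prevalence, the PPV can be small even when the sensitivity and specificity are both high, which is why the two predictive values carry information that the accuracy measures alone do not.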
Based on statistical decision theory, the ROC curve is a visual tool that can easily identify the classifier’s performance in binary classification (Egan, 1975; Zweig and Campbell, 1993; Bradley, 1997; Fawcett and Provost, 1997; Pepe, 2000; Hong and Lee, 2018). Since the ROC curve is implemented with (FPR, TPR) = (1 – specificity, sensitivity) in a unit length square, the ‘sensitivity’ and ‘specificity’ can be represented by the ROC curve. In contrast, the PPV and NPV cannot be represented by the ROC curve. Pontius and Si (2014) proposed the total operating characteristic (TOC) curve, which is expressed with the numbers of observations of TP, FN, FP, and TN. The TOC curve is implemented as a curve in a parallelogram with a hypotenuse slope of 45 degrees, belonging to a rectangle, where the lengths of the horizontal and vertical axes are
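To make the link between the two plots concrete, the following sketch maps a TOC point, taken here as (TP + FP, TP), to the corresponding ROC point (FPR, TPR). The function name and the example numbers are hypothetical and serve only as an illustration.

```python
def toc_to_roc(x, y, n_pos, n_neg):
    """Map a TOC point (x, y) = (TP + FP, TP) to the ROC point (FPR, TPR)."""
    tp = y
    fp = x - y
    return fp / n_neg, tp / n_pos


# Hypothetical example: 100 diseased and 200 non-diseased subjects.
fpr, tpr = toc_to_roc(x=110, y=62, n_pos=100, n_neg=200)
print(fpr, tpr)
```

The mapping divides by the two group sizes, so the ROC point no longer carries the counts needed for the ratio y / x. This is one way to see why the PPV and NPV are not readable from the ROC curve alone.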
In this study, we investigate how the PPV and NPV, neither of which can be expressed by the ROC curve, can be represented by the TOC curve. We explain their geometric meaning and examine both measures in relation to the shape of the TOC curve, proposing new explanatory methods that describe the PPV and NPV geometrically. The PPV and NPV are also described geometrically on the TOC curve when the neutral zone is present, and the relationships between the two descriptions are discussed mathematically.
Section 2 introduces the TOC curve, defines the PPV and NPV, and discusses geometrically how both are represented by the TOC curve. Section 3 derives the PPV and NPV with probability equations when the neutral zone is present, expresses them by the TOC curve, and compares this description geometrically with that of Section 2. Section 4 generates two random samples representing disease and non-disease states with different sample sizes, and obtains and interprets the values of PPV and NPV using the results derived in this work. It also generates two random samples in the presence of the neutral zone under several settings of the two types of errors and explores the resulting values. Finally, conclusions are summarized in Section 5.
If two samples are skewed, imbalanced data or very different in size,
Figure 1 shows a TOC plot. The unit on both the horizontal and vertical axes of the TOC plot indicates the number of observations. The horizontal axis ranges from 0 to
From any point
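The construction of the TOC plot can be sketched as follows. This is a minimal illustration under the assumption that a subject is classified as diseased when its score is at least the threshold; the function name and the scores are hypothetical.

```python
def toc_points(scores_pos, scores_neg, thresholds):
    """Return TOC points (TP + FP, TP), one per threshold.

    Assumption: a subject is classified as diseased when score >= threshold.
    """
    points = []
    for c in thresholds:
        tp = sum(1 for s in scores_pos if s >= c)  # diseased, predicted diseased
        fp = sum(1 for s in scores_neg if s >= c)  # non-diseased, predicted diseased
        points.append((tp + fp, tp))
    return points


# Hypothetical scores, for illustration only.
diseased = [2.1, 1.4, 0.9, 0.3]
non_diseased = [1.0, 0.6, 0.2, -0.1, -0.5, -1.2]
points = toc_points(diseased, non_diseased, thresholds=[1.5, 0.5, -2.0])
print(points)
```

When the threshold lies below every score, all subjects are predicted as diseased and the point reaches the upper right corner of the plot, here (10, 4), that is, the total number of subjects and the number of diseased subjects.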
The conditional probabilities PPV and NPV of the actual disease or non-disease state given a positive or negative diagnosis result in (1.1) can be expressed with frequencies of the confusion matrix, as shown in (2.1). For any threshold
The denominators of PPV(
For any threshold
PPV(
1 – NPV(
where the coordinate of
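One way to read the two quantities as slopes can be sketched numerically. The sketch assumes the TOC point is (x, y) = (TP + FP, TP) and the upper right corner is (P + N, P), where P and N denote the numbers of diseased and non-diseased subjects. The function name is hypothetical, and the counts are those of Table 6 (a).

```python
def ppv_npv_from_toc_point(x, y, n_pos, n_neg):
    """PPV and NPV read as slopes at a TOC point (x, y) = (TP + FP, TP)."""
    total = n_pos + n_neg
    if not 0 < x < total:
        raise ValueError("PPV or NPV is undefined at the end points of the curve")
    ppv = y / x                              # slope from the origin to (x, y)
    one_minus_npv = (n_pos - y) / (total - x)  # slope from (x, y) to (P + N, P)
    return ppv, 1.0 - one_minus_npv


# Counts from Table 6 (a): TP = 997, FP = 347, FN = 194, TN = 6604.
ppv, npv = ppv_npv_from_toc_point(x=997 + 347, y=997, n_pos=1191, n_neg=6951)
print(f"PPV = {ppv:.4f}, NPV = {npv:.4f}")
```

The same values follow directly from the confusion matrix, PPV = TP / (TP + FP) and NPV = TN / (TN + FN), so the slope reading is a restatement of the frequency form rather than a new estimator.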
Since both the ROC curve and the TOC curve are concave, both PPV(
The slope from the origin to the rightmost coordinate (
For any threshold
Since PPV(
Sometimes, one of the rates FPR and FNR is viewed as more important to control. However, it is not possible to choose a point on the ROC curve that achieves targeted values for both FPR and FNR. In addition, difficult classification problems incur the risk of a high FPR and/or a high FNR when the features of the two groups are not very different. In such situations, the most desirable solution is to identify features with higher discrimination ability. When this is not possible, an alternative is the neutral zone classifier proposed by Daniel and Steven (2016), which adds a soft classification outcome, “neutral”, in order to handle ambiguous cases and allow the control of both FPR and FNR. Consider the following classifier:
Two constants
The solutions for the two constants are
Both NZR_{0} and NZR_{1} are denoted as probabilities for the neutral zone,
The NZ_{0} and NZ_{1} were defined as NZ_{0} =
Daniel and Steven (2016, p.2350) defined both PPV =
Note that both the PPV and NPV in (3.2) depend on
Suppose that
It is also found that subtracting NZ_{1} from the vertical height of
The PPV and NPV in the presence of the neutral zone can also be expressed as in Result 3 by using (3.3).
For any two thresholds
Note that whereas both the PPV and NPV in (3.2) depend on
We also conclude that both the PPV and NPV in the presence of the neutral zone could be explained as the slopes of two right-angled triangles. Based on the coordinate on the TOC curve, TP(
By arguments analogous to those in Section 2, the denominator of 1 – NPV(
For any two thresholds
PPV(
1 – NPV(
where
Note that the coordinates of
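The two-point reading in the presence of the neutral zone can be sketched as below. The points are the rounded coordinates listed in the first row of Table 1 (b); the group sizes (100 diseased, 200 non-diseased), the function name, and the variable names are assumptions made for this illustration.

```python
def neutral_zone_summary(upper_pt, lower_pt, n_pos, n_neg):
    """Summarize a neutral zone from two TOC points.

    upper_pt: TOC point (TP + FP, TP) at the larger threshold (positive calls).
    lower_pt: TOC point at the smaller threshold; subjects below it are negative calls.
    """
    x_u, y_u = upper_pt
    x_l, y_l = lower_pt
    total = n_pos + n_neg
    ppv = y_u / x_u                               # slope from the origin to upper_pt
    npv = 1.0 - (n_pos - y_l) / (total - x_l)     # one minus slope from lower_pt to the corner
    nz_pos = y_l - y_u                            # diseased subjects inside the neutral zone
    nz_neg = (x_l - x_u) - nz_pos                 # non-diseased subjects inside the neutral zone
    return {"ppv": ppv, "npv": npv, "nz_pos": nz_pos, "nz_neg": nz_neg}


# Rounded points from the first row of Table 1 (b); group sizes are assumed.
res = neutral_zone_summary(upper_pt=(36, 26), lower_pt=(212, 90), n_pos=100, n_neg=200)
print(res)
```

Because the coordinates are rounded to integers, the resulting PPV and NPV agree with the tabulated values only to about three decimal places.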
For
Let the distribution function representing a disease state be a normal distribution function with mean
For each case, the coordinate of point
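A sketch of how an expected TOC point can be computed under a normal model is given below. All parameter values are assumptions made for illustration (unit-variance normal scores, non-disease mean 0, disease mean 1, 100 diseased and 200 non-diseased subjects); they are not quoted from the text.

```python
from statistics import NormalDist


def expected_toc_point(c, mu_pos, n_pos, n_neg, mu_neg=0.0, sigma=1.0):
    """Expected TOC point (TP + FP, TP) when scores >= c are called diseased."""
    tp = n_pos * (1.0 - NormalDist(mu_pos, sigma).cdf(c))
    fp = n_neg * (1.0 - NormalDist(mu_neg, sigma).cdf(c))
    return tp + fp, tp


# Assumed parameters, for illustration only.
n_pos, n_neg = 100, 200
x, y = expected_toc_point(c=0.70, mu_pos=1.0, n_pos=n_pos, n_neg=n_neg)
ppv = y / x
npv = 1.0 - (n_pos - y) / (n_pos + n_neg - x)
print(round(x), round(y), round(ppv, 5), round(npv, 5))
```

Under these assumed values, the point rounds to (110, 62), which is close to the first row of Table 1 (a).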
To explore the presence of the neutral zone explained in Section 3, set
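Under a normal model, the two constants of a neutral zone classifier can be obtained by inverting the two distribution functions. The sketch below assumes unit-variance normal scores with non-disease mean 0 and disease mean 1; the names `alpha` and `beta` for the controlled FPR and FNR levels are introduced here for illustration.

```python
from statistics import NormalDist


def neutral_zone_thresholds(alpha, beta, mu_pos, mu_neg=0.0, sigma=1.0):
    """Return (lower, upper) thresholds controlling FPR at alpha and FNR at beta."""
    # P(non-disease score > upper) = alpha
    upper = NormalDist(mu_neg, sigma).inv_cdf(1.0 - alpha)
    # P(disease score < lower) = beta
    lower = NormalDist(mu_pos, sigma).inv_cdf(beta)
    return lower, upper


# Assumed parameters, for illustration only.
lower, upper = neutral_zone_thresholds(alpha=0.05, beta=0.10, mu_pos=1.0)
print(round(lower, 4), round(upper, 4))
```

A neutral zone exists only when the lower threshold is smaller than the upper one; otherwise both error rates can be met by a single threshold.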
For the first case, the area under the curve (AUC) = 0.76025, so the discriminatory power is adequate. Based on Table 1 (a), it can be found that as the value of
Equation (2.2) is also confirmed for
From Table 1 (b), when
Figures 3 (a) and (b) demonstrate the TOC curve, points
Case 2 has a different mean value of the distribution function for the disease state, compared with Case 1. For Case 2, AUC = 0.9214, which indicates better discriminatory power than Case 1, so that points
Case 3 has the same mean value and different sample sizes of
Case 4 has different sample sizes of
Credit evaluation data were collected by a Korean domestic
Based on the real data in Table 5, Figure 5 (a) shows the ROC curve, whose AUC is 0.9622. At the optimal threshold, RS rank = 16, sensitivity = 0.8371 and specificity = 1 – 0.0499 = 0.9501. These values can also be obtained from Table 6 (a): sensitivity = 997
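The confusion matrix at a given risk score rank can be recovered from the counts in Table 5, as the sketch below shows. It assumes that the second stage corresponds to non-default and the third stage to default, and that ranks above the cut are classified as default, which is consistent with Table 6 (a). The function name is hypothetical.

```python
# Counts per risk score rank 1..20, copied from Table 5.
second_stage = [4, 1, 17, 55, 56, 175, 189, 279, 518, 421,
                875, 761, 631, 1102, 555, 965, 347, 0, 0, 0]
third_stage = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               10, 7, 34, 0, 92, 51, 64, 306, 111, 516]


def confusion_at(rank_cut):
    """Return (TP, FN, FP, TN); ranks above rank_cut are classified as default."""
    tn = sum(second_stage[:rank_cut])  # non-default, classified non-default
    fp = sum(second_stage[rank_cut:])  # non-default, classified default
    fn = sum(third_stage[:rank_cut])   # default, classified non-default
    tp = sum(third_stage[rank_cut:])   # default, classified default
    return tp, fn, fp, tn


tp, fn, fp, tn = confusion_at(16)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(tp, fn, fp, tn, round(sensitivity, 4), round(specificity, 4))
```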
Because the PPV and NPV cannot be explained on the ROC curve, Figure 5 (b) presents the TOC curve. It shows that the values of PPV and NPV outside
The neutral zone can be set from
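The predictive values in the presence of the neutral zone can be computed directly from the counts in Table 6 (b). The sketch below takes the PPV as the share of real defaults among subjects classified as default and the NPV as the share of real non-defaults among subjects classified as non-default; the dictionary layout and variable names are illustrative.

```python
# Counts copied from Table 6 (b): (real non-default, real default) per outcome.
table_6b = {
    "non_default": (2590, 10),
    "neutral": (4014, 184),
    "default": (347, 997),
}

fp, tp = table_6b["default"]
tn, fn = table_6b["non_default"]
ppv = tp / (tp + fp)  # among subjects classified as default
npv = tn / (tn + fn)  # among subjects classified as non-default

total = sum(a + b for a, b in table_6b.values())
neutral_share = sum(table_6b["neutral"]) / total  # fraction left unclassified

print(round(ppv, 4), round(npv, 4), round(neutral_share, 4))
```

In this sketch the PPV is the same as without the neutral zone because the upper cut is unchanged, while the NPV rises at the cost of leaving roughly half of the subjects in the neutral zone.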
The ‘sensitivity’ and ‘specificity’ are well-known measures for evaluating the accuracy and performance of a diagnosis. When a test result is positive or negative, the probability that the actual health condition is disease or non-disease given that result may be more important than how accurate the diagnosis is. Two other measures address this interest: the PPV and NPV proposed by Zhou
The ‘sensitivity’ and ‘specificity’ can be described on the ROC curve; however, the PPV and NPV cannot be represented by it. In this paper, explanatory methods are proposed that describe the PPV and NPV geometrically by the TOC curve. The coordinate of a certain point corresponding to any threshold
Therefore, it may be concluded that PPV(
The neutral zone was suggested to ensure the retesting of diseases with high diagnostic costs. Daniel and Steven (2016) derived definitions of both the PPV and NPV when the neutral zone is present. When the neutral zone is present, the PPV and NPV are also geometrically described and implemented on the TOC curve in this study. The two constants
PPV and NPV by the TOC curve. PPV = positive predictive value; NPV = negative predictive value; TOC = total operating characteristic.
PPV and NPV with neutral zone by the TOC curve. PPV = positive predictive value; NPV = negative predictive value; TOC = total operating characteristic.
Total operating characteristic curves for Case 1.
Total operating characteristic curves for Case 4.
ROC curve and TOC curve for real data. ROC = receiver operating characteristic; TOC = total operating characteristic.
Results of PPV and NPV for Case 1
(a) Results corresponding to each threshold
Threshold | Point on the TOC curve | PPV | NPV |
---|---|---|---|
0.70 | (110, 62) | 0.56080 | 0.79871 |
0.60 | (120, 66) | 0.54440 | 0.80815 |
0.50 | (131, 69) | 0.52842 | 0.81759 |
0.40 | (141, 73) | 0.51293 | 0.82698 |
0.30 | (152, 76) | 0.49798 | 0.83627 |
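(b) Results with the neutral zone
FPR level | FNR level | Upper threshold | Point on the TOC curve | PPV | Lower threshold | Point on the TOC curve | NPV |
---|---|---|---|---|---|---|---|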
0.05 | 0.10 | 1.6449 | (36, 26) | 0.7219 | −0.2816 | (212, 90) | 0.8861 |
0.10 | 0.05 | 1.2816 | (59, 39) | 0.6605 | −0.6448 | (243, 95) | 0.9121 |
PPV = positive predictive value; NPV = negative predictive value.
Results of PPV and NPV for Case 2
(a) Results corresponding to each threshold
Threshold | Point on the TOC curve | PPV | NPV |
---|---|---|---|
1.2 | (102, 79) | 0.77399 | 0.89309 |
1.1 | (109, 82) | 0.75045 | 0.90377 |
1.0 | (116, 84) | 0.72614 | 0.91384 |
0.9 | (123, 86) | 0.70131 | 0.92325 |
0.8 | (131, 88) | 0.67622 | 0.93197 |
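(b) Results with the neutral zone
FPR level | FNR level | Upper threshold | Point on the TOC curve | PPV | Lower threshold | Point on the TOC curve | NPV |
---|---|---|---|---|---|---|---|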
0.05 | 0.10 | 1.6449 | (74, 64) | 0.8646 | 0.7185 | (137, 90) | 0.9386 |
0.10 | 0.05 | 1.2816 | (96, 76) | 0.7925 | 0.3552 | (167, 95) | 0.9623 |
PPV = positive predictive value; NPV = negative predictive value.
Results of PPV and NPV for Case 3
(a) Results corresponding to each threshold
Threshold | Point on the TOC curve | PPV | NPV |
---|---|---|---|
0.70 | (148, 124) | 0.83627 | 0.49798 |
0.60 | (159, 131) | 0.82698 | 0.51293 |
0.50 | (169, 138) | 0.81759 | 0.52842 |
0.40 | (180, 145) | 0.80815 | 0.54440 |
0.30 | (190, 152) | 0.79871 | 0.56080 |
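(b) Results with the neutral zone
FPR level | FNR level | Upper threshold | Point on the TOC curve | PPV | Lower threshold | Point on the TOC curve | NPV |
---|---|---|---|---|---|---|---|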
0.05 | 0.10 | 1.6449 | (57, 52) | 0.9121 | −0.2816 | (241, 180) | 0.6605 |
0.10 | 0.05 | 1.2816 | (88, 78) | 0.8861 | −0.6448 | (264, 190) | 0.7218 |
PPV = positive predictive value; NPV = negative predictive value.
Results of PPV and NPV for Case 4
(a) Results corresponding to each threshold
Threshold | Point on the TOC curve | PPV | NPV |
---|---|---|---|
1.2 | (169, 158) | 0.93197 | 0.67622 |
1.1 | (177, 163) | 0.92325 | 0.70131 |
1.0 | (184, 168) | 0.91384 | 0.72614 |
0.9 | (191, 173) | 0.90377 | 0.75045 |
0.8 | (198, 177) | 0.89309 | 0.77399 |
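(b) Results with the neutral zone
FPR level | FNR level | Upper threshold | Point on the TOC curve | PPV | Lower threshold | Point on the TOC curve | NPV |
---|---|---|---|---|---|---|---|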
0.05 | 0.10 | 1.6449 | (133, 128) | 0.9626 | 0.7185 | (204, 180) | 0.7925 |
0.10 | 0.05 | 1.2816 | (163, 153) | 0.9386 | 0.3552 | (226, 190) | 0.8646 |
PPV = positive predictive value; NPV = negative predictive value.
Real data with risk score rank and stages
Risk score rank | Second stage | Third stage | Sum |
---|---|---|---|
1 | 4 | 0 | 4 |
2 | 1 | 0 | 1 |
3 | 17 | 0 | 17 |
4 | 55 | 0 | 55 |
5 | 56 | 0 | 56 |
6 | 175 | 0 | 175 |
7 | 189 | 0 | 189 |
8 | 279 | 0 | 279 |
9 | 518 | 0 | 518 |
10 | 421 | 0 | 421 |
11 | 875 | 10 | 885 |
12 | 761 | 7 | 768 |
13 | 631 | 34 | 665 |
14 | 1,102 | 0 | 1,102 |
15 | 555 | 92 | 647 |
16 | 965 | 51 | 1,016 |
17 | 347 | 64 | 411 |
18 | 0 | 306 | 306 |
19 | 0 | 111 | 111 |
20 | 0 | 516 | 516 |
Sum | 6,951 | 1,191 | 8,142 |
Confusion matrix in optimal threshold
(a) At RS rank = 16
Classified as | Real non-default | Real default |
---|---|---|
Non-default | 6604 | 194 |
Default | 347 | 997 |
(b) At RS rank = 16 and RS rank = 11
Classified as | Real non-default | Real default |
---|---|---|
Non-default | 2590 | 10 |
Neutral zone | 4014 | 184 |
Default | 347 | 997 |
RS = risk score.