search for

CrossRef (0)
L1-penalized AUC-optimization with a surrogate loss
Communications for Statistical Applications and Methods 2024;31:203-212
Published online March 31, 2024
© 2024 Korean Statistical Society.

Hyungwoo Kim1,a, Seung Jun Shinb

aDepartment of Statistics and Data Science, Pukyong National University, Korea;
bDepartment of Statistics, Korea University, Korea
Correspondence to: 1 Department of Statistics and Data Science, Pukyong National University, 45 Yongso-ro, Nam-gu, Busan 04853, Korea. E-mail: khw7682@pknu.ac.kr
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00242528).
Received January 4, 2024; Revised February 22, 2024; Accepted February 23, 2024.
The area under the ROC curve (AUC) is one of the most common criteria used to measure the overall performance of binary classifiers for a wide range of machine learning problems. In this article, we propose a L1-penalized AUC-optimization classifier that directly maximizes the AUC for high-dimensional data. Toward this, we employ the AUC-consistent surrogate loss function and combine the L1-norm penalty which enables us to estimate coeffcients and select informative variables simultaneously. In addition, we develop an effcient optimization algorithm by adopting k-means clustering and proximal gradient descent which enjoys computational advantages to obtain solutions for the proposed method. Numerical simulation studies demonstrate that the proposed method shows promising performance in terms of prediction accuracy, variable selectivity, and computational costs.
Keywords : AUC-optimization, AUC consistency, variable selection, L1-norm penalty, clustering and proximal gradient descent
1. Introduction

Classifying the binary data is a fundamental task in the machine learning community, where the primary goal is to categorize data into one of two classes. Most existing classification algorithms are designed to maximize the accuracy (or equivalently minimize the error rate), however, it may be a misleading performance metric for imbalanced data with few samples in the minority class since they classify all examples into the majority class by achieving a high classification accuracy.

A receiver operation characteristic (ROC) analysis originated from signal detection theory (Egan, 1975) can be a great alternative for imbalanced classification problems and it has become the standard methodology to evaluate the model performance of the binary classifier in various fields such as psychology, medicine, bioinformatics, and machine learning. The ROC curve is a graphical plot that features a true positive rate (TPR) on the Y axis and a false positive rate (FPR) on the X axis. It is particularly useful for imbalanced data due to its diagnostic ability for evaluating the classifier’s performance over a variety of decision thresholds.

As a summary measure of the ROC curve, the area under the ROC curve (AUC) is the most popular scalar metric in classification. It equals to the probability that a classifier will rank a randomly drawn positive example higher than a randomly drawn negative example, which is closely related to the bipartite ranking problem (Agarwal et al., 2005; Clémençon et al., 2008; Menon and Williamson, 2016). Namely, the population AUC of a classifier f is defined as


where X+ and X denote a p-dimensional input vector that belongs to the binary class Y = +1 and Y = −1, respectively. The perfect classifier f takes the value of 1 for AUC while the random guess does 0.5.

It is noteworthy that a high AUC value represents the better classification performance of the classifier, and thus a huge amount of studies have been devoted to directly maximizing the AUC in the two past decades. They include various forms such as linear models (Brefeld and Scheffer, 2005; Ying et al., 2016; Norton and Uryasev, 2019), kernel models (Rakotomamonjy, 2004; Ataman et al., 2006), and neural networks (Liu et al., 2019; Yuan et al., 2021). In addition, several AUC-optimization algorithms have been developed under different loss functions (Rakotomamonjy, 2004; Zhang et al., 2012; Natole et al., 2018; Yang et al., 2020; Lei and Ying, 2021). Due to the non-continuity of AUC (1.1) in sample level, most learning algorithms for AUC maximization resort to a convex relaxation of the loss function such as hinge (Rakotomamonjy, 2004) for support vector machine (SVM), exponential (Clémençon et al., 2013) for boosting, least square (Natole et al., 2018), and logistic surrogate (Yang et al., 2020). A typical choice for AUC maximization is hinge loss because of its computational advantage via quadratic or linear programming. After Rakotomamonjy (2004) introduced a SVM-type approach that directly maximizes the AUC, called the ROC-SVM, several variants of the ROC-SVM have been studied (Arzhaeva et al., 2006; Tian et al., 2011; Zhao et al., 2011; Ying et al., 2016).

Theoretical analysis to make use of surrogate loss in the AUC-maximization context has recently been examined. Gao and Zhou (2015) first introduced the notion of AUC consistency that guarantees the solution of the AUC-optimization with surrogate loss yields the one with original loss in the asymptotic sense. Moreover, Gao and Zhou (2015); Uematsu and Lee (2017) provided a sufficient condition for AUC consistency and proved that hinge loss is inconsistent with AUC because of its non-differentiability at one. To our knowledge, there lacks of study on variable selection for AUC-optimization with the AUC-consistent surrogate loss. In this work, we try to this gap by employing a logistic surrogate loss function which is statistically AUC consistent (Gao and Zhou, 2015) and combining the L1-norm penalty (Tibshirani, 1996). The proposed method performs coefficient estimation and automatic variable selection simultaneously due to the L1 penalty, which is very useful for the prediction and interpretability of the classifier f in high-dimensional data analysis.

It is well known that AUC optimization is very computationally expensive because it has a form of summation of pairwise losses over pairs of examples, while most machine learning algorithms that focus on maximizing accuracy (e.g. SVM) take the sum of pointwise losses over individual examples. As mentioned earlier above, AUC-optimization with hinge loss has the advantage of computation by using solvers via quadratic programming (Rakotomamonjy, 2004) or linear programming (Arzhaeva et al., 2006) or even regularization path algorithms (Kim and Shin, 2020; Kim et al., 2021), but such techniques are inapplicable to the proposed AUC-optimization with logistic loss. To mitigate such computational burden for the proposed method, we develop an efficient optimization algorithm by adopting k-means clustering (Duda et al., 1973; Brefeld and Scheffer, 2005) and proximal gradient descent (PGD) (Rockafellar, 1976; Combettes and Wajs, 2005), which we call CPAUC. Through the k-means clustering algorithm, we approximate the objective function from a double summation formulation to a single one, which drastically reduces the computational costs. Also, the PGD algorithm enables us to solve the non-differentiable optimization problem with a fast convergence rate. By combining two techniques, the CPAUC algorithm successively reduces the running time for the L1-penalized AUC-optimization with the logistic loss.

The remainder of this article is organized as follows. In Section 2, we briefly describe the AUC consistency and give the formulation of L1-penalized AUC optimization with the surrogate loss. We present an efficient algorithm to find the parameter of interest for the proposed method in Section 3. Numerical simulation studies and real data analysis results are presented in Sections 4 and 5, respectively. The concluding remarks follow in Section 6.

2. L1-penalized AUC optimization with logistic loss

Suppose we are given n observations with ( x i , y i ) i = 1 n, where xi ∈ 꽍p is a p-dimensional predictor and yi ∈ {−1, +1} is a binary outcome. Let us define the index set I+ = {i : yi = +1} and I = {i : yi = −1}, and the corresponding number of positive and negative samples are n+ = |I+| and n = |I|, respectively. Considering a linear classifier f (x) = β0 + βT x with βT = (β1, . . . , βp), the empirical AUC of (1) is denoted by

1 n + n - i I + j I - 1 { f ( x i ) - f ( x j ) 0 }

where 1 { A } is an indicator function that takes value 1 if A is true and 0 otherwise. To maximize (2.1), it is equivalent to minimize 1 − AUC( f ) and thus we solve the minimization problem as follows: min ( 1 / n + n - ) Σ i I + Σ j I - 1 { f ( x i ) - f ( x j ) < 0 }. However, the direct optimization problem of (2.1) is ill-posed (Feldman et al., 2012) since the objective function (2.1) is not continuous and the solution may not be unique. To make it well-posed, one may consider the convex surrogate loss and employ the regularization term as follows:

argmin β 1 n + n - i I + j I - φ { f ( x i ) - f ( x j ) } + λ J ( β ) ,

where φ{·} is a convex surrogate function, J(β) is a regularization term that prevents overfitting, and λ > 0 is a regularization strength parameter. The most popular choice of φ in the AUC maximization context is the hinge loss for SVM, which takes the form φ(z) = max{0, 1 − z}. Rakotomamonjy (2004) proposed the AUC-maximizing SVM with L2-norm penalty J(β) = βTβ. Furthermore, for the effective variable selection, Ataman et al. (2006) and Tian et al. (2011) proposed AUC-maximizing SVM with L1-norm and Lq-norm (0 < q < 1) penalty, respectively.

Gao and Zhou (2015) introduced the AUC consistency that guarantees the optimal solution of surrogate loss for (2.2) without the regularization term (i.e., λ = 0) yields an optimal solution for (2.1) in the asymptotic sense. Moreover, Gao and Zhou (2015) showed that the hinge loss for SVM is not AUC-consistent because of its non-differentiability at z = 1. In this work, we propose a AUC-optimization classifier by adopting the AUC-consistent surrogate loss and combining the L1-norm penalty as follows:

β ^ n = argmin β Q n ( β ) : = 1 n + n - i I + j I - φ { β T ( x i - x j ) } + λ k = 1 p β k ,

which we call L1-AUC surrogate estimator. For the choice of AUC-consistent surrogate function φ, we consider a popular logistic loss function that takes the form φ(z) = log(1+exp(−z)). Notice that AUC-optimizing classification (2.3) does not involve an intercept β0 and thus it is unidentifiable. However, β0 becomes identifiable if the TPR or FPR is provided in practice. We remark that βn naturally becomes a sparse estimate due to the geometry of L1-norm penalty in (2.3), which is particularly useful when the predictor is high-dimensional.

3. CPAUC algorithm

Notice that the AUC-optimization algorithm requires considerable computational expenses because the objective function Q(β) in (2.3) takes a double sum loss over pairs of examples while most machine learning algorithms based on the accuracy (e.g. SVM) is a sum of loss over individual examples. Therefore, the running time for the AUC optimization will be acceptable for small to moderate samples but undesirable for large and high dimensional data.

To find an optimal solution (2.3) with a relatively low time complexity, we propose a clustering proximal AUC maximization (CPAUC) algorithm by adopting the k-means clustering and PGD algorithm. Since the optimization (2.3) is only involved in the predictors as a pair of observations δi j = xixj with n+n vectors, we can apply the k-means clustering to reduce the number of examples by generating N( 돦 n+n) new cluster centers, denoted by a p-dimensional vector cm where m = 1, …, N. Here, the choice of the number of clusters N is crucial for a given dataset and there are several techniques such as elbow curve (Thorndike, 1953) and silhouette (Rousseeuw, 1987) for finding an optimal N. In this work, we performed small Monte-Carlo simulations and chose sufficiently large N as the number of sample size, i.e., N = n. Then, we solve the following approximated AUC-optimization problem using n clustered samples:

argmin β Q ˜ n ( β ) : = 1 n m = 1 n φ { β T c m } + λ k = 1 p β k ,

which drastically reduce the time complexity of (2.3).

To solve (3.1) effectively, we employ the PGD algorithm that has recently attracted much attention because of its fast convergence rate and ability to handle the non-differentiable objective function. For this, we represent (3.1) as the sum of smooth and non-smooth convex function, i.e., Q ˜ n ( β ) = Q ˜ n s ( β ) + λ Σ k = 1 p β k , where Q ˜ n s ( β ) = ( 1 / n ) Σ m = 1 n φ { β T c m }. The PGD algorithm iteratively updates the parameter by solving the following problem at the tth iteration:

β ^ ( t + 1 ) = Prox κ { β ^ ( t ) - κ Q ˜ n s ( β ^ ( t ) ) } ,

where β ^ ( t ) = ( β ^ 1 ( t ) , , β ^ p ( t ) ) T denotes the tth current solution and κ(> 0) plays the role of learning rate. The proximal operator in (3.2) is defined as Prox κ { β } = argmin s p ( 1 / 2 κ ) β - s 2 2 + λ s 1 and ∇ denotes the first partial derivative operator with respect to β. To summarize, the proposed algorithm CPAUC is presented in Algorithm 1.

4. Simulation

In this section, we carry out numerical experiments to investigate the superior performance of the proposed method in terms of prediction accuracy, variable selection, and computing time. All experiments are implemented with 3.4GHz Intel Core(TM), 64GB RAM and 64-bit R version 4.3.1.

Algorithm 1 : Clustering proximal AUC maximization (CPAUC)

Input: y, κ, λ, and standardized x.
Output: An approximated solution (3.1)
1: Clustering: Apply the k-means clustering for δi j with n+n examples to generate n centered clusters.
2: Proximal gradient descent:
 (a) Initialize β(0) = 0 with t = 0
 (b) repeat Compute  β ^ ( t + 1 ) Prox κ { β ^ ( t ) - κ Q n s ( β ^ ( t ) ) } t t + 1
  until convergence

4.1. Sample performance

To examine the prediction accuracy and variable selection ability of the proposed method, we compare it with the popular binary classification methods: L2-SVM (Cortes and Vapnik, 1995), L1-SVM (Zhu et al., 2003), L2-ROCSVM (Rakotomamonjy, 2004), L1-ROCSVM (Kim et al., 2021), and L2-ROCLOG. In this work, we denote the L2-penalized AUC-optimization with logistic loss by L2-ROCLOG and the proposed method using all examples and n-clustered samples by L1-ROCLOG1 and L1-ROCLOG2, respectively.

We consider the imbalanced scenario with the proportion of positive class as 0.9, i.e., 꽇(Y = +1) = 1−꽇(Y = −1) = 0.9. For given the class label Y, we generate the predictor X from multivariate normal (MN) distribution with MN(μ, ∑) for the positive class and MN(−μ, ∑) for the negative class, where a mean vector μ = ( z k T , 0 p - k T ) T and covariance structure ∑. The data is generated from the following four models:

  • (M1) μ T = ( 0.8 , - 0.8 , 0.8 , 0 p - 3 T ) and ∑ = 2Ip.

  • (M2) μ T = ( 0.8 , - 0.8 , 0.8 , 0 p - 3 T ) and ∑ = (σi j) with its non-zero elements σii = 2 for i = 1, . . . , p and σi j = 0.3|ij| for 1 ≤ ij ≤ 3.

  • (M3) μ T = ( 1 , - 0.8 , 0.6 , - 0.4 , 0.2 , 0 p - 5 T ) and ∑ = 2Ip.

  • (M4) μ T = ( 1 , - 0.8 , 0.6 , - 0.4 , 0.2 , 0 p - 5 T ) and ∑ = (σi j) with its non-zero elements σii = 2 for i = 1, . . . , p and σi j = 0.3|ij| for 1 ≤ ij ≤ 5.

Notice that (M1)–(M4) present the sparse models where the first k components are only informative. For (M2) and (M4), we set the pairwise correlation between xi and xj to be 0.3|ij| for 1 ≤ ijk. We consider a different combination of the training sample size and the predictor dimension with (n, p) ∈ {100, 200} × {20, 40}. For a fair comparison with tuning parameter selection, we implement a 5-fold cross-validation (CV) over the interval 1.8{−8,−7,...,1,2} and then evaluate each classifier’s performance based on the 500 independent test samples generated from (M1)–(M4). We also examine the variable selection ability by identifying the number of correctly selected relevant variables (CS) and incorrectly selected (ICS) irrelevant variables, respectively.

Table 1 reports the averaged test AUC computed from seven classifiers over 100 independent repetitions. It is shown that all classification methods get better as the sample size increases and worse the predictor dimension increases. One can observe that AUC-maximization algorithms consistently yield better results than those based on the accuracy within the same penalty function. It is not surprising that L1-penalized classifiers lead to great prediction power compared to the L2-penalized ones by removing noise variables, especially as the predictor dimension increases. We note that ROCLOG works slightly better than ROCSVM with hinge loss, which is known to be inconsistent with AUC. The results for L1-ROCLOG1 and L1-ROCLOG2 are comparable in terms of AUC, but the latter is significantly more computationally efficient compared with the former, which will be discussed in detail in Section 4.2.

Table 2 summarizes the variable selection results of L1-penalized classifiers in terms of CS and ICS over 100 independent repetitions. It is observed that L1-SVM does worst in terms of ICS for all models. The variable selection performance of L1-ROCSVM and L1-ROCLOG2 gets better as the sample size increases. We can see that the results of AUC-optimization classifiers are comparable, but L1-ROCSVM slightly works better than L1-ROCLOG2 in terms of ICS.

4.2. Computing time

In this subsection, we show the advantage of the proposed CPAUC algorithm in terms of computing time. We compare it with the one based on the PGD algorithm using all observations (without clustering), which we call PAUC. The simulation settings are identical to (M1) in Section 4.1. We reported the running time in the setting of increasing sample size and predictor dimension with (n, p) ∈ {100, 200, 300, 400} × {20, 40}.

Figure 1 denotes the comparison of running time (in seconds) between CPAUC and PAUC over 100 independent repetitions. The regularization parameter λ is chosen with the maximum AUC value computed from 5-CV. One can observe that the running time of CPAUC is significantly reduced after applying the k-means clustering algorithm. We also remark that CPAUC becomes more computationally efficient than PAUC as the sample size and the predictor dimension increase.

5. Real data analysis

In this section, we apply the proposed method to the Pima Indians Diabetes (PID) data set from the national institute of diabetes and digestive and kidney disease, which is available at Kaggle and UCI Machine Learning Repository. The main purpose of this study is to accurately predict whether a Pima Indian female has diabetes or not. This data set contains 768 participants (268 diabetic and 500 non-diabetic) with 8 real-valued input variables.

We first split PID data as 70% train and 30% test set, respectively. We applied L2-SVM, L1-SVM, L2-ROCLOG, and L1-ROCLOG described in Section 4.1 to the train set and then computed the AUC using test set. The optimal regularization parameter for each classifier is chosen by a 5-fold CV. In this work, we excluded ROCSVM classifiers because they fail to produce the solutions from both quadratic programming and regularization paths in R-package.

Figure 2 presents the test ROC curves of the four classifiers and Table 3 reports the corresponding test AUC and the number of selected predictors. It is observed that the proposed method attains the highest test AUC with a minimal number of predictors.

6. Concluding remarks

In this paper, we consider the L1-penalized AUC optimization with logistic loss that directly maximizes the AUC. We also develop an efficient algorithm by adopting the k-means clustering and PGD algorithm to find the solution of the proposed method. Through numerical experiments, we demonstrate the promising performance of the proposed method in terms of prediction accuracy, variable selection, and running time compared with the existing classifiers.

In addition to the linear classifier for AUC-maximization with the logistic loss, the extension to nonlinear classifiers such as the tree-based model, generalized additive model, kernel model, or even deep neural network model will be possible. It is very challenging due to its high computational complexity, but we believe it would be interesting for further research. Another challenging task is that most clustering algorithms suffer from the curse of dimensionality, which may deteriorate the performance of the CPAUC algorithm when the predictor dimension is very large. To avoid this, we will try to adopt dimension reduction clustering techniques (Bouveyron and Brunet-Saumard, 2014; Cohen et al., 2015) for the proposed method, which requires further investigation.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00242528).
Fig. 1. Comparison of running time for PAUC and CPAUC.
Fig. 2. The comparison of the ROC curve with four classifiers. The L1-penalized ROCLOG outperforms all others.

Algorithm 1

Clustering proximal AUC maximization (CPAUC)

Input:y, κ, λ, and standardized x.
Output: An approximated solution (3.1)
1:Clustering: Apply the k-means clustering for δi j with n+n examples to generate n centered clusters.
2:Proximal gradient descent:
 (a) Initialize β̂(0) = 0 with t = 0
 (b) repeatCompute β^(t+1)Proxκ{β^(t)-κQns(β^(t))}tt+1
  until convergence

Table 1

The average of test AUC over 100 independent repetitions are reported for (M1)–(M4)

(M1)100200.830 (0.05)0.852 (0.05)0.820 (0.05)0.857 (0.06)0.851 (0.04)0.870 (0.04)0.868 (0.04)
400.776 (0.05)0.828 (0.05)0.759 (0.06)0.824 (0.07)0.809 (0.05)0.850 (0.05)0.849 (0.04)

200200.869 (0.04)0.879 (0.04)0.854 (0.05)0.886 (0.05)0.884 (0.04)0.891 (0.04)0.893 (0.04)
400.827 (0.03)0.861 (0.03)0.804 (0.05)0.886 (0.04)0.856 (0.03)0.889 (0.04)0.885 (0.04)

(M2)100200.856 (0.04)0.876 (0.04)0.833 (0.05)0.878 (0.05)0.869 (0.04)0.889 (0.04)0.886 (0.04)
400.803 (0.05)0.865 (0.05)0.798 (0.05)0.876 (0.06)0.836 (0.04)0.882 (0.05)0.879 (0.06)

200200.896 (0.03)0.907 (0.03)0.884 (0.03)0.919 (0.02)0.906 (0.02)0.923 (0.02)0.923 (0.02)
400.855 (0.03)0.895 (0.03)0.845 (0.04)0.909 (0.03)0.870 (0.04)0.911 (0.03)0.909 (0.03)

(M3)100200.854 (0.04)0.887 (0.04)0.856 (0.05)0.869 (0.05)0.874 (0.04)0.885 (0.04)0.882 (0.04)
400.807 (0.03)0.854 (0.03)0.797 (0.05)0.865 (0.04)0.843 (0.03)0.875 (0.04)0.874 (0.04)

200200.890 (0.03)0.895 (0.03)0.871 (0.03)0.901 (0.03)0.902 (0.03)0.908 (0.03)0.905 (0.03)
400.857 (0.03)0.881 (0.03)0.816 (0.04)0.898 (0.04)0.882 (0.03)0.901 (0.03)0.898 (0.03)

(M4)100200.894 (0.02)0.903 (0.02)0.887 (0.02)0.896 (0.03)0.902 (0.03)0.910 (0.03)0.909 (0.04)
400.835 (0.03)0.886 (0.03)0.827 (0.03)0.888 (0.04)0.865 (0.03)0.891 (0.03)0.888 (0.04)

200200.919 (0.02)0.921 (0.02)0.901 (0.03)0.928 (0.02)0.926 (0.02)0.931 (0.02)0.929 (0.02)
400.886 (0.03)0.915 (0.03)0.852 (0.04)0.922 (0.02)0.899 (0.03)0.925 (0.02)0.920 (0.02)

The numbers in the parentheses are standard errors.

Table 2

The average of CS and ICS over 100 independent repetitions are reported for (M1)–(M4)










Table 3

The number of selected predictors and test AUC for employed four classifiers

Number of selected variablesTest AUC
L2-SVM80.832 (0.025)
L1-SVM70.843 (0.023)
L2-ROCLOG80.837 (0.025)
L1-ROCLOG250.856 (0.024)

Standard errors are reported in the parentheses over 100 independent repetitions.

  1. Agarwal S, Graepel T, Herbrich R, Har-Peled S, Roth D, and Jordan MI (2005). Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6, 393-425.
  2. Arzhaeva Y, Duin RP, and Tax D (2006). Linear model combining by optimizing the area under the ROC curve. Proceedings of 18th International Conference on Pattern Recognition (ICPR06). 4, IEEE, 119-122.
  3. Ataman K, Street WN, and Zhang Y (2006). Learning to rank by maximizing AUC with linear programming. The 2006 IEEE International Joint Conference on Neural Network Proceedings. , IEEE, 123-129.
  4. Bouveyron C and Brunet-Saumard C (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis, 71, 52-78.
  5. Brefeld U and Scheffer T (2005). AUC maximizing support vector learning. Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning. .
  6. Cl챕men챌on S, Depecker M, and Vayatis N (2013). An empirical comparison of learning algorithms for nonparametric scoring: The treerank algorithm and other methods. Pattern Analysis and Applications, 16, 475-496.
  7. Cl챕men챌on S, Lugosi G, and Vayatis N (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36, 844-874.
  8. Cohen MB, Elder S, Musco C, Musco C, and Persu M (2015). Dimensionality reduction for means clustering and low rank approximation. Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. , 163-172.
  9. Combettes PL and Wajs VR (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4, 1168-1200.
  10. Cortes C and Vapnik V (1995). Support-vector networks. Machine Learning, 20, 273-297.
  11. Duda RO, Hart PE, and Stork DG (1973). Pattern Classification and Scene Analysis, New York, Wiley.
  12. Egan JP (1975) Signal detection theory and ROC analysis . (No Title),
  13. Feldman V, Guruswami V, Raghavendra P, and Wu Y (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41, 1558-1590.
  14. Gao W and Zhou Z-H (2015). On the consistency of AUC pairwise optimization. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence. .
  15. Kim D and Shin SJ (2020). The regularization paths for the ROC-optimizing support vector machines. Journal of the Korean Statistical Society, 49, 264-275.
  16. Kim H, Sohn I, and Shin SJ (2021). Regularization paths of L1-penalized ROC curve-optimizing support vector machines. Stat, 10, e400.
  17. Lei Y and Ying Y (2021). Stochastic proximal AUC maximization. The Journal of Machine Learning Research, 22, 2832-2876.
  18. Liu M, Yuan Z, Ying Y, and Yang T (2019). Stochastic auc maximization with deep neural networks.
  19. Menon AK and Williamson RC (2016). Bipartite ranking: A risk-theoretic perspective. The Journal of Machine Learning Research, 17, 6766-6867.
  20. Natole M, Ying Y, and Lyu S (2018). Stochastic proximal algorithms for AUC maximization. International Conference on Machine Learning, PMLR, 80, 3710-3719.
  21. Norton M and Uryasev S (2019). Maximization of auc and buffered auc in binary classification. Mathematical Programming, 174, 575-612.
  22. Rakotomamonjy A (2004). Optimizing area under Roc curve with SVMs. ROCAI, 71-80.
  23. Rockafellar RT (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14, 877-898.
  24. Rousseeuw PJ (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
  25. Thorndike RL (1953). Who belongs in the family?. Psychometrika, 18, 267-276.
  26. Tian Y, Shi Y, Chen X, and Chen W (2011). AUC maximizing support vector machines with feature selection. Procedia Computer Science, 4, 1691-1698.
  27. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58, 267-288.
  28. Uematsu K and Lee Y (2017). On theoretically optimal ranking functions in bipartite ranking. Journal of the American Statistical Association, 112, 1311-1322.
  29. Yang Z, Shen W, Ying Y, and Yuan X (2020). Stochastic AUC optimization with general loss. Communications on Pure & Applied Analysis, 19, 4191-4212.
  30. Ying Y, Wen L, and Lyu S (2016). Stochastic online AUC maximization. Advances in Neural Information Processing Systems, 29.
  31. Zhang X, Saha A, and Vishwanathan S (2012). Smoothing multivariate performance measures. The Journal of Machine Learning Research, 13, 3623-3680.
  32. Zhao P, Hoi SC, Jin R, and Yang T (2011) Online AUC maximization .
  33. Zhu J, Rosset S, Tibshirani R, and Hastie T (2003). 1-norm support vector machines. Advances in Neural Information Processing Systems, 16.