In binary classification, we are given a set of data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$ is a covariate vector and $y_i \in \{-1, 1\}$ is its class label, and the goal is to learn a classification rule that predicts the label of a new observation.
However, we often encounter cases where only a few observations have known labels and most do not. That is, we are given two sets of data, labeled data $\mathcal{L} = \{(x_i, y_i)\}_{i=1}^{n}$ and unlabeled data $\mathcal{U} = \{x_j\}_{j=n+1}^{n+m}$, and we wish to exploit both in learning the classification rule. This setting is called semi-supervised classification.
In this article, we provide a selective but systematic review of existing semi-supervised classification methods, and then propose a simple but efficient semi-supervised classification algorithm based on the support vector machine (SVM; Vapnik, 2015). To be more precise, we apply a density-based clustering algorithm to impute the labels of the unlabeled data and then train a standard SVM on the combined data; for nonlinear problems, the clustering is performed on kernel principal component scores.
The rest of this article is organized as follows. In Section 2, we provide a selective overview of semi-supervised learning methods. In Section 3, the proposed algorithm is introduced. Numerical studies and concluding remarks follow.
The self-training algorithm (Yarowsky, 1995) first trains a classifier on the labeled data and then iteratively refits it after imputing labels for the unlabeled data with the current classifier.
Self-training Algorithm
1. Train a classifier on the labeled data set $\mathcal{L}$, denoted by $\hat{f}^{(0)}$.
2. For $t = 1, 2, \ldots$:
2-1. Predict the labels of the unlabeled data $\mathcal{U}$ with $\hat{f}^{(t-1)}$.
2-2. Train a classifier on the combined sample $\mathcal{L} \cup \hat{\mathcal{U}}$, denoted by $\hat{f}^{(t)}$.
2-3. If the estimated labels become stable, stop.
3. Return the final classifier $\hat{f}^{(t)}$.
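The loop above can be sketched in a few lines of NumPy. The base learner here is a nearest-centroid classifier chosen purely for brevity; the original algorithm is agnostic to that choice, and the paper itself uses an SVM.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unl, max_iter=20):
    """Self-training: impute labels for X_unl, refit, repeat until stable.

    A nearest-centroid classifier stands in for the base learner (an
    illustrative assumption, not the paper's SVM).
    """
    classes = np.unique(y_lab)
    y_unl = None
    for _ in range(max_iter):
        # Step 2-2: refit on labeled data plus current imputed labels.
        if y_unl is None:
            X_fit, y_fit = X_lab, y_lab
        else:
            X_fit = np.vstack([X_lab, X_unl])
            y_fit = np.concatenate([y_lab, y_unl])
        centroids = np.stack([X_fit[y_fit == k].mean(axis=0) for k in classes])
        # Step 2-1: predict labels for the unlabeled points.
        dists = np.linalg.norm(X_unl[:, None, :] - centroids[None, :, :], axis=2)
        y_new = classes[dists.argmin(axis=1)]
        # Step 2-3: stop once the imputed labels stabilize.
        if y_unl is not None and np.array_equal(y_new, y_unl):
            break
        y_unl = y_new
    return y_unl
```

With well-separated classes the imputed labels typically stabilize after one or two passes.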
A generative model tries to uncover the data generating process, that is, the joint distribution of $(X, Y)$.
We focus on the Gaussian mixture model (GMM), the most popular choice in the statistical community. Without loss of generality, we consider binary classification and assume $X \mid Y = k \sim N(\mu_k, \Sigma_k)$ with $P(Y = 1) = \pi$, so that the marginal density of $X$ is the two-component mixture
$$p(x) = \pi\,\phi(x; \mu_1, \Sigma_1) + (1 - \pi)\,\phi(x; \mu_0, \Sigma_0),$$
where $\phi(\cdot; \mu, \Sigma)$ denotes the multivariate normal density with mean $\mu$ and covariance $\Sigma$.
The GMM is even more prevalent in unsupervised settings, where no labeled data are available. Parameter estimation is not difficult thanks to the well-known Expectation-Maximization (EM) algorithm (Dempster et al., 1977).
Similar to the unsupervised case, the EM algorithm can be employed to find the MLE; the only difference is that the responsibilities of the labeled observations are fixed at their observed classes. Given the MLE of the parameters, a new observation is classified to the class with the larger estimated posterior probability.
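A minimal sketch of this semi-supervised EM is given below for a univariate two-component mixture with unequal variances. The initialization and iteration count are illustrative assumptions; the labeled points keep 0/1 responsibilities throughout, while unlabeled points receive posterior responsibilities in each E-step.

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def ssl_gmm_em(x_lab, y_lab, x_unl, n_iter=100):
    """EM for a 1-D two-component GMM with partially observed labels.

    Labeled responsibilities are fixed at their observed classes; only the
    unlabeled responsibilities are updated in the E-step.
    """
    mu = np.array([x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()])
    sigma = np.array([x_lab.std() + 1e-3] * 2)
    pi = np.array([0.5, 0.5])
    x = np.concatenate([x_lab, x_unl])
    r_lab = np.stack([(y_lab == 0).astype(float),
                      (y_lab == 1).astype(float)], axis=1)
    for _ in range(n_iter):
        # E-step: posterior P(Y = k | x) for the unlabeled observations.
        dens = np.stack([pi[k] * norm_pdf(x_unl, mu[k], sigma[k])
                         for k in (0, 1)], axis=1)
        r_unl = dens / dens.sum(axis=1, keepdims=True)
        r = np.vstack([r_lab, r_unl])
        # M-step: responsibility-weighted means, variances, proportions.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / nk.sum()
    return mu, sigma, pi
```

Classification then assigns a new point to the component with the larger estimated posterior.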
One advantage of the generative model is that it uncovers the data generating process of $(X, Y)$.
Semi-supervised support vector machine (S$^{3}$VM; Chapelle et al., 2008) extends the SVM by requiring the decision boundary to avoid dense regions of unlabeled data. For a decision function $f(x) = w^\top x + b$, it minimizes
$$\sum_{i=1}^{n} \max\{0, 1 - y_i f(x_i)\} + \lambda \|w\|^2 + \lambda_u \sum_{j=n+1}^{n+m} \max\{0, 1 - |f(x_j)|\},$$
where the first term is the usual hinge loss on the labeled data, the last term (often called the hat loss) penalizes unlabeled observations falling inside the margin, and $\lambda, \lambda_u$ are tuning parameters.
Despite its conceptual simplicity, the resulting objective is non-convex and hard to optimize. The combinatorial approach treats the unknown labels of the unlabeled data as binary variables to be optimized over, which quickly becomes computationally demanding.
On the other hand, the continuous approach replaces the unknown labels with a smooth surrogate, $\max\{0, 1 - |f(x)|\}$ (the so-called hat loss), and minimizes the resulting objective with respect to $(w, b)$ directly, typically by gradient-based methods.
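The continuous strategy can be illustrated with a crude subgradient descent on a linear S$^{3}$VM objective. This is a sketch, not the optimizers studied in the literature: the tuning parameters, learning rate, and iteration count below are illustrative assumptions.

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, lam=1e-2, lam_u=0.5):
    """Linear S^3VM objective: hinge loss on labeled points plus the
    hat loss max(0, 1 - |f(x)|) pushing unlabeled points off the margin."""
    f_lab = X_lab @ w + b
    f_unl = X_unl @ w + b
    hinge = np.maximum(0.0, 1.0 - y_lab * f_lab).mean()
    hat = np.maximum(0.0, 1.0 - np.abs(f_unl)).mean()
    return hinge + lam_u * hat + lam * (w @ w)

def fit_s3vm(X_lab, y_lab, X_unl, lr=0.05, n_iter=500, lam=1e-2, lam_u=0.5):
    """Plain subgradient descent on the non-convex objective."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X_lab.shape[1])
    b = 0.0
    n_l, n_u = len(y_lab), len(X_unl)
    for _ in range(n_iter):
        f_lab = X_lab @ w + b
        f_unl = X_unl @ w + b
        # Subgradient of the averaged hinge loss (active where margin < 1).
        m_l = (1.0 - y_lab * f_lab) > 0
        gw = -(y_lab[m_l][:, None] * X_lab[m_l]).sum(axis=0) / n_l
        gb = -y_lab[m_l].sum() / n_l
        # Subgradient of the averaged hat loss (active where |f| < 1).
        m_u = (1.0 - np.abs(f_unl)) > 0
        s = np.sign(f_unl[m_u])
        gw -= lam_u * (s[:, None] * X_unl[m_u]).sum(axis=0) / n_u
        gb -= lam_u * s.sum() / n_u
        gw += 2.0 * lam * w
        w = w - lr * gw
        b = b - lr * gb
    return w, b
```

Because the hat loss is non-convex, the solution depends on the initialization; practical implementations anneal $\lambda_u$ or use multiple restarts.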
Both the self-training algorithm and the S$^{3}$VM introduced in Section 2 require the cluster assumption, which states that two points in the same cluster are likely to belong to the same class.
The cluster assumption implies that observations in $\mathcal{U}$ should be clustered, which motivates a simple clustering step to predict their labels. There are several popular clustering methods in statistical learning. Among many others, the density-based spatial clustering of applications with noise (DBSCAN; Ester et al., 1996) is employed here, as it does not require the number of clusters to be specified in advance.
After clustering, for each cluster, we count the frequency of the observed labels and then assign the class label with the maximum frequency. That is, for the $c$-th cluster $C_c$, the imputed label is
$$\hat{y}_{C_c} = \arg\max_{k} \sum_{i \in C_c \cap \mathcal{L}} I(y_i = k),$$
where $I(\cdot)$ denotes the indicator function.
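The cluster-then-vote step can be sketched as follows. The DBSCAN implementation is a deliberately minimal, brute-force version (quadratic distance matrix, illustrative `eps` and `min_pts`), not a production one.

```python
import numpy as np
from collections import Counter

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: density-reachable points share a cluster id;
    points in no dense region are marked as noise (-1)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    nbrs = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cid = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(nbrs[i]) < min_pts:
            continue  # noise for now (may be claimed by a cluster later)
        labels[i] = cid
        queue = list(nbrs[i])
        while queue:
            j = queue.pop()
            if not visited[j]:
                visited[j] = True
                if len(nbrs[j]) >= min_pts:
                    queue.extend(nbrs[j])  # expand from core points only
            if labels[j] == -1:
                labels[j] = cid
        cid += 1
    return labels

def impute_by_cluster(cluster_ids, labeled_idx, y_lab):
    """Assign each cluster the majority label among its labeled members."""
    votes = {}
    for idx, y in zip(labeled_idx, y_lab):
        c = cluster_ids[idx]
        if c != -1:
            votes.setdefault(c, Counter())[y] += 1
    return {c: v.most_common(1)[0][0] for c, v in votes.items()}
```

Unlabeled points then inherit the majority label of their cluster, after which any supervised classifier can be trained on the augmented data.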
Finally, we can use both $\mathcal{L}$ and $\hat{\mathcal{U}}$ for learning the classification rule, since $\hat{\mathcal{U}}$ is now properly labeled. The proposed algorithm is summarized in the following. Due to the cluster assumption (Chapelle et al., 2006), the imputed labels are expected to be reliable.
Algorithm 1 - Linear learning
There are several minor issues to complete the algorithm, which we discuss in the following. First, we need to choose a proper kernel function for both KPCA and the SVM. It is natural to use the same kernel for both, and we employ the radial basis kernel $K(x, x') = \exp\{-\|x - x'\|^2 / (2\sigma^2)\}$.
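The KPCA step with the radial basis kernel can be sketched as below; the bandwidth `sigma` and the number of components are illustrative assumptions, and in practice they would be tuned.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Radial basis kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca(X, n_components=2, sigma=1.0):
    """Kernel PCA scores: double-center the kernel matrix, take the
    leading eigenvectors, and scale coefficients by 1/sqrt(eigenvalue)."""
    K = rbf_kernel(X, X, sigma)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                       # double-centering in feature space
    vals, vecs = np.linalg.eigh(Kc)      # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]
    alphas = vecs[:, :n_components] / np.sqrt(np.maximum(vals[:n_components], 1e-12))
    return Kc @ alphas                   # projected scores of the inputs
```

The resulting low-dimensional scores are then fed to the clustering step, which is the nonlinear variant of the algorithm.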
Algorithm 2 - Nonlinear learning
In this section, we conduct several simulation studies to evaluate the prediction performance of the proposed algorithm. The existing methods reviewed in Section 2 are considered as competitors: the self-training algorithm, the generative model based on the EM algorithm, and the supervised SVM trained on the labeled data only. We use the prediction accuracy on an independent test sample as the performance measure.
We generate training samples of size $n \in \{100, 500, 1000\}$ together with an independent test sample of size 500, and randomly remove the labels of a proportion $r \in \{0.80, 0.90, 0.95\}$ of the training observations.
For linear classification, we consider the following scenarios:
A1 (Binary classification) Let
A2 (Multi-class classification) Let
Table 1 contains the test prediction accuracy of the different methods, averaged over one hundred independent repetitions. The numbers in parentheses are the corresponding standard deviations. Although all methods show comparable performance, one can observe that the proposed method outperforms the others as the proportion of unlabeled data grows.
The following models are considered to evaluate the proposed nonlinear algorithm.
B1 (Binary classification) Let
B2 (Multi-class classification) Let
Table 2 shows the results for the nonlinear models. The generative model fails since its model assumption is violated, whereas the other methods perform reasonably well. Again, the proposed method outperforms all others, as in the linear cases.
We also compare the computing time of the proposed algorithm. Figure 2 depicts the average computing time in seconds over 10 repetitions for models (A1) and (B1), for different sample sizes $n$.
As a final showcase, we apply the proposed method to real data, starting with the well-known iris data.
We also apply the proposed method to the Wisconsin diagnostic breast cancer (WDBC) data, available from the UCI Machine Learning Repository and one of the most popular data sets for binary classification. The data set consists of 569 observations, of which 357 are benign and 212 are malignant, with 30 real-valued covariates. To generate unlabeled data, we randomly choose 90% of the samples and remove their labels. We then randomly split the data into training and test sets of equal size.
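This label-masking and splitting scheme can be sketched as follows; the function name and the NaN convention for a removed label are illustrative choices, and the test half keeps its true labels since they are needed only for evaluation.

```python
import numpy as np

def make_semisupervised_split(X, y, unlabeled_frac=0.9, seed=0):
    """Hide labels for a random fraction of samples, then split into
    equal-sized training and test halves (as in the WDBC experiment).

    Hidden labels are marked with NaN; the test half returns true labels
    for evaluation purposes.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    y_obs = y.astype(float)                 # float copy so NaN is valid
    hide = rng.choice(n, size=int(unlabeled_frac * n), replace=False)
    y_obs[hide] = np.nan                    # NaN marks "label removed"
    perm = rng.permutation(n)
    half = n // 2
    tr, te = perm[:half], perm[half:]
    return (X[tr], y_obs[tr]), (X[te], y[te])
```

Any of the semi-supervised methods above can then be trained on the training half, treating NaN-labeled rows as the unlabeled set.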
Finally, we apply the proposed algorithm to the WDBC data, and Figure 2 depicts the test ROC curve of the proposed method. We also provide two more ROC curves from supervised SVMs: one trained on the original data without missing labels (denoted by oracle) and another trained on the labeled data only (denoted by supervised SVM). We remark that the oracle represents an ideal but unrealistic benchmark that no semi-supervised method can beat. Nevertheless, the proposed method shows a much better test ROC curve than the supervised SVM that uses the observed labels only.
In this paper, we proposed a simple algorithm for semi-supervised classification based on the SVM and the DBSCAN clustering algorithm, combined with kernel PCA for nonlinear classification.
Linear classification – averaged prediction accuracies over 100 independent repetitions under models A1 and A2. The numbers in parentheses are the corresponding standard deviations.
| Model | $n$ | Unlabeled prop. | Self-training | Generative model | SVM | Proposed method |
|---|---|---|---|---|---|---|
| A1 | 100 | 0.80 | 0.928 (0.02) | 0.927 (0.05) | 0.926 (0.03) | 0.926 (0.03) |
| A1 | 100 | 0.90 | 0.912 (0.05) | 0.897 (0.11) | 0.900 (0.05) | 0.922 (0.03) |
| A1 | 100 | 0.95 | 0.888 (0.05) | 0.841 (0.17) | 0.882 (0.05) | 0.903 (0.05) |
| A1 | 500 | 0.80 | 0.931 (0.01) | 0.928 (0.04) | 0.925 (0.01) | 0.927 (0.02) |
| A1 | 500 | 0.90 | 0.913 (0.05) | 0.896 (0.10) | 0.899 (0.04) | 0.914 (0.03) |
| A1 | 500 | 0.95 | 0.878 (0.04) | 0.834 (0.17) | 0.872 (0.03) | 0.900 (0.05) |
| A1 | 1000 | 0.80 | 0.932 (0.01) | 0.928 (0.04) | 0.927 (0.01) | 0.926 (0.01) |
| A1 | 1000 | 0.90 | 0.915 (0.05) | 0.898 (0.10) | 0.901 (0.04) | 0.917 (0.02) |
| A1 | 1000 | 0.95 | 0.877 (0.03) | 0.835 (0.17) | 0.870 (0.03) | 0.906 (0.04) |
| A2 | 100 | 0.80 | 0.898 (0.04) | 0.868 (0.08) | 0.881 (0.04) | 0.885 (0.04) |
| A2 | 100 | 0.90 | 0.818 (0.04) | 0.786 (0.13) | 0.821 (0.04) | 0.865 (0.04) |
| A2 | 100 | 0.95 | 0.784 (0.06) | 0.656 (0.11) | 0.789 (0.06) | 0.814 (0.08) |
| A2 | 500 | 0.80 | 0.897 (0.03) | 0.866 (0.07) | 0.888 (0.03) | 0.897 (0.02) |
| A2 | 500 | 0.90 | 0.857 (0.02) | 0.774 (0.12) | 0.851 (0.03) | 0.884 (0.04) |
| A2 | 500 | 0.95 | 0.811 (0.05) | 0.676 (0.11) | 0.804 (0.04) | 0.836 (0.07) |
| A2 | 1000 | 0.80 | 0.898 (0.03) | 0.886 (0.09) | 0.887 (0.03) | 0.897 (0.02) |
| A2 | 1000 | 0.90 | 0.859 (0.02) | 0.777 (0.12) | 0.852 (0.03) | 0.880 (0.04) |
| A2 | 1000 | 0.95 | 0.809 (0.04) | 0.677 (0.11) | 0.803 (0.04) | 0.850 (0.07) |
Nonlinear classification – averaged prediction accuracies over 100 independent repetitions under models B1 and B2. The numbers in parentheses are the corresponding standard deviations.
| Model | $n$ | Unlabeled prop. | Self-training | Generative model | SVM | Proposed method |
|---|---|---|---|---|---|---|
| B1 | 100 | 0.80 | 0.516 (0.06) | 0.506 (0.05) | 0.515 (0.05) | 0.584 (0.05) |
| B1 | 100 | 0.90 | 0.508 (0.06) | 0.499 (0.07) | 0.508 (0.06) | 0.565 (0.06) |
| B1 | 100 | 0.95 | 0.513 (0.06) | 0.508 (0.06) | 0.514 (0.06) | 0.557 (0.06) |
| B1 | 500 | 0.80 | 0.517 (0.04) | 0.502 (0.04) | 0.516 (0.04) | 0.577 (0.04) |
| B1 | 500 | 0.90 | 0.506 (0.04) | 0.499 (0.04) | 0.507 (0.04) | 0.557 (0.04) |
| B1 | 500 | 0.95 | 0.517 (0.04) | 0.504 (0.04) | 0.518 (0.04) | 0.550 (0.05) |
| B1 | 1000 | 0.80 | 0.514 (0.03) | 0.502 (0.03) | 0.514 (0.03) | 0.577 (0.03) |
| B1 | 1000 | 0.90 | 0.512 (0.03) | 0.502 (0.04) | 0.512 (0.03) | 0.564 (0.04) |
| B1 | 1000 | 0.95 | 0.514 (0.04) | 0.501 (0.04) | 0.514 (0.04) | 0.545 (0.04) |
| B2 | 100 | 0.80 | 0.517 (0.05) | 0.350 (0.08) | 0.523 (0.05) | 0.550 (0.13) |
| B2 | 100 | 0.90 | 0.487 (0.06) | 0.351 (0.08) | 0.487 (0.06) | 0.492 (0.13) |
| B2 | 100 | 0.95 | 0.443 (0.05) | 0.328 (0.06) | 0.446 (0.06) | 0.477 (0.07) |
| B2 | 500 | 0.80 | 0.521 (0.03) | 0.352 (0.07) | 0.523 (0.03) | 0.553 (0.10) |
| B2 | 500 | 0.90 | 0.488 (0.04) | 0.354 (0.07) | 0.489 (0.04) | 0.504 (0.10) |
| B2 | 500 | 0.95 | 0.455 (0.05) | 0.334 (0.04) | 0.456 (0.05) | 0.496 (0.08) |
| B2 | 1000 | 0.80 | 0.523 (0.03) | 0.353 (0.07) | 0.526 (0.02) | 0.564 (0.08) |
| B2 | 1000 | 0.90 | 0.492 (0.03) | 0.350 (0.07) | 0.493 (0.03) | 0.526 (0.09) |
| B2 | 1000 | 0.95 | 0.453 (0.05) | 0.334 (0.03) | 0.456 (0.04) | 0.479 (0.08) |
Results for the iris data – averaged prediction accuracies over 100 independent random partitions of the iris data. The numbers in parentheses are the corresponding standard deviations.
| Split (train:test) | Unlabeled prop. | Self-training | Generative model | SVM | Proposed method |
|---|---|---|---|---|---|
| 100:50 | 0.7 | 0.944 (0.04) | 0.668 (0.31) | 0.945 (0.04) | 0.937 (0.05) |
| 100:50 | 0.8 | 0.927 (0.05) | 0.567 (0.36) | 0.930 (0.05) | 0.939 (0.05) |
| 100:50 | 0.9 | 0.868 (0.17) | 0.523 (0.32) | 0.868 (0.11) | 0.872 (0.12) |
| 75:75 | 0.7 | 0.942 (0.04) | 0.594 (0.25) | 0.940 (0.03) | 0.925 (0.05) |
| 75:75 | 0.8 | 0.919 (0.05) | 0.530 (0.35) | 0.920 (0.04) | 0.926 (0.05) |
| 75:75 | 0.9 | 0.854 (0.15) | 0.515 (0.33) | 0.849 (0.10) | 0.875 (0.10) |
| 50:100 | 0.7 | 0.919 (0.06) | 0.564 (0.21) | 0.924 (0.05) | 0.922 (0.05) |
| 50:100 | 0.8 | 0.886 (0.12) | 0.490 (0.20) | 0.871 (0.10) | 0.887 (0.11) |
| 50:100 | 0.9 | 0.797 (0.18) | 0.500 (0.24) | 0.809 (0.10) | 0.840 (0.07) |