
In recent years, deep learning has shown remarkable performance in various fields such as image recognition (Tan and Le, 2019) and speech recognition (Saon et al., 2017).
Despite these outstanding performances, the computational cost of deep learning models, due to their complexity and large number of parameters, has made them difficult to deploy on devices such as smartphones (Han et al., 2015).
Many techniques have been studied and developed to resolve the trade-off between accuracy and computation time in deep learning. Recently, one of the most popular techniques for model compression is knowledge distillation (KD). The basic idea of KD is to compress a model by transferring the knowledge from a pre-trained, large, and complex model, the teacher model, to a relatively small model, the student model (Hinton et al., 2015).
Hinton et al. (2015) showed that the softened class probabilities produced by the teacher model carry richer information than hard labels and can effectively guide the training of the student model. Park et al. (2019) extended this idea with relational knowledge distillation (RKD), which transfers the structural relations between embedding vectors, such as their distances and angles, rather than individual outputs.
When a well-performing teacher model is successfully compressed, computers or small devices such as smartphones can learn the features of the data with less computational cost and make predictions nearly as well as before compression. Since the main idea of RKD lies in reducing the structural differences between the vectors of the teacher model and the student model, further improvements are possible if we convey more structural information about the vectors. Comparing only a pair of vectors limits how much structural information can be extracted. However, by moving from a pairwise to an area-wise view, we can capture structural information in the form of a triangle by additionally considering the zero vector.
In this paper, we propose area-wise RKD as an extension of RKD. After generating the embedding vectors through the model, we extract relational knowledge through a function that measures the structural differences between the vectors. The proposed structural relationship information of embedding vectors is the area of the triangle composed of two embedding vectors and the zero vector. A classification model is constructed to reduce the difference in relational knowledge obtained from the teacher model and the student model. By adding the area-wise relational information to the distillation, we achieve improvements in model compression and generalization, the two main goals of KD.
We apply the proposed method to build classification models using image and audio data and compare the performance of the models with that of existing methods. In the experiment using audio data, we compare the proposed method with existing ones in terms of model compression. In the experiment using image data, we set the teacher and student models to have an identical structure to check whether the proposed area-wise RKD can increase the generalization performance more than the existing methods.
Overall, the main contributions of this paper are:
- We propose area-wise RKD to convey additional structural relational information between vectors when training the model. We achieve improvements in KD for the student model by using the proposed method combined with the existing RKD.
- We propose a loss function that can adjust the ratio of the relational information between the teacher model and the student model.
- We provide extensive experiments comparing the performance of different methods for model compression and generalization on both image and audio data.
The rest of this paper is organized as follows. Section 2 reviews some related works in detail. Section 3 presents the proposed method with a new loss function. Section 4 provides the details of the datasets, the preprocessing process, the models used as teacher and student models, and the experimental results. Finally, in Section 5, we draw conclusions while discussing future research directions.
In KD, we first train a large and complex deep neural network model, and then, when training a small model, we pass the hidden or output layer values of the teacher model to the relatively small model and set those values as its target values. For example, the large and complex model may be an ensemble of multiple trained models or a single large model trained with strong regularization such as dropout. By training large and complex models first, we can filter out redundant information from the training data and transfer the distilled information to smaller models, which makes the models more efficient for deployment (Hinton et al., 2015).
In general, the loss function used to train a model reflects as closely as possible whether the data belong to their true classes. But during training, we also need to consider the predictive performance of the model, i.e., how well it generalizes to new data. This requires information about whether the generalization process is going in the right direction, which is usually difficult to obtain. In the KD process, however, the class probabilities estimated by the large and complex model during training can be used to transfer its way of generalizing to a smaller model (Hinton et al., 2015).
Hinton et al. (2015) soften the class probabilities of the teacher model using a temperature-scaled softmax:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$

where $z_i$ is the logit for class $i$ and $T$ is the temperature. A higher $T$ produces a softer probability distribution over classes, and the student model is trained to match these softened probabilities.
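As a minimal illustration (ours, not the original implementation), the softened softmax can be written in a few lines of NumPy; the example temperature 1.5 matches the value used in the experiments below:

```python
import numpy as np

def softened_softmax(logits, temperature=1.0):
    """Temperature-scaled softmax from Hinton et al. (2015).

    A higher temperature exposes the teacher's relative confidence
    across the wrong classes instead of a near-one-hot output.
    """
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([5.0, 2.0, 1.0])
print(softened_softmax(logits, 1.0))   # sharp distribution
print(softened_softmax(logits, 1.5))   # softer distribution
```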
Figure 2 shows an example of training results on images labeled 5 in the MNIST data set (LeCun et al., 1998).
Furlanello et al. (2018) proposed born-again networks (BAN), a self-distillation approach in which the teacher and the student share the same architecture and the student of one generation serves as the teacher of the next.
In BAN, the student model is trained using the values of the output layer of the teacher model as targets for its own output layer. Experiments have shown that student models obtained over multiple generations can improve generalization and even outperform the teacher model. Moreover, an even better model was obtained when the student models were averaged into an ensemble.
Figure 3 shows the structure of a BAN, where the model of generation $k-1$ acts as the teacher for the student of generation $k$.
The final ensemble model is based on the average of the probability values of the student models to classify the data:

$$\hat{p}(x) = \frac{1}{K} \sum_{k=1}^{K} p_k(x),$$

where $p_k(x)$ is the class-probability vector predicted by the $k$-th student model and $K$ is the number of generations.
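A short sketch of this ensemble step, assuming the per-generation probability outputs have already been collected into one array:

```python
import numpy as np

def ban_ensemble_predict(student_probs):
    """Average the class probabilities of K student generations.

    student_probs: array of shape (K, n_samples, n_classes), one slice
    per generation, following the averaged-ensemble equation above.
    """
    mean_probs = np.mean(student_probs, axis=0)  # \hat{p}(x)
    return np.argmax(mean_probs, axis=-1)        # predicted class per sample
```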
Park et al. (2019) proposed relational knowledge distillation (RKD), which transfers the mutual relations among data examples, expressed through their embedding vectors, from the teacher model to the student model instead of matching individual outputs.
Figure 4 shows how the conventional KD and RKD techniques are applied, where $t_i$ and $s_i$ denote the embedding vectors produced by the teacher and the student for example $x_i$. RKD minimizes

$$\mathcal{L}_{\mathrm{RKD}} = \sum_{(x_1, \dots, x_n)} \ell\big(\psi(t_1, \dots, t_n),\, \psi(s_1, \dots, s_n)\big),$$

where $\psi$ is a relational potential function that measures the structural relation of an $n$-tuple of embeddings and $\ell$ is a loss that penalizes the difference between the teacher's and student's relations.
After obtaining relational information from the teacher and student models, the student model is trained to reduce the information gap using the Huber loss:

$$l_\delta(x, y) = \begin{cases} \frac{1}{2}(x - y)^2 & \text{for } |x - y| \le \delta, \\ \delta\,|x - y| - \frac{1}{2}\delta^2 & \text{otherwise.} \end{cases}$$

The Huber loss is commonly used for robust regression: it behaves like the squared loss for residuals smaller than $\delta$ and like a linear loss for larger residuals, which makes it less sensitive to outliers.
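For concreteness, a NumPy version of the Huber loss as defined above; $\delta = 1$ matches the setting used in the experiments:

```python
import numpy as np

def huber_loss(x, y, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(x - y)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)
```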
The distance-wise distillation loss penalizes differences in the pairwise distances between embeddings:

$$\mathcal{L}_{\mathrm{RKD\text{-}D}} = \sum_{(x_i, x_j)} l_\delta\big(\psi_D(t_i, t_j),\, \psi_D(s_i, s_j)\big), \qquad \psi_D(t_i, t_j) = \frac{1}{\mu}\,\lVert t_i - t_j \rVert_2,$$

where $\mu$ is the mean distance over all pairs in the mini-batch, which normalizes the difference in scale between the teacher and student embedding spaces. The angle-wise distillation loss is defined analogously using the potential $\psi_A(t_i, t_j, t_k) = \cos \angle t_i t_j t_k$, the cosine of the angle formed by a triplet of embeddings.
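The two potentials can be rendered directly in NumPy; this is a straightforward sketch of the definitions above, not the authors' implementation:

```python
import numpy as np

def distance_potentials(emb):
    """Normalized pairwise distances psi_D for a batch of embeddings.

    emb: (n, d) array. Returns the (n, n) matrix of pairwise distances
    divided by their mean over distinct pairs, as in Park et al. (2019).
    """
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    mu = dist[np.triu_indices_from(dist, k=1)].mean()  # mean over i < j
    return dist / mu

def angle_potential(ti, tj, tk, eps=1e-12):
    """Cosine of the angle at tj formed by the triplet (ti, tj, tk)."""
    u = (ti - tj) / (np.linalg.norm(ti - tj) + eps)
    v = (tk - tj) / (np.linalg.norm(tk - tj) + eps)
    return np.dot(u, v)
```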
RKD has thus been proposed as a new way to extract structural relationship information from the outputs. However, since only distance and angle information is conveyed through the loss function, we believe that additional information is needed to better preserve the structural relationships.
In this paper, we propose area-wise RKD as an extension of RKD. The loss function of the area-wise RKD is defined as:

$$\mathcal{L}_{\mathrm{RKD\text{-}Area}} = \sum_{(x_i, x_j)} l_\delta\big(\psi_{\mathrm{Area}}(t_i, t_j),\, \psi_{\mathrm{Area}}(s_i, s_j)\big),$$

where $l_\delta$ is the Huber loss and $\psi_{\mathrm{Area}}$ is the area-wise relational potential, the area of the triangle formed by two embedding vectors and the zero vector:

$$\psi_{\mathrm{Area}}(t_i, t_j) = \frac{1}{2}\,\lVert t_i \rVert_2\, \lVert t_j \rVert_2 \sin \theta_{ij},$$

where $\theta_{ij}$ is the angle between $t_i$ and $t_j$.
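A minimal NumPy sketch of the proposed area potential and the resulting pairwise loss under the definitions above; the pair enumeration over the mini-batch is our assumption:

```python
import numpy as np

def area_potential(ti, tj, eps=1e-12):
    """Area of the triangle spanned by ti, tj, and the zero vector:
    0.5 * |ti| * |tj| * sin(theta_ij)."""
    ni, nj = np.linalg.norm(ti), np.linalg.norm(tj)
    cos = np.dot(ti, tj) / (ni * nj + eps)
    sin = np.sqrt(max(0.0, 1.0 - cos ** 2))  # clamp for numerical safety
    return 0.5 * ni * nj * sin

def huber(x, y, delta=1.0):
    r = abs(x - y)
    return 0.5 * r ** 2 if r <= delta else delta * r - 0.5 * delta ** 2

def area_rkd_loss(teacher_emb, student_emb, delta=1.0):
    """Area-wise RKD loss averaged over all distinct pairs in a mini-batch."""
    n = len(teacher_emb)
    losses = [
        huber(area_potential(teacher_emb[i], teacher_emb[j]),
              area_potential(student_emb[i], student_emb[j]), delta)
        for i in range(n) for j in range(i + 1, n)
    ]
    return float(np.mean(losses))
```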
Training is then conducted by reducing the difference between the areas of the triangles obtained from the teacher model and the student model. Figure 5 illustrates how the embedding vectors obtained after the training data pass through the teacher model and the student model form triangles in the embedding space. In conventional RKD, structural differences in the embeddings are learned by reducing distance differences or angle differences separately. Adding the area differences is expected to provide richer information on the structural differences between the teacher model and the student model and to yield a better classification model.
To construct the student model, the output layer values are used to convey knowledge about the class probabilities through a cross-entropy loss. Relational information is additionally conveyed using the embedding values of the flattened layer before the output layer, and each relational term can be scaled by multiplying it by a weight:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_D\, \mathcal{L}_{\mathrm{RKD\text{-}D}} + \lambda_A\, \mathcal{L}_{\mathrm{RKD\text{-}A}} + \lambda_{\mathrm{Area}}\, \mathcal{L}_{\mathrm{RKD\text{-}Area}},$$

where $\lambda_D$, $\lambda_A$, and $\lambda_{\mathrm{Area}}$ are tunable weights that adjust the ratio of the distance-, angle-, and area-wise relational information.
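Schematically, the combined objective is a plain weighted sum; the variable names and default weights below are only illustrative (they mirror the values reported in parentheses in the result tables):

```python
def total_loss(ce_loss, d_loss, a_loss, area_loss,
               lam_d=40.0, lam_a=20.0, lam_area=40.0):
    """Cross-entropy plus weighted RKD terms; set a weight to 0 to drop a term."""
    return ce_loss + lam_d * d_loss + lam_a * a_loss + lam_area * area_loss
```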
UrbanSound8K is a dataset of 8,732 sound clips collected from cities (Salamon et al., 2014). Each clip belongs to one of the 10 classes listed in Table 1, and the data are distributed across 10 predefined folds (Table 2).
CIFAR-100 has the same format as CIFAR-10, with 60,000 color images, but is divided into 100 classes. Each class has 600 images, split into 500 training images and 100 test images. Figure 8 shows some images of different classes.
In this study, image data are used for experiments to improve the generalization performance of the model and audio data are used for model compression experiments.
Audio recognition tasks require extracting valid features from the input signal. Acoustic characteristics of audio can be obtained using the Python package LibROSA (McFee et al., 2015).
A spectrogram is a way to visualize an acoustic signal: the sound signal is divided into short sections, and the values obtained through the fast Fourier transform (FFT) of each section give the frequency and amplitude content over time. A chromagram visualizes pitch and is widely used to analyze music; it has the advantage of capturing melody and harmony characteristics from sound data. After applying a short-time Fourier transform to the sound signal, a Constant-Q chromagram or chroma energy normalized statistics (CENS) can be obtained by further conversion. Figure 9 shows an example of audio features extracted using these methods.
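The paper does not report the exact extraction parameters; a minimal LibROSA sketch of the four feature types with default settings and a hypothetical file path:

```python
import librosa

# Hypothetical file path; UrbanSound8K clips are at most 4 seconds long.
y, sr = librosa.load("dog_bark.wav", duration=4.0)

mel = librosa.feature.melspectrogram(y=y, sr=sr)     # mel spectrogram
chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # STFT-based chromagram
cqt_chroma = librosa.feature.chroma_cqt(y=y, sr=sr)  # Constant-Q chromagram
cens = librosa.feature.chroma_cens(y=y, sr=sr)       # CENS features

# The feature maps can then be stacked along a channel axis as CNN input.
```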
The feature values extracted from the audio are stacked in multiple layers and used as input to a convolutional neural network (CNN) (Piczak, 2015). Figure 10 shows the structure of the CNN model. For audio data less than 4 seconds in length, zero padding is applied to match the input length of the data. Data augmentation is done by adding Gaussian noise before extracting acoustic features to double the amount of data. Figure 11 shows an example of data augmentation seen in a time-amplitude graph before and after adding Gaussian noise.
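A sketch of the two preprocessing steps described above, with an assumed noise level since the paper does not report one:

```python
import numpy as np

def pad_to_length(y, sr, seconds=4.0):
    """Zero-pad a waveform shorter than the target length (4 s here)."""
    target = int(sr * seconds)
    if len(y) < target:
        y = np.pad(y, (0, target - len(y)))
    return y

def add_gaussian_noise(y, std=0.005):
    """Return an augmented copy of the waveform with additive Gaussian noise.

    std=0.005 is an assumed noise level; the paper does not report it.
    """
    return y + np.random.normal(0.0, std, size=y.shape)
```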
In this paper, we use a vanilla CNN and a residual neural network (ResNet) model (He et al., 2016) for the teacher and student models.
ResNet introduces skip connections between hidden layers. If we assume that $H(x)$ is the desired underlying mapping of a block, the stacked layers are trained to fit the residual mapping $F(x) = H(x) - x$, and the block outputs $F(x) + x$. This makes very deep networks easier to optimize, since an identity mapping can be obtained simply by driving the residual toward zero.
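A minimal tf.keras sketch of a basic residual block under this formulation; the layer sizes and projection rule are generic assumptions, not the exact blocks of the ResNet-18/56 models used here:

```python
import tensorflow as tf

def residual_block(x, filters, stride=1):
    """Basic residual block: the stacked layers learn F(x) = H(x) - x,
    and the skip connection adds x back so the block outputs F(x) + x."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.BatchNormalization()(y)
    if stride != 1 or int(shortcut.shape[-1]) != filters:
        # 1x1 projection so the shapes match before the addition
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride)(shortcut)
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.ReLU()(y)
```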
In the experiment using CIFAR-100, both the teacher model and the student model are set to ResNet-56 to classify images. After training the teacher model, the softmax function value of the output layer is set as the target value of the student model to perform RKD.
In the experiment with UrbanSound8K, the teacher model is set to ResNet-18 and the student model uses a vanilla CNN with 2 convolutional layers. RKD is further applied using the flattened layer before the output layer. We also introduce Hinton KD to the output layer together with RKD to see if the model compression performance is improved.
For CIFAR-100, we randomly split the 50,000 training images into 45,000 for training and 5,000 for validation, and use the 10,000 test images to measure test accuracy. For UrbanSound8K, we report the average validation accuracy over 10-fold cross-validation.
In the experiment using CIFAR-100, Nesterov momentum is used as the optimizer (Nesterov, 1983). The batch size is 256 and the weight decay is 0.0001. The initial learning rate is set to 0.001 and is multiplied by 0.1 from the 60th epoch.
In the experiment using UrbanSound8K, Nesterov momentum is used as the optimizer, the batch size is set to 8, and the weight decay is 0.0001. The initial learning rate is set to 0.1 and is multiplied by 0.1 every 10 epochs. The $\delta$ of the Huber loss is set to 1.
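For illustration, the UrbanSound8K settings map onto tf.keras roughly as follows; the momentum coefficient 0.9 is an assumption (only the use of Nesterov momentum is reported), and the weight decay of 1e-4 would be applied separately, e.g. through kernel regularizers on the layers:

```python
import tensorflow as tf

# Nesterov momentum optimizer with the reported initial learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9,
                                    nesterov=True)

def schedule(epoch):
    """Initial rate 0.1, multiplied by 0.1 every 10 epochs."""
    return 0.1 * (0.1 ** (epoch // 10))

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
```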
All models are constructed using TensorFlow v1.14 (Abadi et al., 2016).
In the experiment using audio data, KD was applied to compress the model; the results are summarized in Table 3. The teacher model is ResNet-18 and the student model is a CNN with 2 convolutional layers, showing that a teacher with 18 hidden layers was successfully compressed into a student with 2 convolutional layers. The teacher model uses 178,466 weight parameters and the student model uses 116,538, so model compression through KD reduced the number of weight parameters by about 35%.

The ResNet-18 classification model achieved a 10-fold cross-validation accuracy of 68.16%. With ResNet-18 as the teacher and the CNN with two convolutional layers as the student, the student model achieved 63.70% accuracy when compressing the model using the conventional KD. We then introduced several combinations of RKD. When only angle-wise RKD was used, the accuracy was 64.46%, about 0.8 percentage points better than the baseline model. When two RKD methods were used together, the accuracy of the classification model was higher than when only one method was used: the models using distance and angle, distance and area, and angle and area achieved 65.12%, 65.13%, and 65.28%, respectively. When all three methods (distance, angle, and area) were applied together, the accuracy was 64.27%, better than the baseline but lower than applying only the relational knowledge of angle and area together.

Additional experiments compared the model compression performance when RKD was applied together with Hinton KD. With the temperature set to 1.5, the accuracy of the student model was 64.37%. When angle-wise and area-wise RKD were applied together with Hinton KD, the accuracy was the highest at 65.53%.
The experimental results using CIFAR-100 are shown in Table 4. ResNet-56 alone showed an accuracy of 59.72%. After performing BAN for 3 generations, the ensemble of the 3 student models reached an accuracy of 64.30%, much higher than that of ResNet-56. Several RKD variants were then introduced into the BAN procedure to improve the performance of the model; in these experiments, all weights of the RKD loss terms were set to 20.
We first considered a single RKD term for model training and then examined the performance of the averaged ensemble model. When RKD based on distance, angle, and area was applied individually, the accuracies of the ensemble models were 64.44%, 65.36%, and 64.28%, respectively. The model using angle-wise RKD was therefore the best of the three, improving the accuracy by more than 1 percentage point over the baseline ensemble.
When two types of RKD were applied together, the accuracies of the models trained using RKD based on distance and angle, distance and area, and angle and area were 64.32%, 63.86%, and 64.82%, respectively. Finally, when all three RKDs were applied simultaneously, the accuracy was 64.50%.
The results show that the best performance improvement was obtained using angle-wise RKD alone, and the second-best performance was obtained using angle-wise and area-wise RKD together.
In this study, we proposed area-wise RKD with a new loss function. This method can reduce the difference in relational information between the teacher model and the student model by obtaining the area information generated by embedding vectors of the data. In addition to the existing distance- and angle-wise RKDs, it is expected to provide richer information on the structural differences between teacher and student models and further improve the performance of the student model.
Existing methods have tried to improve the performance of the student model by transferring information about the distances between embedding vectors, or the angles they form, from the teacher model to the student model. The proposed method, in contrast, can be viewed as conveying both kinds of information at once by transferring the area of the triangle created by the embedding vectors. For example, even if the distance between two embedding vectors is the same, a small angle and a large angle between them represent different information.
Conversely, even if the angle between embedding vectors is the same, a short distance and a long distance between the vectors represent different information. By considering the area of a triangle, which reflects both distance and angle, we can expect the effect of integrating the information delivered by the two existing methods.
To verify the performance of area-wise RKD, we conducted two experiments using audio and image data. In the experiment using audio data, model compression was performed using the proposed method along with the existing methods. The results show that the performance of model compression is best when both area- and angle-wise RKDs are applied together.
In the experiment using image data, we considered BAN with various RKDs to improve the generalization performance of the model. As a result, the performance improvement was the best when angle-wise RKD was applied, and the second-best accuracy was obtained when area-wise and angle-wise RKD were used together.
For future research, we plan to apply the proposed method to data in other domains, such as natural language, video, and time series data. In addition, we could consider applying the proposed method to RNN-based models, Seq2Seq models, and attention-based models.
Class information of UrbanSound8K
Class ID | Sound Class |
---|---|
0 | air conditioner |
1 | car horn |
2 | children playing |
3 | dog bark |
4 | drilling |
5 | engine idling |
6 | gun shot |
7 | jackhammer |
8 | siren |
9 | street music |
The number of samples per fold
Fold | # of Samples |
---|---|
1 | 873 |
2 | 888 |
3 | 925 |
4 | 990 |
5 | 936 |
6 | 823 |
7 | 838 |
8 | 806 |
9 | 816 |
10 | 837 |
Results of model compression experiments using UrbanSound8K
Model | Accuracy (%) |
---|---|
ResNet-18 (teacher) | 68.16 |
2 Conv CNN (baseline student) | 63.70 |
CNN+angle (50) | 64.46 |
CNN+distance+angle (40, 10) | 65.12 |
CNN+distance+area (20, 20) | 65.13 |
CNN+angle+area (40, 20) | 65.28 |
CNN+distance+angle+area (40, 20, 40) | 64.27 |
CNN+Hinton (T=1.5) | 64.37 |
CNN+Hinton+angle+area (10, 50) | 65.53 |
The values in parentheses refer to weights of loss functions for RKD.
Results of generalization experiments using CIFAR-100
Model | Accuracy (%) |
---|---|
ResNet-56 | 59.72 |
BAN ensemble (baseline) | 64.30 |
BAN+distance ensemble | 64.44 |
BAN+angle ensemble | 65.36 |
BAN+area ensemble | 64.28 |
BAN+distance+angle ensemble | 64.32 |
BAN+distance+area ensemble | 63.86 |
BAN+angle+area ensemble | 64.82 |
BAN+distance+angle+area ensemble | 64.50 |