
Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Published online by Cambridge University Press: 17 November 2021

Hsing-Hung Chou*
Affiliation:
Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan
Ching-Te Chiu
Affiliation:
Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan; Institute of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Yi-Ping Liao
Affiliation:
Institute of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
*Corresponding author: Hsing-Hung Chou. Email: paul8301526@gmail.com

Abstract

Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when a DNN model has a huge number of parameters and a high computational cost, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that consists of three parts. First, we propose a cross-layer matrix to extract more features from the teacher model. Second, we adopt Kullback-Leibler (KL) Divergence in an offline environment to make the student model find a wider, more robust minimum. Finally, we propose an offline ensemble of pre-trained teachers to teach the student model. To address the dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to relax this constraint. We conducted experiments with VGG and ResNet models on the CIFAR-100 dataset. With VGG-11 as the teacher model and VGG-6 as the student model, Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and a $3.5\times$ computation rate. With ResNet-32 as the teacher model and ResNet-8 as the student model, Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and a $5.27\times$ computation rate. In addition, we conducted experiments on the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher model and MobileNet-9 as the student model, Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and a $2.05\times$ computation rate.

Information

Type
Original Paper
Creative Commons
CC BY-NC-SA 4.0
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (https://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright
Copyright © The Author(s), 2021. Published by Cambridge University Press

Fig. 1. Overall architecture of the proposed methods, which has three parts. First, we propose a cross-layer matrix that extends FSP [6] with the proposed Gramian matrix to extract more features (orange). Second, we adopt KL Divergence in the offline environment to make the S-DNN find a wider, more robust minimum (brown). Finally, we propose an offline ensemble of pre-trained T-DNNs that teaches the S-DNN through a stochastic mean (red).
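As a concrete reference for the cross-layer idea, below is a minimal PyTorch sketch of the FSP-style Gramian matrix and an L2 matching loss between teacher and student Gramians; the function names, shapes, and loss form are our illustrative assumptions based on FSP [6], not the authors' released code.

```python
import torch
import torch.nn.functional as F

def gramian(f_in: torch.Tensor, f_out: torch.Tensor) -> torch.Tensor:
    """FSP-style Gramian between two feature maps of equal spatial size.

    f_in:  (B, C1, H, W) features at an earlier layer.
    f_out: (B, C2, H, W) features at a later layer (further apart for the
           cross-layer variants in Fig. 2).
    Returns a (B, C1, C2) matrix summarizing the flow between the layers.
    """
    b, c1, h, w = f_in.shape
    c2 = f_out.shape[1]
    return torch.bmm(f_in.reshape(b, c1, h * w),
                     f_out.reshape(b, c2, h * w).transpose(1, 2)) / (h * w)

def cross_layer_loss(student_pairs, teacher_pairs):
    """L2 distance between matching student/teacher Gramian matrices.

    Each element of *_pairs is an (f_in, f_out) tuple; crossing one, two,
    or three layers only changes how far apart the paired features sit.
    """
    return sum(F.mse_loss(gramian(*s), gramian(*t))
               for s, t in zip(student_pairs, teacher_pairs))
```

Note that `F.mse_loss` requires the two Gramians to have identical shape, which is exactly the FSP limitation illustrated later in Fig. 10.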

Fig. 2. (a) Cross one layer. (b) Cross two layers. (c) Cross three layers. (d) Our proposed method.

Fig. 3. (a) CIFAR-100. (b) ImageNet $64\times 64$.

Fig. 4. T-DNN and S-DNN of the VGG and ResNet models. T-DNN: VGG-11 and ResNet-32. S-DNN: VGG-6 and ResNet-8.

Fig. 5. T-DNN and S-DNN of the MobileNet models. T-DNN: MobileNet-16. S-DNN: MobileNet-9.

Table 1. Classification results after knowledge distillation (VGG-11->6) on the CIFAR-100 dataset.

Table 2. Classification results after knowledge distillation (ResNet-32->8) on the CIFAR-100 dataset.

Table 3. Classification results after knowledge distillation (MobileNet-16->9) on ImageNet $64\times 64$.

Fig. 6. (a) Cross one layer. (b) Cross two layers. (c) Cross three layers.

Table 4. Different proposed cross-layer matrix methods (VGG-11->6) on CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 5. Different proposed cross-layer matrix methods (ResNet-32->8) on CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Fig. 7. (a) Cross one layer. (b) Cross two layers. (c) Cross three layers. (d) Cross four layers.

Table 6. Different proposed cross-layer matrix methods (MobileNet-16->9) on ImageNet $64\times 64$. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

Fig. 8. Illustration of using KL Divergence.
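For reference alongside Fig. 8, here is a hedged sketch of the KL Divergence distillation term as it is conventionally formulated (Hinton et al.); the temperature value is an illustrative assumption, not taken from the paper.

```python
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, T: float = 4.0):
    """KL divergence between temperature-softened teacher and student
    class distributions. The T*T factor keeps gradient magnitudes
    comparable across temperatures."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```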

Table 7. Effect of adding KL Divergence (VGG-11->6) on CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 8. Effect of adding KL Divergence (ResNet-32->8) on CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Table 9. Effect of adding KL Divergence (MobileNet-16->9) on ImageNet $64\times 64$. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

Fig. 9. (a) One pre-trained teacher. (b) Two pre-trained teachers. (c) Three pre-trained teachers.
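One plausible reading of the offline ensemble with a stochastic mean (Fig. 1) is averaging the frozen teachers' temperature-softened outputs before applying the distillation loss; the sketch below assumes exactly that and is not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_teacher_probs(teachers, x, T: float = 4.0):
    """Average the softened class distributions of several pre-trained,
    frozen teachers (offline ensemble). `teachers` is a list of
    nn.Module instances already set to eval mode."""
    probs = [F.softmax(t(x) / T, dim=1) for t in teachers]
    return torch.stack(probs, dim=0).mean(dim=0)
```

Under this assumption, the student is then trained against the averaged distribution using the KL term sketched above.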

Table 10. Different numbers of teachers (VGG-11->6) on CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 11. Different numbers of teachers (ResNet-32->8) on CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Table 12. Different numbers of teachers (MobileNet-16->9) on ImageNet $64\times 64$. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

Table 13. Combination of proposed methods (VGG-11->6) on CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6. P1: cross-three layers. P2: KL Divergence. P3: three pre-trained teachers.

Table 14. Combination of proposed methods (ResNet-32->8) on CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8. P1: cross-three layers. P2: KL Divergence. P3: three pre-trained teachers.

Table 15. Combination of proposed methods (MobileNet-16->9) on ImageNet $64\times 64$. T-DNN: MobileNet-16, S-DNN: MobileNet-9. P1: cross-three layers. P2: KL Divergence. P3: three pre-trained teachers.

Table 16. Computation, parameters, and average Top-1 accuracy comparison with VGG-11 and VGG-6 on CIFAR-100. T-DNN: VGG-11, S-DNN: VGG-6.

Table 17. Computation, parameters, and average Top-1 accuracy comparison with ResNet-32 and ResNet-8 on CIFAR-100. T-DNN: ResNet-32, S-DNN: ResNet-8.

Table 18. Computation, parameters, and average Top-1 accuracy comparison with MobileNet-16 and MobileNet-9 on ImageNet $64\times 64$. T-DNN: MobileNet-16, S-DNN: MobileNet-9.

Fig. 10. Limitation of FSP [6]. $m_{T},\,n_{T}$ denote the dimensions of the T-DNN Gramian matrix and $m_{S},\,n_{S}$ denote the dimensions of the S-DNN Gramian matrix.

Fig. 11. Illustration of a large difference in the number of layers. C, convolutional layer; FC, fully connected layer.

Fig. 12. Using $1\times 1$ convolutional layers to decrease channels.
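The channel reduction in Fig. 12 is a standard pointwise convolution; a minimal PyTorch sketch follows, assuming the adapter is trained jointly with the distillation losses.

```python
import torch.nn as nn

# A 1x1 convolution changes only the channel count (C_teacher -> C_student),
# leaving H and W untouched, so teacher features can be projected to match
# the student's smaller Gramian dimensions.
def channel_adapter(c_teacher: int, c_student: int) -> nn.Module:
    return nn.Conv2d(c_teacher, c_student, kernel_size=1, bias=False)
```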

Fig. 13. Illustration of two-stage knowledge distillation. C, convolutional layer; FC, fully connected layer.
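Reading Fig. 13 together with Table 20's setup (T-DNN1: ResNet-50, T-DNN2: ResNet-18, S-DNN: ResNet-10), two-stage distillation can be sketched schematically as follows; `distill` is a placeholder for a full distillation run (e.g. the cross-layer and KL losses above), not an actual API.

```python
# Two-stage knowledge distillation, schematically: a mid-sized T-DNN2 first
# learns from the large T-DNN1, then serves as the teacher for the small
# S-DNN. This narrows each teacher-student gap so the Gramian dimensions
# stay compatible at every stage.
def two_stage_kd(t_dnn1, t_dnn2, s_dnn, distill):
    distill(teacher=t_dnn1, student=t_dnn2)   # stage 1: ResNet-50 -> ResNet-18
    distill(teacher=t_dnn2, student=s_dnn)    # stage 2: ResNet-18 -> ResNet-10
    return s_dnn
```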

Fig. 14. ResNet-50 / ResNet-18 / ResNet-10.

Table 19. Classification results after knowledge distillation (ResNet-50->10) on the CIFAR-100 dataset.

Table 20. Adding $1\times 1$ convolutions to overcome the limitation of the proposed method, and multi-step compression, with ResNet models on CIFAR-100. T-DNN1: ResNet-50. T-DNN2: ResNet-18. S-DNN: ResNet-10.