Understanding Knowledge Distillation

1 Jan 2021  ·  Taehyeon Kim, Jaehoon Oh, Nakyil Kim, Sangwook Cho, Se-Young Yun

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures that achieve high accuracy with few parameters. However, why and when KD works well remains poorly understood. This paper reveals several intriguing behaviors of KD that we believe lead to a better understanding of it. We first investigate the role of the temperature scaling hyperparameter in KD. We show theoretically that, as the temperature grows, the KD loss focuses on matching the logit vectors of the teacher and the student rather than matching their labels. Extensive experiments also show that KD with a sufficiently large temperature outperforms other recently proposed KD variants. Based on these observations, we conjecture that logit vector matching is more important than label matching. To verify this conjecture, we test an extreme logit-learning model in which KD is implemented with the mean squared error (MSE) between the student's logits and the teacher's logits. KD with MSE consistently achieves the best accuracy across various settings. We analyze how the learning behavior of KD changes with the temperature using a new data uncertainty estimator, coined Top Logit Difference (TLD). We then study KD performance for various data sizes. Interestingly, when only a few samples or a few labels are available, a small teacher with a shallow architecture yields better generalization than wider and deeper teachers.
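For orientation, the two distillation objectives discussed in the abstract can be sketched as follows: the standard KD loss compares the temperature-softened softmax outputs of teacher and student with a KL divergence, and in the large-temperature limit its gradient approximately matches the (zero-mean) logit vectors, which the MSE variant makes explicit. The PyTorch sketch below illustrates both losses under these assumptions; the function names (kd_kl_loss, kd_mse_loss) and the default temperature are illustrative choices, not the authors' released implementation.

# Minimal PyTorch sketch (illustrative, not the authors' code) of the two
# distillation objectives described in the abstract.
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    # Standard KD loss: KL divergence between temperature-softened teacher
    # and student distributions, scaled by T^2 (following Hinton et al., 2015).
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)

def kd_mse_loss(student_logits, teacher_logits):
    # Logit-matching variant: mean squared error between raw student and
    # teacher logit vectors, with no softmax or temperature.
    return F.mse_loss(student_logits, teacher_logits)

if __name__ == "__main__":
    # Toy usage with random logits for a batch of 8 examples and 10 classes.
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    print(kd_kl_loss(student_logits, teacher_logits, T=20.0))
    print(kd_mse_loss(student_logits, teacher_logits))

In practice, either loss would typically be combined with the usual cross-entropy on the ground-truth labels; the weighting between the two terms is a design choice not fixed by the abstract.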
