To Smooth or not to Smooth? On Compatibility between Label Smoothing and Knowledge Distillation

29 Sep 2021 · Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, Ngai-Man Cheung

This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Recent findings on this question take opposing standpoints. Specifically, Müller et al. [1] claim that LS erases relative information in the logits; therefore an LS-trained teacher can hurt KD. In contrast, Shen et al. [2] claim that LS enlarges the distance between semantically similar classes; therefore an LS-trained teacher is compatible with KD. Critically, there has been no effort to understand and resolve these contradictory findings, leaving the primary question (to smooth or not to smooth a teacher network?) unanswered. In this work, we establish a foundational understanding of the compatibility between LS and KD. We begin by scrutinizing these contradictory findings under a unified, consistent empirical setup. Through this investigation, we discover that in the presence of an LS-trained teacher, KD at higher temperatures systematically diffuses the penultimate layer representations learnt by the student towards semantically similar classes. This systematic diffusion curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. We show this systematic diffusion qualitatively by visualizing penultimate layer representations, and quantitatively using our proposed relative distance metric, the diffusion index ($\eta$). Importantly, the systematic diffusion we discover is the missing concept needed to understand and resolve these contradictory findings. Our discovery is supported by large-scale experiments and analyses covering image classification (standard and fine-grained), neural machine translation, and compact student network distillation tasks, spanning multiple datasets and teacher-student architectures. Finally, we shed light on the question of whether to smooth a teacher network, in order to help practitioners make informed decisions.
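
For readers less familiar with the two techniques, the sketch below illustrates them in their commonly used textbook form; it is not taken from the paper. It assumes LS mixes the one-hot target with a uniform distribution using a mixing weight `alpha`, and that KD minimizes the KL divergence between teacher and student softmax outputs at temperature `T`, scaled by `T**2`. The function names, the value of `alpha`, and the example logits are all hypothetical, and the diffusion index ($\eta$) is not implemented here because the abstract does not define it.

```python
import numpy as np

def smooth_labels(y, num_classes, alpha=0.1):
    """Label smoothing (assumed standard form): (1 - alpha) * one_hot + alpha / K."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the row max for numerical stability."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student) at temperature T, averaged over the batch.

    The T**2 factor is a common convention, assumed here, that keeps
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

# Hypothetical logits for one sample over three classes.
teacher_logits = np.array([[5.0, 2.0, 0.5]])
student_logits = np.array([[3.0, 2.5, 0.0]])

print(smooth_labels(np.array([0]), num_classes=3, alpha=0.1))
for T in (1.0, 4.0):
    # Higher T flattens the teacher distribution the student is asked to match.
    print(T, softmax(teacher_logits, T), kd_loss(student_logits, teacher_logits, T))
```

Raising `T` spreads the teacher's probability mass across non-target classes, which is the regime in which the abstract reports that distilling from an LS-trained teacher becomes ineffective.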
