Knowledge Distillation Based on Transformed Teacher Matching

17 Feb 2024 · Kaixiang Zheng, En-hui Yang

As a technique that bridges logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both the teacher's and the student's logits. Motivated by recent works, in this paper we instead drop temperature scaling on the student side and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of the probability distribution, we show that, in comparison with the original KD, TTM has an inherent Rényi entropy term in its objective function, which serves as an extra regularization term. Extensive experimental results demonstrate that, thanks to this inherent regularization, TTM yields trained students with better generalization than the original KD. To further enhance the student's capability to match the teacher's power-transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). Comprehensive experiments show that, although simple, WTTM is effective: it improves upon TTM and achieves state-of-the-art accuracy. Our source code is available at https://github.com/zkxufo/TTM.
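
One way to see where the Rényi entropy term could come from is the short calculation below. This is a sketch consistent with the abstract, not the paper's exact proposition: it only assumes that temperature scaling with temperature T corresponds to a power transform with exponent γ = 1/T, and that the distillation term is a cross-entropy H(·,·) between the transformed teacher distribution and the student distribution.

```latex
% Power transform of a probability vector p with exponent \gamma = 1/T
% (equivalent to temperature-scaling the underlying logits by T):
\tilde{p}_i = \frac{p_i^{\gamma}}{\sum_j p_j^{\gamma}}.

% With q the student's untransformed output and \tilde{q} its power transform,
% the tempered-teacher-vs-tempered-student cross-entropy of standard KD
% decomposes as
H(\tilde{p}^{t}, \tilde{q})
  = \gamma\, H(\tilde{p}^{t}, q) + (1-\gamma)\, H_{\gamma}(q),
\qquad
H_{\gamma}(q) = \frac{1}{1-\gamma}\log \sum_i q_i^{\gamma},

% where H_{\gamma} is the R\'enyi entropy of order \gamma. Rearranging, the
% TTM-style term, which drops temperature scaling on the student side, reads
H(\tilde{p}^{t}, q) = T\, H(\tilde{p}^{t}, \tilde{q}) - (T-1)\, H_{\gamma}(q),

% i.e. it carries an extra R\'enyi entropy term that rewards less
% over-confident student outputs.
```

The following is a minimal PyTorch sketch of a TTM-style loss in the spirit of the description above; it is not the authors' reference implementation (see the linked repository). The balancing weight `beta` is a hypothetical hyperparameter, and WTTM would additionally multiply the distillation term by a per-sample weight whose exact form is defined in the paper and not reproduced here.

```python
import torch
import torch.nn.functional as F

def ttm_loss(student_logits, teacher_logits, labels, gamma=0.25, beta=1.0):
    """Cross-entropy on hard labels plus KL divergence between the
    power-transformed teacher distribution and the *unscaled* student
    distribution (no temperature on the student side).

    gamma: power-transform exponent, gamma = 1/T.
    beta: hypothetical weight balancing the two terms.
    teacher_logits are assumed to be produced under torch.no_grad().
    """
    # Power transform of the teacher's probabilities; softmax(gamma * z)
    # equals the power transform of softmax(z) with exponent gamma.
    log_p_teacher = F.log_softmax(teacher_logits * gamma, dim=-1)
    # Student distribution is left untransformed.
    log_q_student = F.log_softmax(student_logits, dim=-1)

    ce = F.cross_entropy(student_logits, labels)
    # KL(transformed teacher || student), averaged over the batch.
    kl = F.kl_div(log_q_student, log_p_teacher,
                  log_target=True, reduction="batchmean")
    return ce + beta * kl

# Example usage with random tensors (gamma = 1/T with T = 4):
s = torch.randn(8, 100, requires_grad=True)
t = torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
ttm_loss(s, t, y, gamma=0.25, beta=2.25).backward()
```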


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank | Rank (CRD training setting) |
|---|---|---|---|---|---|---|
| Knowledge Distillation | ImageNet | WTTM (T: ResNet-50, S: MobileNet-V1) | Top-1 accuracy (%) | 73.09 | #15 | – |
| Knowledge Distillation | ImageNet | WTTM (T: DeiT III-Small, S: DeiT-Tiny) | Top-1 accuracy (%) | 77.03 | #12 | #1 |
| Knowledge Distillation | ImageNet | WTTM (T: ResNet-34, S: ResNet-18) | Top-1 accuracy (%) | 72.19 | #21 | #1 |
