TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Knowledge Distillation	ImageNet	LSHFM (T: ResNet-34 S: ResNet-18)	Top-1 accuracy %	71.72	# 29
Knowledge Distillation	MS COCO	LSHFM (T: ResNet101 S: MobileNetV2)	mAP	73.73	# 2
Knowledge Distillation	MS COCO	LSHFM (T: ResNet101 S: ResNet50)	mAP	77.16	# 1
Knowledge Distillation	PASCAL VOC	LSHFM (T: ResNet101 S: ResNet50)	mAP	93.17	# 1
Knowledge Distillation	PASCAL VOC	LSHFM (T: ResNet101 S: MobileNetV2)	mAP	90.14	# 2

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/in-defense-of-feature-mimicking-for-knowledge/knowledge-distillation-on-coco)](https://paperswithcode.com/sota/knowledge-distillation-on-coco?p=in-defense-of-feature-mimicking-for-knowledge)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/in-defense-of-feature-mimicking-for-knowledge/knowledge-distillation-on-pascal-voc)](https://paperswithcode.com/sota/knowledge-distillation-on-pascal-voc?p=in-defense-of-feature-mimicking-for-knowledge)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/in-defense-of-feature-mimicking-for-knowledge/knowledge-distillation-on-imagenet)](https://paperswithcode.com/sota/knowledge-distillation-on-imagenet?p=in-defense-of-feature-mimicking-for-knowledge)`

Distilling Knowledge by Mimicking Features

3 Nov 2020 · Guo-Hua Wang, Yifan Ge, Jianxin Wu

Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it can achieve higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into its magnitude and its direction. We argue that the teacher should give more freedom to the magnitude of the student's feature, and let the student pay more attention to mimicking the feature direction. To meet this requirement, we propose a loss term based on locality-sensitive hashing (LSH). With the help of this new loss, our method indeed mimics feature directions more accurately, relaxes constraints on feature magnitudes, and achieves state-of-the-art distillation accuracy. We provide theoretical analyses of how LSH facilitates feature direction mimicking, and further extend feature mimicking to multi-label recognition and object detection.
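The magnitude/direction split and the role of LSH described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of sign-random-projection hashing with no bias term, and the binary cross-entropy form of the loss are all assumptions made here for illustration. The sketch only shows that hash bits from hyperplanes through the origin depend on a feature's direction and not on its positive scale.

```python
import math
import random


def norm(v):
    """Euclidean magnitude of a feature vector."""
    return math.sqrt(sum(x * x for x in v))


def direction(v):
    """Unit-length direction of a feature vector."""
    n = norm(v)
    return [x / n for x in v]


def make_hyperplanes(num_hashes, dim, seed=0):
    """Random Gaussian hyperplanes through the origin.

    Assumption: no bias term, so each hash bit depends only on direction.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]


def projections(planes, v):
    """Signed projection of v onto every hyperplane normal."""
    return [sum(w * x for w, x in zip(plane, v)) for plane in planes]


def hash_bits(planes, v):
    """Sign-random-projection LSH code: one bit per hyperplane."""
    return [1 if p > 0 else 0 for p in projections(planes, v)]


def lsh_loss(planes, teacher_feat, student_feat):
    """Hypothetical LSH-style loss (an assumed form, not taken from the paper).

    Binary cross-entropy between the teacher's hash bits and the sigmoid of
    the student's projections, written in a numerically stable way.
    """
    targets = hash_bits(planes, teacher_feat)
    total = 0.0
    for t, z in zip(targets, projections(planes, student_feat)):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(targets)


planes = make_hyperplanes(num_hashes=64, dim=4)
teacher = [1.0, -2.0, 0.5, 3.0]
rescaled = [2.0 * x for x in teacher]   # same direction, twice the magnitude
flipped = [-x for x in teacher]         # same magnitude, opposite direction

print("magnitudes:", norm(teacher), norm(rescaled), norm(flipped))
print("same code after rescaling:", hash_bits(planes, rescaled) == hash_bits(planes, teacher))
print("loss (rescaled student):", lsh_loss(planes, teacher, rescaled))
print("loss (flipped student):", lsh_loss(planes, teacher, flipped))
```

In this toy setup, a student feature that matches the teacher's direction keeps the same hash code at any positive scale, while a student with the teacher's exact magnitude but the opposite direction is penalized heavily. A plain L2 feature-mimicking loss would instead penalize the rescaled student as well.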

Code

DoctorKey/LSHFM.singleclassification official

DoctorKey/LSHFM.detection official

DoctorKey/LSHFM.multiclassification official

Tasks

Knowledge Distillation

Object Detection

Datasets

ImageNet

MS COCO

CIFAR-100

PASCAL VOC

Results from the Paper

Ranked #1 on Knowledge Distillation on MS COCO (mAP metric)

Methods

Softmax
