TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	Diving-48	GC-TDN	Accuracy	87.6	# 7
Egocentric Activity Recognition	EGTEA	GC-TSM	Average Accuracy	65.1	# 3
Action Recognition	Something-Something V2	GC-TDN Ensemble (R50,8+16)	Top-1 Accuracy	67.8	# 56
Action Recognition	Something-Something V2	GC-TDN Ensemble (R50,8+16)	Top-5 Accuracy	91.2	# 41
Action Recognition	Something-Something V2	GC-TDN Ensemble (R50,8+16)	Parameters	27.4	# 33
Action Recognition	Something-Something V2	GC-TDN Ensemble (R50,8+16)	GFLOPs	110.1	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/group-contextualization-for-video-recognition/egocentric-activity-recognition-on-egtea-1)](https://paperswithcode.com/sota/egocentric-activity-recognition-on-egtea-1?p=group-contextualization-for-video-recognition)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/group-contextualization-for-video-recognition/action-recognition-on-diving-48)](https://paperswithcode.com/sota/action-recognition-on-diving-48?p=group-contextualization-for-video-recognition)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/group-contextualization-for-video-recognition/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=group-contextualization-for-video-recognition)`

Group Contextualization for Video Recognition

CVPR 2022 · Yanbin Hao, Hao Zhang, Chong-Wah Ngo, Xiangnan He

Learning discriminative representations from the complex spatio-temporal dynamic space is essential for video recognition. On top of stylized spatio-temporal computational units, further refining the learnt features with axial contexts has been shown to be promising in achieving this goal. However, previous works generally use a single kind of context to calibrate entire feature channels, and can hardly cope with diverse video activities. The problem can be tackled by using pair-wise spatio-temporal attention to recompute feature responses with cross-axis contexts, but at the expense of heavy computation. In this paper, we propose an efficient feature refinement method that decomposes the feature channels into several groups and separately refines them with different axial contexts in parallel. We refer to this lightweight feature calibration as group contextualization (GC). Specifically, we design a family of efficient element-wise calibrators, i.e., ECal-G/S/T/L, whose axial contexts are information dynamics aggregated from other axes either globally or locally, to contextualize feature channel groups. The GC module can be densely plugged into each residual layer of off-the-shelf video networks. With little computational overhead, consistent improvement is observed when plugging GC into different networks. By utilizing calibrators to embed features with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities. On videos with rich temporal variations, GC empirically boosts the performance of 2D CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks. Code is available at https://github.com/haoyanbin918/Group-Contextualization.
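
The channel-grouping idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that): it assumes a feature array shaped (C, T, H, W), replaces every learned transform with plain averaging, and assumes sigmoid gating as the element-wise calibration. The function name `group_contextualize` and the mapping of the four groups to global, spatial, temporal, and local contexts are hypothetical, chosen only to loosely mirror the G/S/T/L naming.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def group_contextualize(x):
    """Toy group contextualization on a (C, T, H, W) feature array.

    Assumption: channels are split into four equal groups, and each group
    is gated element-wise by a context averaged over other axes. The real
    calibrators use learned layers, which are omitted here.
    """
    if x.ndim != 4 or x.shape[0] % 4 != 0:
        raise ValueError("expected shape (C, T, H, W) with C divisible by 4")

    g_global, g_spatial, g_temporal, g_local = np.split(x, 4, axis=0)

    # Global context: one scalar per channel, pooled over T, H and W.
    ctx_global = g_global.mean(axis=(1, 2, 3), keepdims=True)

    # Spatial context: an H x W map per channel, pooled over time.
    ctx_spatial = g_spatial.mean(axis=1, keepdims=True)

    # Temporal context: a length-T profile per channel, pooled over space.
    ctx_temporal = g_temporal.mean(axis=(2, 3), keepdims=True)

    # Local context: average over a 3-frame temporal window (edge-padded).
    padded = np.pad(g_local, ((0, 0), (1, 1), (0, 0), (0, 0)), mode="edge")
    ctx_local = (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0

    # Element-wise calibration: broadcasting applies each gate to its group.
    groups = [
        g_global * sigmoid(ctx_global),
        g_spatial * sigmoid(ctx_spatial),
        g_temporal * sigmoid(ctx_temporal),
        g_local * sigmoid(ctx_local),
    ]
    return np.concatenate(groups, axis=0)


features = np.ones((8, 4, 3, 3))
refined = group_contextualize(features)
print(refined.shape)  # (8, 4, 3, 3)
```

Because each group is processed independently, the four calibrations could run in parallel, and the output keeps the input shape, which is what would let such a module sit inside a residual layer without changing the surrounding network.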

Code

haoyanbin918/group-contextualization official

Tasks

Action Recognition

Egocentric Activity Recognition

Video Recognition

Datasets

Kinetics

Kinetics 400

Something-Something V2

Something-Something V1

EGTEA

Results from the Paper

Ranked #3 on Egocentric Activity Recognition on EGTEA
