TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Temporal Action Localization	ActivityNet-1.3	ActionFormer (TSP features)	mAP IOU@0.5	54.7	# 8
Temporal Action Localization	ActivityNet-1.3	ActionFormer (TSP features)	mAP	36.6	# 14
Temporal Action Localization	ActivityNet-1.3	ActionFormer (TSP features)	mAP IOU@0.75	37.8	# 5
Temporal Action Localization	ActivityNet-1.3	ActionFormer (TSP features)	mAP IOU@0.95	8.4	# 12
Temporal Action Localization	EPIC-KITCHENS-100	ActionFormer (verb)	Avg mAP (0.1-0.5)	23.5	# 4
Temporal Action Localization	EPIC-KITCHENS-100	ActionFormer (verb)	mAP IOU@0.1	26.6	# 4
Temporal Action Localization	EPIC-KITCHENS-100	ActionFormer (verb)	mAP IOU@0.2	25.4	# 4
Temporal Action Localization	EPIC-KITCHENS-100	ActionFormer (verb)	mAP IOU@0.3	24.2	# 4
Temporal Action Localization	EPIC-KITCHENS-100	ActionFormer (verb)	mAP IOU@0.4	22.3	# 4
Temporal Action Localization	EPIC-KITCHENS-100	ActionFormer (verb)	mAP IOU@0.5	19.1	# 4
Temporal Action Localization	THUMOS’14	ActionFormer (I3D features)	mAP IOU@0.5	71.0	# 8
Temporal Action Localization	THUMOS’14	ActionFormer (I3D features)	mAP IOU@0.3	82.1	# 8
Temporal Action Localization	THUMOS’14	ActionFormer (I3D features)	mAP IOU@0.4	77.8	# 8
Temporal Action Localization	THUMOS’14	ActionFormer (I3D features)	mAP IOU@0.6	59.4	# 8
Temporal Action Localization	THUMOS’14	ActionFormer (I3D features)	mAP IOU@0.7	43.9	# 8
Temporal Action Localization	THUMOS’14	ActionFormer (I3D features)	Avg mAP (0.3:0.7)	66.8	# 11
audio-visual event localization	UnAV-100	ActionFormer	mAP	42.2	# 2
audio-visual event localization	UnAV-100	ActionFormer	AP@IOU0.5	43.5	# 2
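
The averaged rows in the table above can be cross-checked against the per-threshold rows. The sketch below assumes that the "Avg mAP" entries are plain arithmetic means of the listed thresholds, which the table does not state explicitly but the numbers bear out. The dictionary and function names are illustrative only. The ActivityNet-1.3 average cannot be checked this way, since only three thresholds are listed and that benchmark's average is conventionally taken over tIoU thresholds from 0.5 to 0.95 in steps of 0.05.

```python
# Per-threshold mAP values copied from the results table above.
thumos14_i3d = {0.3: 82.1, 0.4: 77.8, 0.5: 71.0, 0.6: 59.4, 0.7: 43.9}
epic_kitchens_verb = {0.1: 26.6, 0.2: 25.4, 0.3: 24.2, 0.4: 22.3, 0.5: 19.1}


def average_map(per_threshold):
    """Arithmetic mean of mAP over the listed tIoU thresholds (assumed definition)."""
    return sum(per_threshold.values()) / len(per_threshold)


print(round(average_map(thumos14_i3d), 1))        # 66.8, matches "Avg mAP (0.3:0.7)"
print(round(average_map(epic_kitchens_verb), 1))  # 23.5, matches "Avg mAP (0.1-0.5)"
```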

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionformer-localizing-moments-of-actions/audio-visual-event-localization-on-unav-100)](https://paperswithcode.com/sota/audio-visual-event-localization-on-unav-100?p=actionformer-localizing-moments-of-actions)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionformer-localizing-moments-of-actions/temporal-action-localization-on-epic-kitchens)](https://paperswithcode.com/sota/temporal-action-localization-on-epic-kitchens?p=actionformer-localizing-moments-of-actions)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionformer-localizing-moments-of-actions/temporal-action-localization-on-thumos14)](https://paperswithcode.com/sota/temporal-action-localization-on-thumos14?p=actionformer-localizing-moments-of-actions)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/actionformer-localizing-moments-of-actions/temporal-action-localization-on-activitynet)](https://paperswithcode.com/sota/temporal-action-localization-on-activitynet?p=actionformer-localizing-moments-of-actions)`
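
The four badge lines above share one URL pattern: a paper slug and a benchmark slug substituted into two fixed templates. A minimal sketch of that pattern follows, with the templates inferred purely from the lines above. The function and constant names are hypothetical, and whether these endpoints still respond is not verified here.

```python
# URL templates inferred from the badge lines above (an assumption, not an official API).
BADGE_ENDPOINT = "https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge"
SOTA_BASE = "https://paperswithcode.com/sota"


def badge_markdown(paper_slug, benchmark_slug):
    """Build the Markdown for one leaderboard badge from its two slugs."""
    image = f"{BADGE_ENDPOINT}/{paper_slug}/{benchmark_slug}"
    link = f"{SOTA_BASE}/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({image})]({link})"


print(badge_markdown(
    "actionformer-localizing-moments-of-actions",
    "temporal-action-localization-on-thumos14",
))
```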

ActionFormer: Localizing Moments of Actions with Transformers

16 Feb 2022 · Chenlin Zhang, Jianxin Wu, Yin Li

Self-attention-based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks to temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model that identifies actions in time and recognizes their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior work. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior work). Our code is available at http://github.com/happyharrycn/actionformer_release.
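
To make "classify every moment in time and estimate the corresponding action boundaries" and "mAP at tIoU=0.5" concrete, here is a minimal sketch in plain Python. It assumes each moment `t` outputs per-class scores plus two non-negative distances to the action start and end, so that a candidate segment is `(t - d_start, t + d_end)`. That output format, the score threshold, and all names below are assumptions made for illustration, not the released implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float
    end: float
    label: int
    score: float


def temporal_iou(a_start, a_end, b_start, b_end):
    """Temporal IoU of two 1D intervals: overlap length divided by union length."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def decode_moments(class_scores, offsets, stride=1.0, score_threshold=0.5):
    """Turn per-moment predictions into candidate segments.

    class_scores[t] holds one score per class at moment t; offsets[t] holds the
    assumed (d_start, d_end) distances from moment t to the action boundaries.
    """
    segments = []
    for t, (scores, (d_start, d_end)) in enumerate(zip(class_scores, offsets)):
        label = max(range(len(scores)), key=scores.__getitem__)
        score = scores[label]
        if score < score_threshold:
            continue  # this moment is treated as background
        center = t * stride
        segments.append(
            Segment(center - d_start * stride, center + d_end * stride, label, score)
        )
    return segments


# Toy example with three moments and two classes.
class_scores = [[0.1, 0.2], [0.9, 0.05], [0.3, 0.6]]
offsets = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
candidates = decode_moments(class_scores, offsets)
ground_truth = (0.0, 4.0)
for seg in candidates:
    iou = temporal_iou(seg.start, seg.end, *ground_truth)
    print(seg, "matches at tIoU=0.5:", iou >= 0.5)
```

A full evaluation would also suppress overlapping candidates and compute average precision per class before averaging; those steps are omitted here.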

Code

happyharrycn/actionformer_release official

384

Tasks

Action Localization

Action Recognition

audio-visual event localization

object-detection

Temporal Action Localization

Video Understanding

Datasets

ActivityNet

THUMOS14

EPIC-KITCHENS-100

UnAV-100

Results from the Paper

Ranked #2 on audio-visual event localization on UnAV-100

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
