TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	Diving-48	TimeSformer-HR	Accuracy	78	# 15
Action Recognition	Diving-48	TimeSformer-L	Accuracy	81	# 14
Action Recognition	Diving-48	TimeSformer	Accuracy	75	# 17
Video Question Answering	Howto100M-QA	TimeSformer	Accuracy	62.1	# 1
Action Classification	Kinetics-400	TimeSformer	Acc@1	78	# 123
Action Classification	Kinetics-400	TimeSformer	Acc@5	93.7	# 86
Action Classification	Kinetics-400	TimeSformer-L (ImageNet-21k pretrain)	Acc@1	80.7	# 86
Action Classification	Kinetics-400	TimeSformer-L (ImageNet-21k pretrain)	Acc@5	94.7	# 59
Action Classification	Kinetics-400	TimeSformer-L (ImageNet-21k pretrain)	FLOPs (G) x views	7140x3	# 1
Action Classification	Kinetics-400	TimeSformer-L (ImageNet-21k pretrain)	Parameters (M)	121.4	# 25
Action Classification	Kinetics-400	TimeSformer-HR	Acc@1	79.7	# 101
Action Classification	Kinetics-400	TimeSformer-HR	Acc@5	94.4	# 70
Action Recognition	Something-Something V2	TimeSformer-L	Top-1 Accuracy	62.3	# 103
Action Recognition	Something-Something V2	TimeSformer	Top-1 Accuracy	59.5	# 112
Action Recognition	Something-Something V2	TimeSformer-HR	Top-1 Accuracy	62.5	# 101
Anomaly Detection	UBnormal	TimeSformer	AUC	68.5%	# 3
Anomaly Detection	UBnormal	TimeSformer	RBDC	0.04	# 3
Anomaly Detection	UBnormal	TimeSformer	TBDC	0.05	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/is-space-time-attention-all-you-need-for/video-question-answering-on-howto100m-qa)](https://paperswithcode.com/sota/video-question-answering-on-howto100m-qa?p=is-space-time-attention-all-you-need-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/is-space-time-attention-all-you-need-for/anomaly-detection-on-ubnormal)](https://paperswithcode.com/sota/anomaly-detection-on-ubnormal?p=is-space-time-attention-all-you-need-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/is-space-time-attention-all-you-need-for/action-recognition-on-diving-48)](https://paperswithcode.com/sota/action-recognition-on-diving-48?p=is-space-time-attention-all-you-need-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/is-space-time-attention-all-you-need-for/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=is-space-time-attention-all-you-need-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/is-space-time-attention-all-you-need-for/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=is-space-time-attention-all-you-need-for)`

Is Space-Time Attention All You Need for Video Understanding?

9 Feb 2021 · Gedas Bertasius, Heng Wang, Lorenzo Torresani

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
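The "divided attention" scheme described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`attend`, `divided_block`, `comparisons_per_patch`) are hypothetical, the query/key/value projections are replaced by the identity, and the class token, multiple heads, layer normalization, and MLP of a real Transformer block are omitted. The temporal-then-spatial order inside the block is also an assumption of this sketch.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens):
    # Single-head self-attention over a list of vectors.
    # Assumption: identity Q/K/V projections (no learned weights).
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * value[i] for w, value in zip(weights, tokens))
                    for i in range(dim)])
    return out

def residual(tokens, update):
    # Element-wise residual connection: tokens + update.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(tokens, update)]

def divided_block(video):
    # video[t][s] is the embedding of patch s in frame t.
    n_frames, n_patches = len(video), len(video[0])
    # Temporal attention: each patch location attends across frames only.
    after_time = [[None] * n_patches for _ in range(n_frames)]
    for s in range(n_patches):
        track = [video[t][s] for t in range(n_frames)]
        for t, vec in enumerate(residual(track, attend(track))):
            after_time[t][s] = vec
    # Spatial attention: each frame attends across its own patches only.
    return [residual(frame, attend(frame)) for frame in after_time]

def comparisons_per_patch(n_frames, n_patches):
    # Query-key comparisons per patch in this sketch (class token ignored).
    return {"joint": n_frames * n_patches, "divided": n_frames + n_patches}

# Toy input: 2 frames, 4 patches per frame, embedding dimension 2.
video = [[[1.0, 2.0] for _ in range(4)] for _ in range(2)]
out = divided_block(video)

# Illustrative setting: 8 frames, 14 x 14 = 196 patches per frame.
print(comparisons_per_patch(8, 196))  # {'joint': 1568, 'divided': 204}
```

Under these simplifications, each patch is compared against `n_frames + n_patches` keys per block rather than `n_frames * n_patches`. This shows why separating the two attention steps scales more gently with clip length than attending jointly over all space-time patches.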

Code

Repository	Stars
facebookresearch/TimeSformer (official)	1,421
open-mmlab/mmaction2	3,887
towhee-io/towhee	2,986
PaddlePaddle/PaddleVideo	1,413
The-AI-Summer/self-attention-cv	1,140

See all 13 implementations

Tasks

Action Classification

Action Recognition

Anomaly Detection

General Classification

Video Classification

Video Question Answering

Video Understanding

Datasets

ImageNet

Kinetics

Kinetics 400

Something-Something V2

HowTo100M

Kinetics-600

UBnormal

Results from the Paper

Ranked #1 on Video Question Answering on Howto100M-QA

Methods

Temporal attention • TimeSformer
