TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition	HMDB-51	R-STAN-152	Average accuracy of 3 splits	55.16	# 70
Action Recognition	HMDB-51	R-STAN-50	Average accuracy of 3 splits	62.8	# 64
Action Recognition	UCF101	R-STAN-50	3-fold Accuracy	91.5	# 64
Action Recognition	UCF101	R-STAN-101	3-fold Accuracy	94.5	# 49

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/r-stan-residual-spatial-temporal-attention/action-recognition-in-videos-on-ucf101)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101?p=r-stan-residual-spatial-temporal-attention)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/r-stan-residual-spatial-temporal-attention/action-recognition-in-videos-on-hmdb-51)](https://paperswithcode.com/sota/action-recognition-in-videos-on-hmdb-51?p=r-stan-residual-spatial-temporal-attention)`

R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition

IEEE Access (Volume 7), 2019 · Quanle Liu, Xiangjiu Che, Mei Bie

The two-stream network architecture captures temporal and spatial features from videos simultaneously and has achieved excellent performance on video action recognition tasks. However, videos contain a fair amount of redundant information in both the temporal and spatial dimensions, which increases the complexity of network learning. To solve this problem, we propose the residual spatial-temporal attention network (R-STAN), a feed-forward convolutional neural network that uses residual learning and a spatial-temporal attention mechanism so that the network focuses more on discriminative temporal and spatial features. In R-STAN, each stream is constructed by stacking residual spatial-temporal attention blocks (R-STABs). The spatial-temporal attention modules integrated into the residual blocks generate attention-aware features along the temporal and spatial dimensions, which largely reduces the redundant information. Combined with residual learning, this allows us to construct a very deep network for learning spatial-temporal information in videos. As the layers go deeper, the attention-aware features from the different R-STABs change adaptively. We validate R-STAN through extensive experiments on the UCF101 and HMDB51 datasets. The experiments show that combining residual learning with the spatial-temporal attention mechanism contributes substantially to the performance of video action recognition.
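To make the idea of an attention block with a residual connection more concrete, here is a minimal NumPy sketch. It is not the paper's R-STAB: the abstract does not give the block's equations, so the parameter-free scoring (global averages instead of learned convolutions), the softmax over frames, the sigmoid over locations, and the function name `residual_st_attention` are all assumptions chosen for illustration. The only property it aims to show is that an attention mask re-weights features along time and space while the identity path keeps the original signal intact.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def sigmoid(x):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def residual_st_attention(features):
    """Hypothetical residual spatial-temporal attention step.

    features: array of shape (T, H, W, C), i.e. frames, height, width, channels.
    Returns an array of the same shape.
    """
    # Temporal attention (assumed form): one weight per frame, computed
    # from the frame's global average activation; weights sum to 1 over T.
    frame_scores = features.mean(axis=(1, 2, 3))          # shape (T,)
    temporal = softmax(frame_scores, axis=0)              # shape (T,)

    # Spatial attention (assumed form): one gate per location, computed
    # from the channel-wise mean at that location.
    spatial = sigmoid(features.mean(axis=3))              # shape (T, H, W)

    # Combine both masks and broadcast over the channel axis.
    mask = temporal[:, None, None, None] * spatial[..., None]  # (T, H, W, 1)

    # Residual form: identity path plus the attention-weighted path, so
    # the mask can only emphasize features, never erase the input.
    return features + features * mask


if __name__ == "__main__":
    clip = np.random.default_rng(0).normal(size=(8, 7, 7, 16))
    out = residual_st_attention(clip)
    print(out.shape)  # same shape as the input clip
```

Because the output has the same shape as the input, steps like this can be stacked, which is the property the abstract relies on when it describes building each stream from a sequence of blocks. A real implementation would replace the fixed averages with learned layers.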

Code

No code implementations yet.

Tasks

Action Recognition

Temporal Action Localization

Datasets

UCF101

HMDB51

Results from the Paper

Ranked #49 on Action Recognition on UCF101

Methods

No methods listed for this paper.
