Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Action Recognition	HMDB-51	S:VGG-16, T:VGG-16 (ImageNet pretrained)	Average accuracy of 3 splits	65.4	#61
Action Recognition	UCF101	S:VGG-16, T:VGG-16 (ImageNet pretrained)	3-fold Accuracy	92.5	#60

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/convolutional-two-stream-network-fusion-for/action-recognition-in-videos-on-ucf101)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ucf101?p=convolutional-two-stream-network-fusion-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/convolutional-two-stream-network-fusion-for/action-recognition-in-videos-on-hmdb-51)](https://paperswithcode.com/sota/action-recognition-in-videos-on-hmdb-51?p=convolutional-two-stream-network-fusion-for)`

Convolutional Two-Stream Network Fusion for Video Action Recognition

CVPR 2016 · Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
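Finding (i) can be made concrete with a small sketch of fusion at a convolutional layer: the feature maps of the two streams are stacked along the channel axis and mixed by a 1x1 convolution, and the fused maps are then pooled over time. This is an illustration under stated assumptions, not the authors' implementation. The function names (`concat_fusion`, `conv1x1`, `temporal_max_pool`), the tiny nested-list tensors, and the weights are all hypothetical, and the pooling here runs over time only, which is a simplification of the spatiotemporal pooling in finding (iii).

```python
# Illustrative sketch only: pure Python, hypothetical names, toy-sized tensors.
# A feature map is a nested list indexed as [channel][row][col].


def concat_fusion(spatial, temporal):
    """Stack the two streams' feature maps along the channel axis."""
    return spatial + temporal


def conv1x1(features, weights, biases):
    """Apply a 1x1 convolution, i.e. a per-pixel linear map over channels.

    weights[d][c] mixes input channel c into output channel d.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                sum(w * features[c][i][j] for c, w in enumerate(row_w)) + b
                for j in range(width)
            ]
            for i in range(height)
        ]
        for row_w, b in zip(weights, biases)
    ]


def temporal_max_pool(frames):
    """Max-pool fused maps over time; frames is indexed [t][channel][row][col].

    Simplification: pools over the time axis only, not a 3D neighbourhood.
    """
    channels = len(frames[0])
    height, width = len(frames[0][0]), len(frames[0][0][0])
    return [
        [
            [max(frame[c][i][j] for frame in frames) for j in range(width)]
            for i in range(height)
        ]
        for c in range(channels)
    ]


if __name__ == "__main__":
    # One-channel 2x2 maps standing in for appearance and motion features.
    spatial = [[[1.0, 2.0], [3.0, 4.0]]]
    temporal = [[[0.5, 0.5], [0.5, 0.5]]]
    stacked = concat_fusion(spatial, temporal)  # now 2 channels
    fused = conv1x1(stacked, weights=[[1.0, 2.0]], biases=[0.0])
    print(fused)
```

Because the two streams are merged into a single tower at the fusion layer, the layers after that point are no longer duplicated per stream, which is consistent with the parameter saving the abstract describes.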

Code

feichtenhofer/twostreamfusion official

706

Tasks

Action Recognition

Action Recognition In Videos

Temporal Action Localization

Vocal Bursts Valence Prediction

Datasets

UCF101

HMDB51

Results from the Paper

Ranked #60 on Action Recognition on UCF101 (using extra training data)

Methods

Convolution • Softmax
