Hierarchical Feature Aggregation Networks for Video Action Recognition

29 May 2019  ·  Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

Most action recognition methods are based on either a) a late aggregation of frame-level CNN features using average pooling, max pooling, or RNNs, among others, or b) spatio-temporal aggregation via 3D convolutions. The first assumes independence among frame features up to a certain level of abstraction and then performs higher-level aggregation, while the second extracts spatio-temporal features from grouped frames as early fusion. In this paper we explore the space in between these two, by letting adjacent feature branches interact as they develop into the higher-level representation. The interaction happens between feature differencing and averaging at each level of the hierarchy, and it has a convolutional structure that learns to select the appropriate mode locally, in contrast to previous works that impose one of the modes globally (e.g. feature differencing) as a design choice. We further constrain this interaction to be conservative, i.e. a local feature subtraction in one branch is compensated by an addition in another, such that the total feature flow is preserved. We evaluate the performance of our proposal on a number of existing models, i.e. TSN, TRN and ECO, to show its flexibility and effectiveness in improving action recognition performance.
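The conservative interaction between two adjacent feature branches can be illustrated with a minimal sketch. This is a hypothetical per-element formulation, not the paper's exact layer: in the actual model, the mixing coefficient is produced locally by a learned convolution over the features, whereas here `gate` is simply passed in as a scalar. The key property sketched is that any amount subtracted from one branch is added to the other, so the total feature flow is preserved.

```python
def conservative_mix(a, b, gate):
    """Blend two adjacent branch features conservatively.

    gate = 0.0 -> averaging:    both outputs become (a + b) / 2
    gate = 1.0 -> differencing: the first output becomes a - b

    `d` is the amount transferred from branch a to branch b; because it
    is subtracted from one output and added to the other, the invariant
    out_a + out_b == a + b (conservation of total feature flow) holds
    for any gate value. Hypothetical formulation for illustration only.
    """
    d = (1.0 - gate) * (a - b) / 2.0 + gate * b  # mass moved from a to b
    return a - d, b + d


# Example with scalar "features" from two adjacent branches:
a, b = 5.0, 3.0
avg_a, avg_b = conservative_mix(a, b, gate=0.0)   # -> (4.0, 4.0)
dif_a, dif_b = conservative_mix(a, b, gate=1.0)   # -> (2.0, 6.0)
```

In both modes the sum of the two outputs equals `a + b`, which is the conservation constraint the paper imposes; the learned gate lets the network pick averaging, differencing, or anything in between at each spatial location.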


Results from the Paper


Ranked #51 on Action Recognition on HMDB-51 (using extra training data)

Task               | Dataset                | Model                                   | Metric                       | Value | Global Rank | Uses Extra Training Data
Action Recognition | HMDB-51                | HF-ECOLite (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits | 71.13 | #51         | Yes
Action Recognition | Something-Something V1 | HF-TSN (ImageNet pretraining)           | Top 1 Accuracy               | 41.97 | #71         |
