TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Object Tracking	CATER	TFCNet	Top 1 Accuracy	79.7	# 2
Video Object Tracking	CATER	TFCNet	Top 5 Accuracy	95.5	# 2
Video Object Tracking	CATER	TFCNet	L1	0.47	# 3
Action Recognition	Diving-48	TFCNet	Accuracy	88.3	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/tfcnet-temporal-fully-connected-networks-for/video-object-tracking-on-cater)](https://paperswithcode.com/sota/video-object-tracking-on-cater?p=tfcnet-temporal-fully-connected-networks-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/tfcnet-temporal-fully-connected-networks-for/action-recognition-on-diving-48)](https://paperswithcode.com/sota/action-recognition-on-diving-48?p=tfcnet-temporal-fully-connected-networks-for)`

TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal Reasoning

11 Mar 2022 · Shiwen Zhang

Temporal reasoning is an important capability for visual intelligence. In the computer vision research community, temporal reasoning is usually studied in the form of video classification, for which many state-of-the-art neural network architectures and benchmark datasets have been proposed in recent years, most notably 3D CNNs and Kinetics. However, some recent works have found that current video classification benchmarks contain strong biases towards static features and therefore cannot accurately reflect temporal modeling ability. New video classification benchmarks that aim to eliminate static biases have been proposed, and experiments on these benchmarks show that current clip-based 3D CNNs are outperformed by RNN structures and recent video transformers. In this paper, we find that 3D CNNs and their efficient depthwise variants, when a video-level sampling strategy is used, are in fact able to beat RNNs and recent vision transformers by significant margins on static-unbiased temporal reasoning benchmarks. Further, we propose the Temporal Fully Connected Block (TFC Block), an efficient and effective component that approximates fully connected layers along the temporal dimension to obtain a video-level receptive field, enhancing spatiotemporal reasoning ability. With TFC blocks inserted into Video-level 3D CNNs (V3D), our proposed TFCNets establish new state-of-the-art results on the synthetic temporal reasoning benchmark CATER and the real-world static-unbiased dataset Diving48, surpassing all previous methods.
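The abstract contrasts clip-based with video-level sampling and describes a fully connected layer along the temporal dimension, but gives no implementation details. The sketch below is only an illustration under stated assumptions, not the paper's code: the function names, the choice of the middle frame of each segment, and the plain dense layer over time (the TFC Block is said to approximate such a layer, and that approximation is not reproduced here) are all hypothetical.

```python
def clip_indices(num_frames, clip_len, stride=1, start=0):
    """Clip-based sampling: clip_len frames taken from one local window."""
    idx = [start + i * stride for i in range(clip_len)]
    if idx[-1] >= num_frames:
        raise ValueError("clip runs past the end of the video")
    return idx


def video_level_indices(num_frames, num_segments):
    """Video-level sampling (assumed variant): split the whole video into
    equal segments and take the middle frame of each one."""
    if not 0 < num_segments <= num_frames:
        raise ValueError("num_segments must be in 1..num_frames")
    seg = num_frames / num_segments
    return [int(seg * (i + 0.5)) for i in range(num_segments)]


def temporal_fc(features, weights):
    """Exact dense layer along time, applied per channel:
    out[t][c] = sum over s of weights[t][s] * features[s][c].
    features: list of T frames, each a list of C channel values.
    weights:  T_out rows, each with T entries.
    Every output step sees every input step (video-level receptive field)."""
    channels = len(features[0])
    return [
        [sum(w * frame[c] for w, frame in zip(row, features))
         for c in range(channels)]
        for row in weights
    ]


if __name__ == "__main__":
    # A 300-frame video sampled at 8 frames under each strategy.
    print(clip_indices(300, 8, stride=2))
    print(video_level_indices(300, 8))
    # Two frames, two channels, mixed across time by a 2x2 weight matrix.
    print(temporal_fc([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.0, 0.0]]))
```

With these assumptions, the clip covers only the first 15 frames of the video, while the video-level indices span the whole sequence, which is the property the abstract attributes to video-level sampling.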

Tasks

Action Recognition

Classification

Video Classification

Video Object Tracking

Datasets

CATER

Results from the Paper

Ranked #2 on Video Object Tracking on CATER
