Video Sparse Transformer With Attention-Guided Memory for Video Object Detection

IEEE Access 2022  ·  Masato Fujitake, Akihiro Sugimoto

Detecting objects in video, known as Video Object Detection (VOD), is challenging because objects' appearances change over time and can cause detection errors. Recent research aggregates features from adjacent frames to compensate for a frame's deteriorated appearance, and the use of distant frames has also been proposed to handle appearances that remain deteriorated over several frames. Since an object's position may change significantly in a distant frame, such methods use only features of object candidate regions, which are independent of position. However, these methods depend on the detection quality of the candidate regions themselves and therefore remain impractical when appearances are deteriorated. In this paper, we enhance features element-wise before object candidate region detection, proposing the Video Sparse Transformer with Attention-guided Memory (VSTAM). Furthermore, we propose aggregating element-wise features sparsely to reduce processing time and memory cost. In addition, we introduce an external memory update strategy based on the utilization of the aggregation, which holds long-term information effectively. Our method achieves 8.3% and 11.1% accuracy gains over the baseline on the ImageNet VID and UA-DETRAC datasets, respectively, and demonstrates superior performance against state-of-the-art results on widely used VOD datasets.
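The abstract's two core ideas, sparse element-wise feature aggregation and a utilization-guided external memory update, can be sketched in a few lines. The paper's actual architecture is not reproduced here; the function names, top-k sparsity scheme, and eviction rule below are illustrative assumptions, showing only the general mechanism: each current-frame feature element attends to a small top-k subset of memory features, and memory slots that are rarely attended to are the ones evicted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_aggregate(query, memory, k=4):
    """Attend each query element to only its top-k memory elements (illustrative).

    query:  (N, D) element-wise features of the current frame
    memory: (M, D) features from support frames / external memory
    Returns enhanced features (N, D) and per-slot utilization (M,).
    """
    scores = query @ memory.T / np.sqrt(query.shape[1])       # (N, M)
    # keep only the top-k scores per query element; mask the rest out
    topk = np.argsort(scores, axis=1)[:, -k:]
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, topk,
                      np.take_along_axis(scores, topk, axis=1), axis=1)
    attn = softmax(masked, axis=1)                            # sparse: mostly zeros
    enhanced = query + attn @ memory                          # residual aggregation
    utilization = attn.sum(axis=0)                            # how much each slot was used
    return enhanced, utilization

def update_memory(memory, new_feats, utilization):
    """Hypothetical utilization-guided update: evict least-used slots, insert new features."""
    keep = memory.shape[0] - new_feats.shape[0]
    survivors = np.argsort(utilization)[::-1][:keep]          # most-utilized slots survive
    return np.concatenate([memory[survivors], new_feats], axis=0)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))    # current-frame element features
mem = rng.standard_normal((64, 32))  # external memory
out, util = sparse_aggregate(q, mem, k=4)
mem = update_memory(mem, rng.standard_normal((8, 32)), util)
```

Restricting each query element to k memory elements makes the attention cost O(N·k·D) after scoring, and the utilization signal falls out of the same attention weights at no extra cost, which is the appeal of coupling aggregation and memory maintenance.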


Results from the Paper

Task                         | Dataset                 | Model | Metric  | Value | Global Rank
-----------------------------|-------------------------|-------|---------|-------|------------
Video Object Detection       | ImageNet VID            | VSTAM | mAP     | 91.1  | # 2
Object Detection             | UA-DETRAC               | VSTAM | mAP     | 90.39 | # 1
Video Instance Segmentation  | YouTube-VIS validation  | VSTAM | mask AP | 39.0  | # 28