TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Recognition In Videos	AVA v2.1	YOWO+LFB*	mAP (Val)	19.2	# 1
Action Recognition In Videos	AVA v2.2	YOWO+LFB*	mAP (Val)	20.2	# 1
Action Detection	J-HMDB	YOWO + LFB	Video-mAP 0.2	88.3	# 2
Action Detection	J-HMDB	YOWO + LFB	Video-mAP 0.5	85.9	# 3
Action Detection	J-HMDB	YOWO + LFB	Frame-mAP 0.5	75.7	# 3
Action Detection	J-HMDB	YOWO	Video-mAP 0.2	87.8	# 3
Action Detection	J-HMDB	YOWO	Video-mAP 0.5	85.7	# 4
Action Detection	J-HMDB	YOWO	Frame-mAP 0.5	74.4	# 4
Action Detection	UCF101-24	YOWO + LFB	Video-mAP 0.2	78.6	# 6
Action Detection	UCF101-24	YOWO + LFB	Video-mAP 0.1	86.1	# 1
Action Detection	UCF101-24	YOWO + LFB	Video-mAP 0.5	53.1	# 6
Action Detection	UCF101-24	YOWO + LFB	Frame-mAP 0.5	87.3	# 2
Action Detection	UCF101-24	YOWO	Video-mAP 0.2	75.8	# 10
Action Detection	UCF101-24	YOWO	Video-mAP 0.1	82.5	# 3
Action Detection	UCF101-24	YOWO	Video-mAP 0.5	48.8	# 11
Action Detection	UCF101-24	YOWO	Frame-mAP 0.5	80.4	# 4
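
The thresholds in the metric names above (0.1, 0.2, 0.5) are intersection-over-union (IoU) thresholds: a detection counts as correct only if its overlap with the ground truth reaches the threshold. The sketch below shows the frame-level case for a single pair of boxes. The `(x1, y1, x2, y2)` box layout and the helper names `box_iou` and `is_match` are assumptions made for illustration, not taken from the YOWO code; Video-mAP applies the same idea to whole action tubes rather than single boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough to count."""
    return box_iou(pred, truth) >= threshold


pred = (5, 0, 15, 10)
truth = (0, 0, 10, 10)
print(round(box_iou(pred, truth), 3))        # overlap 50, union 150
print(is_match(pred, truth, threshold=0.5))  # too little overlap at 0.5
print(is_match(pred, truth, threshold=0.2))  # accepted at the looser 0.2
```

The same prediction can fail at 0.5 and pass at 0.2, which is why scores at looser thresholds are higher than scores at 0.5 for the same model.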

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/you-only-watch-once-a-unified-cnn/action-recognition-in-videos-on-ava-v2-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ava-v2-1?p=you-only-watch-once-a-unified-cnn)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/you-only-watch-once-a-unified-cnn/action-recognition-in-videos-on-ava-v2-2)](https://paperswithcode.com/sota/action-recognition-in-videos-on-ava-v2-2?p=you-only-watch-once-a-unified-cnn)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/you-only-watch-once-a-unified-cnn/action-detection-on-ucf101-24)](https://paperswithcode.com/sota/action-detection-on-ucf101-24?p=you-only-watch-once-a-unified-cnn)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/you-only-watch-once-a-unified-cnn/action-detection-on-j-hmdb)](https://paperswithcode.com/sota/action-detection-on-j-hmdb?p=you-only-watch-once-a-unified-cnn)`

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

15 Nov 2019 · Okan Köpüklü, Xiangyu Wei, Gerhard Rigoll

Spatiotemporal action localization requires incorporating two sources of information into the architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract this information with separate networks and use an extra fusion mechanism to produce detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches that extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. YOWO is fast, providing 34 frames per second on 16-frame input clips and 62 frames per second on 8-frame input clips, which currently makes it the fastest state-of-the-art architecture for the spatiotemporal action localization task. YOWO outperforms the previous state-of-the-art results on J-HMDB-21 and UCF101-24 by ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on the AVA dataset. We make our code and pretrained models publicly available.
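
To put the reported speeds in perspective, the short sketch below converts them into a per-inference time budget. It assumes that each reported frame corresponds to one forward pass over a clip, which the abstract does not state explicitly; the function name `ms_per_inference` is illustrative only.

```python
# Reported throughput from the abstract, keyed by input clip length in frames.
REPORTED_FPS = {16: 34.0, 8: 62.0}


def ms_per_inference(fps):
    """Time budget per forward pass in milliseconds, assuming one pass per frame."""
    return 1000.0 / fps


for clip_len, fps in REPORTED_FPS.items():
    print(f"{clip_len}-frame clips: {fps:.0f} fps, "
          f"about {ms_per_inference(fps):.1f} ms per inference")

# Relative speedup obtained by halving the clip length from 16 to 8 frames.
speedup = REPORTED_FPS[8] / REPORTED_FPS[16]
print(f"speedup from 16 to 8 frames: {speedup:.2f}x")
```

Under that assumption, halving the clip length gives somewhat less than a 2x speedup, which would be consistent with part of the cost being independent of clip length.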

Code

wei-tim/YOWO official

816

zwtu/YOWO-Paddle

BoChenUIUC/YOWO

Stepphonwol/my_yowo

Tasks

Action Detection

Action Localization

Action Recognition In Videos

Datasets

JHMDB

AVA

UCF101-24

Results from the Paper

Ranked #1 on Action Recognition In Videos on AVA v2.2

