Visual-Textual Capsule Routing for Text-Based Video Segmentation

Joint understanding of vision and natural language is a challenging problem with a wide range of applications in artificial intelligence. In this work, we focus on the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a capsule-based approach that performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and the textual input in the form of capsules, which provide a more effective representation than standard convolution-based features. Our novel visual-textual routing mechanism fuses the video and text capsules to localize the actor and action. Existing works on actor-action localization focus mainly on a single frame rather than the full video; in contrast, we perform localization on all frames of the video. To validate the potential of the proposed network for actor and action video localization, we extend an existing actor-action dataset (A2D) with annotations for all frames. The experimental evaluation demonstrates the effectiveness of our capsule network for text-selective actor and action localization in videos. The proposed method also improves upon existing state-of-the-art methods for single-frame localization.
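To make the routing idea concrete, below is a minimal sketch of one visual-textual routing step, assuming dynamic routing in the style of Sabour et al., with text capsule poses gating the visual poses elementwise before the votes are computed. The function name, tensor shapes, and the elementwise gating are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Squashing non-linearity: keeps direction, maps norm into [0, 1).
    sq = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)

def visual_textual_routing(visual_caps, text_caps, W, iters=3):
    """Fuse visual and text capsules with agreement-based routing.

    visual_caps: (N_in, D_in)  per-location visual capsule poses
    text_caps:   (N_in, D_in)  text capsule poses broadcast per location
    W:           (N_in, N_out, D_in, D_out) transformation matrices
    Returns:     (N_out, D_out) fused output capsule poses
    """
    # Condition visual poses on the text poses (elementwise gating is an
    # assumption; the paper's exact fusion may differ).
    conditioned = visual_caps * text_caps                        # (N_in, D_in)
    # Each input capsule votes for every output capsule.
    u_hat = np.einsum('id,iodk->iok', conditioned, W)            # (N_in, N_out, D_out)
    b = np.zeros(u_hat.shape[:2])                                # routing logits
    for _ in range(iters):
        # Numerically stable softmax over output capsules.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)                     # coupling coeffs
        s = (c[..., None] * u_hat).sum(axis=0)                   # weighted votes
        v = squash(s)                                            # (N_out, D_out)
        b = b + np.einsum('iok,ok->io', u_hat, v)                # agreement update
    return v

# Usage: 16 input capsules of dim 8 routed to 4 output capsules of dim 8.
rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 8))
txt = rng.standard_normal((16, 8))
W = 0.1 * rng.standard_normal((16, 4, 8, 8))
print(visual_textual_routing(vis, txt, W).shape)  # (4, 8)
```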

Task: Referring Expression Segmentation    Model: VT-Capsule

Dataset         Metric           Value   Global Rank
A2D Sentences   Precision@0.5    0.526   #21
A2D Sentences   Precision@0.6    0.450   #21
A2D Sentences   Precision@0.7    0.345   #21
A2D Sentences   Precision@0.8    0.207   #21
A2D Sentences   Precision@0.9    0.036   #21
A2D Sentences   AP               0.303   #17
A2D Sentences   IoU overall      0.568   #23
A2D Sentences   IoU mean         0.460   #22
J-HMDB          Precision@0.5    0.677   #18
J-HMDB          Precision@0.6    0.513   #18
J-HMDB          Precision@0.7    0.283   #17
J-HMDB          Precision@0.8    0.051   #15
J-HMDB          Precision@0.9    0.000   #11
J-HMDB          AP               0.261   #14
J-HMDB          IoU overall      0.535   #19
J-HMDB          IoU mean         0.550   #17
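For reference, the metrics above are typically computed on these benchmarks as follows: Precision@K is the fraction of samples whose predicted mask has IoU above threshold K with the ground truth, overall IoU pools intersections and unions across the whole dataset, mean IoU averages per-sample IoU, and AP averages precision over IoU thresholds 0.50:0.05:0.95. A minimal sketch, assuming boolean mask pairs and these standard conventions:

```python
import numpy as np

def segmentation_metrics(pred_masks, gt_masks,
                         thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """pred_masks, gt_masks: lists of boolean arrays, one pair per sample."""
    inter_total = union_total = 0.0
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        # Treat an empty union (both masks empty) as a perfect match.
        ious.append(inter / union if union > 0 else 1.0)
        inter_total += inter
        union_total += union
    ious = np.asarray(ious)
    metrics = {f'Precision@{t}': float((ious > t).mean()) for t in thresholds}
    metrics['IoU overall'] = float(inter_total / union_total)
    metrics['IoU mean'] = float(ious.mean())
    # AP averaged over IoU thresholds 0.50:0.05:0.95, the convention
    # commonly used on these leaderboards (an assumption).
    ap_ts = np.arange(0.50, 0.96, 0.05)
    metrics['AP'] = float(np.mean([(ious > t).mean() for t in ap_ts]))
    return metrics
```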
