Hierarchical interaction network for video object segmentation from referring expressions

In this paper, we investigate the problem of video object segmentation from referring expressions (VOSRE). Conventional methods typically perform multi-modal fusion between linguistic features and the visual features extracted from the top layer of the visual encoder, which limits these models' ability to represent multi-modal inputs at different levels of semantic and spatial granularity. To address this issue, we present an end-to-end hierarchical interaction network (HINet) for the VOSRE problem. Our model leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features, allowing various linguistic concepts (e.g., object attributes and categories) to be represented flexibly at different levels of the multi-modal features. Moreover, we extract signals of moving objects from the optical flow input and use them as complementary cues: a motion gating mechanism highlights the referent and suppresses the background. In contrast to previous methods, this strategy allows our model to make online predictions without requiring the whole video as input. Despite its simplicity, the proposed HINet improves over the previous state of the art on the DAVIS-16, DAVIS-17, and J-HMDB datasets for the VOSRE task, demonstrating its effectiveness and generality.
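
The abstract describes an architecture rather than providing code, so below is a minimal PyTorch-style sketch of the two ideas it highlights: fusing a sentence embedding with every level of the visual feature pyramid, and gating the fused features with a motion cue derived from optical flow. The module layout, the dimensions, and the specific fusion and gating operators (1x1 projections plus a sigmoid gate) are illustrative assumptions, not the authors' HINet implementation.

```python
# Illustrative sketch only; layer choices and shapes are assumptions,
# not the authors' HINet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalFusionSketch(nn.Module):
    """Fuse a sentence embedding with every level of a visual feature
    pyramid, then gate the fused features with a motion cue from optical flow."""

    def __init__(self, pyramid_channels=(256, 512, 1024, 2048),
                 lang_dim=768, fused_dim=256, flow_dim=64):
        super().__init__()
        # One 1x1 projection per pyramid level, applied after concatenating
        # the (spatially broadcast) language vector with the visual features.
        self.fuse = nn.ModuleList(
            nn.Conv2d(c + lang_dim, fused_dim, kernel_size=1)
            for c in pyramid_channels
        )
        # Motion gating: map optical-flow features to a [0, 1] mask that
        # emphasizes moving regions and suppresses static background.
        self.motion_gate = nn.Conv2d(flow_dim, 1, kernel_size=1)
        # Simple prediction head on the merged multi-level features.
        self.head = nn.Conv2d(fused_dim, 1, kernel_size=1)

    def forward(self, pyramid, lang_feat, flow_feat):
        # pyramid:   list of (B, C_l, H_l, W_l) maps, finest level first
        # lang_feat: (B, lang_dim) sentence embedding
        # flow_feat: (B, flow_dim, H, W) features computed from optical flow
        fused_levels = []
        for level, proj in zip(pyramid, self.fuse):
            b, _, h, w = level.shape
            lang_map = lang_feat[:, :, None, None].expand(b, -1, h, w)
            fused = F.relu(proj(torch.cat([level, lang_map], dim=1)))
            # Resize every level to the finest level's resolution before merging.
            fused_levels.append(
                F.interpolate(fused, size=pyramid[0].shape[-2:],
                              mode="bilinear", align_corners=False)
            )
        merged = torch.stack(fused_levels, dim=0).sum(dim=0)

        gate = torch.sigmoid(self.motion_gate(
            F.interpolate(flow_feat, size=merged.shape[-2:],
                          mode="bilinear", align_corners=False)
        ))
        gated = merged * gate            # highlight the moving referent
        return self.head(gated)          # (B, 1, H0, W0) segmentation logits
```

In this sketch the pyramid would come from a standard backbone (e.g., a ResNet) and the flow features from any off-the-shelf optical-flow network; the gate simply rescales the fused features so that static background regions contribute less to the final prediction. Because each frame only needs its own flow, the model can run online without seeing the whole video.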

Results from the Paper


Ranked #1 on Referring Expression Segmentation on J-HMDB (Precision@0.9 metric).

Task: Referring Expression Segmentation. Each group below lists a dataset and model, followed by rows of metric name, metric value, and global leaderboard rank.

A2D Sentences, RefVOS
    Precision@0.5    0.578    # 19
    Precision@0.6    0.534    # 17
    Precision@0.7    0.456    # 17
    Precision@0.8    0.311    # 16
    Precision@0.9    0.093    # 14
    IoU overall      0.672    # 12
    IoU mean         0.497    # 20

A2D Sentences, HINet
    Precision@0.5    0.611    # 16
    Precision@0.6    0.559    # 16
    Precision@0.7    0.486    # 15
    Precision@0.8    0.342    # 12
    Precision@0.9    0.12     # 11
    IoU overall      0.679    # 10
    IoU mean         0.529    # 17

DAVIS 2017 (val), HINet
    J&F 1st frame    50.2     # 7
    J&F Full video   47.9     # 2

J-HMDB, RefVOS
    Precision@0.5    0.731    # 15
    Precision@0.6    0.62     # 14
    Precision@0.7    0.392    # 9
    Precision@0.8    0.088    # 10
    Precision@0.9    0.0      # 11
    IoU overall      0.606    # 11
    IoU mean         0.568    # 16

J-HMDB, HINet
    Precision@0.5    0.819    # 8
    Precision@0.6    0.736    # 8
    Precision@0.7    0.542    # 8
    Precision@0.8    0.168    # 5
    Precision@0.9    0.4      # 1
    IoU overall      0.652    # 7
    IoU mean         0.627    # 8
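
For reference, the metrics above are the standard ones for referring segmentation benchmarks: Precision@K is the fraction of test samples whose predicted mask has an IoU with the ground truth above the threshold K, IoU overall pools intersection and union over the whole test set, and IoU mean averages the per-sample IoU. The NumPy sketch below shows one way to compute them; the function and variable names are ours, not taken from the paper or any benchmark toolkit.

```python
# Hedged sketch of the standard metrics used above; names are illustrative.
import numpy as np


def segmentation_metrics(pred_masks, gt_masks,
                         thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """pred_masks, gt_masks: lists of boolean arrays with matching shapes."""
    inters, unions, per_sample_iou = [], [], []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inters.append(inter)
        unions.append(union)
        # Convention: an empty prediction of an empty ground truth counts as IoU 1.
        per_sample_iou.append(inter / union if union > 0 else 1.0)

    per_sample_iou = np.array(per_sample_iou)
    results = {
        # "IoU overall": total intersection over total union across the test set.
        "overall_iou": float(np.sum(inters) / np.sum(unions)),
        # "IoU mean": average of the per-sample IoU values.
        "mean_iou": float(per_sample_iou.mean()),
    }
    # Precision@K: fraction of samples whose IoU exceeds the threshold K.
    for k in thresholds:
        results[f"precision@{k}"] = float((per_sample_iou > k).mean())
    return results
```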
