TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Referring Expression Segmentation	Referring Expressions for DAVIS 2016 & 2017	MUTR	J&F 1st frame	68.0	# 1
Referring Expression Segmentation	Referring Expressions for DAVIS 2016 & 2017	MUTR	J	64.8	# 1
Referring Expression Segmentation	Referring Expressions for DAVIS 2016 & 2017	MUTR	F	71.3	# 1
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	MUTR	J&F	68.4	# 3
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	MUTR	J	66.4	# 3
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	MUTR	F	70.4	# 3
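
For readers unfamiliar with the metrics in the table: in the DAVIS-style evaluation protocol, J is region similarity (the Jaccard index, i.e. intersection over union between predicted and ground-truth masks), F is contour accuracy (a boundary F-measure), and J&F is the mean of the two. The sketch below is only an illustration of J and of the J&F average, not the official evaluation code; the function names `region_similarity` and `j_and_f` are hypothetical, masks are assumed to be nested lists of 0/1, and the boundary F-measure itself is not implemented.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F is the arithmetic mean of region similarity J and contour accuracy F."""
    return (j_score + f_score) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(region_similarity(pred, gt))  # IoU of the two toy masks
    print(j_and_f(66.4, 70.4))          # Refer-YouTube-VOS row of the table
```

Applied to the Refer-YouTube-VOS row, the mean of 66.4 and 70.4 reproduces the reported 68.4. For the DAVIS row the mean of the rounded values is 68.05 against a reported 68.0, which is consistent with the published J and F being rounded before display.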

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/referred-by-multi-modality-a-unified-temporal/referring-expression-segmentation-on-1)](https://paperswithcode.com/sota/referring-expression-segmentation-on-1?p=referred-by-multi-modality-a-unified-temporal)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/referred-by-multi-modality-a-unified-temporal/referring-expression-segmentation-on-refer-1)](https://paperswithcode.com/sota/referring-expression-segmentation-on-refer-1?p=referred-by-multi-modality-a-unified-temporal)`

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

25 May 2023 · Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. The task is challenging because it requires exploring both the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. Offering a unified framework for the first time, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. First, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Second, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, which contributes to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets, with text and audio references respectively, MUTR improves J&F by +4.2% and +8.7% over state-of-the-art methods, demonstrating its significance for unified multi-modal VOS. Code is released at https://github.com/OpenGVLab/MUTR.
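
The low-level temporal aggregation described in the abstract (a reference gathering visual cues from consecutive frames) can be pictured with a toy cross-attention step. The sketch below is not the MUTR implementation: it assumes a single reference vector, single-head scaled dot-product attention with no learned projections, one feature scale, and a residual update, and the name `aggregate_reference` is hypothetical. See the released code for the actual module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_reference(reference, frame_tokens):
    """Let one reference embedding attend over visual tokens from several frames.

    reference:    list[float] of dimension d (e.g. a text or audio embedding)
    frame_tokens: list of frames, each a list of visual tokens (list[float], dim d)
    Returns the reference plus its attention-weighted summary of all tokens.
    """
    # Flatten tokens from all consecutive frames into one key/value set.
    tokens = [tok for frame in frame_tokens for tok in frame]
    d = len(reference)
    # Scaled dot-product similarity between the reference (query) and each token.
    scores = [sum(r * t for r, t in zip(reference, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    # Weighted sum of the visual tokens.
    attended = [sum(w * tok[i] for w, tok in zip(weights, tokens))
                for i in range(d)]
    # Residual connection (an assumption of this sketch).
    return [r + a for r, a in zip(reference, attended)]


if __name__ == "__main__":
    ref = [1.0, 0.0]
    frames = [[[1.0, 0.0], [0.0, 1.0]],   # tokens from frame t
              [[0.5, 0.5]]]               # tokens from frame t+1
    print(aggregate_reference(ref, frames))
```

Because tokens from all frames share one softmax, the updated reference mixes information across time, which is the intuition behind giving the text or audio signal "temporal knowledge" before it enters the transformer.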

Code

opengvlab/mutr official

Tasks

Object

Referring Expression Segmentation

Referring Video Object Segmentation

Semantic Segmentation

Video Object Segmentation

Video Semantic Segmentation

Datasets

RefCOCO

YouTube-VOS 2018

Referring Expressions for DAVIS 2016 & 2017

Refer-YouTube-VOS

Results from the Paper

Ranked #1 on Referring Expression Segmentation on Referring Expressions for DAVIS 2016 & 2017

Methods

VOS
