TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	LSMDC	JSFusion	text-to-video R@1	9.1	# 34
Video Retrieval	LSMDC	JSFusion	text-to-video R@5	21.2	# 30
Video Retrieval	LSMDC	JSFusion	text-to-video R@10	34.1	# 28
Video Retrieval	LSMDC	JSFusion	text-to-video Median Rank	36	# 18
Video Retrieval	MSR-VTT	JSFusion	text-to-video R@1	10.2	# 35
Video Retrieval	MSR-VTT	JSFusion	text-to-video R@10	43.2	# 30
Video Retrieval	MSR-VTT	JSFusion	text-to-video Median Rank	13	# 14
Video Retrieval	MSR-VTT	JSFusion	video-to-text R@5	31.2	# 12
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video R@1	10.2	# 56
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video R@5	31.2	# 55
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video R@10	43.2	# 58
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video Median Rank	13	# 38

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-joint-sequence-fusion-model-for-video/video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/video-retrieval-on-lsmdc?p=a-joint-sequence-fusion-model-for-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-joint-sequence-fusion-model-for-video/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=a-joint-sequence-fusion-model-for-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-joint-sequence-fusion-model-for-video/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=a-joint-sequence-fusion-model-for-video)`

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

ECCV 2018 · Youngjae Yu, Jongseok Kim, Gunhee Kim

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model on three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.
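The idea of a dense pairwise representation can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `joint_tensor` and `toy_similarity` are hypothetical, the element-wise product stands in for the learned fusion and attention, and fixed pooling stands in for the Convolutional Hierarchical Decoder. It only shows how two sequences of lengths T_v and T_s become a 3D tensor of shape (T_v, T_s, D) that is then reduced to one score.

```python
import numpy as np


def joint_tensor(video_feats, sentence_feats):
    """Pair every video frame with every word.

    video_feats has shape (T_v, D) and sentence_feats has shape (T_s, D).
    Cell [i, j] of the (T_v, T_s, D) result is the element-wise product of
    frame i and word j (an assumed stand-in for the learned fusion).
    """
    return video_feats[:, None, :] * sentence_feats[None, :, :]


def toy_similarity(tensor):
    """Collapse the 3D tensor into a scalar score with fixed pooling.

    Summing over the feature axis gives a (T_v, T_s) match map; the score
    is the best-matching frame per word, averaged over words.
    """
    match_map = tensor.sum(axis=2)
    return float(match_map.max(axis=0).mean())


rng = np.random.default_rng(0)
video = rng.normal(size=(5, 8))      # 5 frames, 8-dim features (toy values)
sentence = rng.normal(size=(3, 8))   # 3 words, 8-dim features (toy values)

tensor = joint_tensor(video, sentence)
print(tensor.shape)                  # (5, 3, 8)
print(toy_similarity(tensor))
```

In the model described above, both the composition of the tensor and its reduction to a score are learned; the sketch fixes them only to make the tensor shapes concrete.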


Code


antoine77340/howto100m

237

ruc-aimc-lab/nt2vr

Tasks


Multiple-choice

Question Answering

Retrieval

Semantic Similarity

Semantic Textual Similarity

Sentence

Video Question Answering

Video Retrieval

Visual Question Answering (VQA)

Datasets

Introduced in the Paper:

MSRVTT-MC

Used in the Paper:

MSR-VTT

LSMDC

Results from the Paper


Ranked #34 on Video Retrieval on LSMDC


Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Video Retrieval	LSMDC	JSFusion	text-to-video R@1	9.1	# 34
Video Retrieval	LSMDC	JSFusion	text-to-video R@5	21.2	# 30
Video Retrieval	LSMDC	JSFusion	text-to-video R@10	34.1	# 28
Video Retrieval	LSMDC	JSFusion	text-to-video Median Rank	36	# 18
Video Retrieval	MSR-VTT	JSFusion	text-to-video R@1	10.2	# 35
Video Retrieval	MSR-VTT	JSFusion	text-to-video R@10	43.2	# 30
Video Retrieval	MSR-VTT	JSFusion	text-to-video Median Rank	13	# 14
Video Retrieval	MSR-VTT	JSFusion	video-to-text R@5	31.2	# 12
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video R@1	10.2	# 56
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video R@5	31.2	# 55
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video R@10	43.2	# 58
Video Retrieval	MSR-VTT-1kA	JSFusion	text-to-video Median Rank	13	# 38
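
The R@K and Median Rank metrics in the table can be computed from a query-by-video similarity matrix. The sketch below uses only the Python standard library and a made-up 4x4 matrix; the helper names `retrieval_ranks` and `recall_at_k` are hypothetical, and it assumes the correct video for query i sits at index i. It illustrates the metric definitions and does not reproduce the paper's evaluation code or data.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the rank (1 = best) of the correct video for each query.

    similarity[i][j] is the score between text query i and video j.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Count the videos scored strictly higher than the correct one.
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy scores, invented for illustration only.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.6, 0.1],  # query 1: correct video ranked 3rd
    [0.2, 0.4, 0.7, 0.1],  # query 2: correct video ranked 1st
    [0.3, 0.9, 0.2, 0.5],  # query 3: correct video ranked 2nd
]

ranks = retrieval_ranks(toy_similarity)
print(ranks)                  # [1, 3, 1, 2]
print(recall_at_k(ranks, 1))  # 50.0
print(recall_at_k(ranks, 2))  # 75.0
print(median(ranks))          # 1.5
```

Higher R@K is better, while a lower Median Rank is better, which is why the two kinds of rows in the table move in opposite directions for a stronger model.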

