TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Retrieval	LSMDC	SSML	text-to-video R@1	4.2	# 14
Zero-Shot Video Retrieval	LSMDC	SSML	text-to-video R@5	11.6	# 14
Zero-Shot Video Retrieval	LSMDC	SSML	text-to-video R@10	17.1	# 14
Zero-Shot Video Retrieval	MSR-VTT	SSML	text-to-video R@1	8.0	# 34
Zero-Shot Video Retrieval	MSR-VTT	SSML	text-to-video R@5	21.3	# 33
Zero-Shot Video Retrieval	MSR-VTT	SSML	text-to-video R@10	29.3	# 33
Visual Question Answering	MSRVTT-QA	SSML	Accuracy	0.35	# 3
Zero-Shot Video Retrieval	MSVD	SSML	text-to-video R@1	13.66	# 12
Zero-Shot Video Retrieval	MSVD	SSML	text-to-video R@5	35.7	# 11
Zero-Shot Video Retrieval	MSVD	SSML	text-to-video R@10	47.74	# 11
Video Retrieval	MSVD	SSML	text-to-video R@1	20.3	# 23
Video Retrieval	MSVD	SSML	text-to-video R@5	49.0	# 21
Video Retrieval	MSVD	SSML	text-to-video R@10	63.3	# 21
Video Retrieval	MSVD	SSML	text-to-video Median Rank	6.0	# 16
Video Retrieval	MSVD	SSML	text-to-video R@50	--	# 2
Video Retrieval	MSVD	SSML	text-to-video Mean Rank	--	# 16
Visual Question Answering (VQA)	MSVD-QA	SSML	Accuracy	0.351	# 31

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/noise-estimation-using-density-estimation-for/visual-question-answering-on-msrvtt-qa-2)](https://paperswithcode.com/sota/visual-question-answering-on-msrvtt-qa-2?p=noise-estimation-using-density-estimation-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/noise-estimation-using-density-estimation-for/zero-shot-video-retrieval-on-msvd)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msvd?p=noise-estimation-using-density-estimation-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/noise-estimation-using-density-estimation-for/zero-shot-video-retrieval-on-lsmdc)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-lsmdc?p=noise-estimation-using-density-estimation-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/noise-estimation-using-density-estimation-for/video-retrieval-on-msvd)](https://paperswithcode.com/sota/video-retrieval-on-msvd?p=noise-estimation-using-density-estimation-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/noise-estimation-using-density-estimation-for/visual-question-answering-on-msvd-qa-1)](https://paperswithcode.com/sota/visual-question-answering-on-msvd-qa-1?p=noise-estimation-using-density-estimation-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/noise-estimation-using-density-estimation-for/zero-shot-video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/zero-shot-video-retrieval-on-msr-vtt?p=noise-estimation-using-density-estimation-for)`

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

6 Mar 2020 · Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein

One of the key factors of enabling machine learning models to comprehend and solve real-world tasks is to leverage multimodal data. Unfortunately, annotation of multimodal data is challenging and expensive. Recently, self-supervised multimodal methods that combine vision and language were proposed to learn multimodal representations without annotation. However, these methods often choose to ignore the presence of high levels of noise and thus yield sub-optimal results. In this work, we show that the problem of noise estimation for multimodal data can be reduced to a multimodal density estimation task. Using multimodal density estimation, we propose a noise estimation building block for multimodal representation learning that is based strictly on the inherent correlation between different modalities. We demonstrate how our noise estimation can be broadly integrated and achieves comparable results to state-of-the-art performance on five different benchmark datasets for two challenging multimodal tasks: Video Question Answering and Text-To-Video Retrieval. Furthermore, we provide a theoretical probabilistic error bound substantiating our empirical results and analyze failure cases. Code: https://github.com/elad-amrani/ssml.
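The abstract's central idea, scoring each vision-language pair by how dense its neighbourhood is in both modalities at once, can be illustrated with a small sketch. This is only one plausible reading of "multimodal density estimation" and is not claimed to be the authors' formulation; the linked repository is the reference. The function names, the k-nearest-neighbour score, the use of the minimum over the two modalities, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def density_scores(video_emb, text_emb, k=2):
    """Hypothetical k-NN density score for each (video, text) pair.

    Assumption: two pairs count as neighbours only if they are close in
    BOTH modalities, so the joint similarity is the minimum of the video
    similarity and the text similarity. A pair whose score is low has few
    neighbours that agree across modalities and is treated as likely noisy.
    """
    n = len(video_emb)
    scores = []
    for i in range(n):
        joint = [
            min(cosine(video_emb[i], video_emb[j]),
                cosine(text_emb[i], text_emb[j]))
            for j in range(n) if j != i
        ]
        top_k = sorted(joint, reverse=True)[:k]
        scores.append(sum(top_k) / len(top_k))
    return scores


# Toy data: pairs 0-2 agree across modalities; pair 3 has a video that
# resembles the others but a text embedding that does not (a noisy pair).
video = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.92, 0.08]]
text = [[0.0, 1.0], [0.1, 0.9], [0.05, 0.95], [1.0, 0.0]]

scores = density_scores(video, text, k=2)
noisiest = min(range(len(scores)), key=lambda i: scores[i])
```

In a training loop, scores of this kind could serve as per-sample weights so that likely-noisy pairs contribute less to the loss; whether and how the paper does this should be checked against its text and code.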

Code

elad-amrani/ssml official

Tasks

Density Estimation

Noise Estimation

Question Answering

Representation Learning

Retrieval

Text to Video Retrieval

Video Question Answering

Video Retrieval

Visual Question Answering

Visual Question Answering (VQA)

Zero-Shot Video Retrieval

Datasets

Visual Question Answering

MSR-VTT

MSVD

HowTo100M

LSMDC

MSRVTT-QA

MSVD-QA

Results from the Paper

Ranked #3 on Visual Question Answering on MSRVTT-QA (Accuracy metric)

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Zero-Shot Video Retrieval	LSMDC	SSML	text-to-video R@1	4.2	# 14
Zero-Shot Video Retrieval	LSMDC	SSML	text-to-video R@5	11.6	# 14
Zero-Shot Video Retrieval	LSMDC	SSML	text-to-video R@10	17.1	# 14
Zero-Shot Video Retrieval	MSR-VTT	SSML	text-to-video R@1	8.0	# 34
Zero-Shot Video Retrieval	MSR-VTT	SSML	text-to-video R@5	21.3	# 33
Zero-Shot Video Retrieval	MSR-VTT	SSML	text-to-video R@10	29.3	# 33
Visual Question Answering	MSRVTT-QA	SSML	Accuracy	0.35	# 3
Zero-Shot Video Retrieval	MSVD	SSML	text-to-video R@1	13.66	# 12
Zero-Shot Video Retrieval	MSVD	SSML	text-to-video R@5	35.7	# 11
Zero-Shot Video Retrieval	MSVD	SSML	text-to-video R@10	47.74	# 11
Video Retrieval	MSVD	SSML	text-to-video R@1	20.3	# 23
Video Retrieval	MSVD	SSML	text-to-video R@5	49.0	# 21
Video Retrieval	MSVD	SSML	text-to-video R@10	63.3	# 21
Video Retrieval	MSVD	SSML	text-to-video Median Rank	6.0	# 16
Video Retrieval	MSVD	SSML	text-to-video R@50	--	# 2
Video Retrieval	MSVD	SSML	text-to-video Mean Rank	--	# 16
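
The retrieval metrics reported here (R@1, R@5, R@10, Median Rank, Mean Rank) follow the standard text-to-video definitions. As a minimal sketch of how such numbers are typically computed, assuming a square similarity matrix in which `sim[i][j]` scores text query `i` against video `j` and video `i` is the single correct match for query `i`: the function names and the toy matrix below are illustrative and are not taken from the paper's code.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct video for each text query."""
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 plus the number of videos scored strictly higher
        # than the correct one.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K for each k, plus median and mean rank."""
    ranks = ranks_from_similarity(sim)
    metrics = {f"R@{k}": recall_at_k(ranks, k) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = sum(ranks) / len(ranks)
    return metrics


# Toy example with three queries and three videos.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video is ranked 1st
    [0.8, 0.5, 0.3],  # query 1: correct video is ranked 2nd
    [0.1, 0.2, 0.7],  # query 2: correct video is ranked 1st
]
metrics = retrieval_metrics(sim, ks=(1, 2))
```

Higher R@K is better, while lower Median Rank and Mean Rank are better. Evaluation protocols differ in details such as tie handling and test-split size, so values are only comparable under the same benchmark setup.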

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank	Source Paper
Visual Question Answering (VQA)	MSVD-QA	SSML	Accuracy	0.351	# 31	--

Methods

No methods listed for this paper.
