TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Retrieval	WIT	WIT-ALL	R@1	0.346	# 1
Image Retrieval	WIT	WIT-ALL	R@5	0.642	# 1
Image Retrieval	WIT	CC (Conceptual Captions)	R@1	0.048	# 2
Image Retrieval	WIT	CC (Conceptual Captions)	R@5	0.122	# 2

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

2 Mar 2021 · Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large, high-quality visio-linguistic datasets for learning complementary information across image and text modalities. In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset (https://github.com/google-research-datasets/wit) to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by number of image-text examples, by a factor of 3 (at the time of writing). Second, WIT is massively multilingual (the first of its kind), covering more than 100 languages (each of which has at least 12K examples) and providing cross-lingual texts for many images. Third, WIT represents a more diverse set of concepts and real-world entities than previous datasets cover. Lastly, WIT provides a very challenging real-world test set, as we empirically illustrate using an image-text retrieval task as an example.
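The retrieval scores listed for this paper are given as R@1 and R@5. R@K is conventionally read as recall at K: the fraction of queries whose correct item appears among the top K retrieved results. The sketch below illustrates that convention only. It is not the paper's evaluation code, and it assumes a hypothetical setup in which each text query has exactly one correct image and `similarity[i][j]` holds a precomputed score between query `i` and image `j`. The function name and the toy numbers are made up for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose correct item is among the top-k scores.

    Assumption (hypothetical setup): similarity[i][j] scores text query i
    against image j, and image i is the single correct match for query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity must contain at least one query")
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank image indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three queries and three images (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct image 0 is ranked first
    [0.8, 0.5, 0.2],  # query 1: correct image 1 is ranked second
    [0.7, 0.6, 0.1],  # query 2: correct image 2 is ranked third
]

r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_2 = recall_at_k(toy_similarity, k=2)
```

On the toy matrix, only query 0 ranks its correct image first, so R@1 is 1/3. Query 1 joins at K = 2, giving R@2 of 2/3. R@K can never decrease as K grows, which is consistent with the R@5 values on this page exceeding the R@1 values.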

Code

google-research-datasets/wit official

957

clip-italian/clip-italian

171

paullerner/viquae

Tasks

BIG-bench Machine Learning

Image Retrieval

Representation Learning

Retrieval

Text Retrieval

Datasets

Introduced in the Paper:

WIT

Used in the Paper:

ImageNet

MS COCO

Flickr30k

Conceptual Captions

VCR

Multi30K

mC4

Results from the Paper

Ranked #1 on Image Retrieval on WIT
