TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Retrieval with Multi-Modal Query	Fashion200k	Show and Tell	Recall@1	12.3	# 6
Image Retrieval with Multi-Modal Query	Fashion200k	Show and Tell	Recall@10	40.2	# 6
Image Retrieval with Multi-Modal Query	Fashion200k	Show and Tell	Recall@50	61.8	# 6
Image Retrieval with Multi-Modal Query	MIT-States	Show and Tell	Recall@1	11.9	# 3
Image Retrieval with Multi-Modal Query	MIT-States	Show and Tell	Recall@5	31.0	# 3
Image Retrieval with Multi-Modal Query	MIT-States	Show and Tell	Recall@10	42.0	# 3
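
The Recall@K values above are not defined on this page. As a rough sketch, assuming the common retrieval convention that Recall@K is the percentage of queries for which at least one correct image appears among the top K results, the metric could be computed as follows. The function name `recall_at_k` and its input layout are hypothetical and do not come from any benchmark's official evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[Set[str]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids[i]   -- retrieved item ids for query i, best match first
    relevant_ids[i] -- set of ids counted as correct for query i
    (Both layouts are assumptions made for this illustration.)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("one ranking and one relevant set per query")
    if not ranked_ids:
        return 0.0
    # A query counts as a hit if any of its top-k items is relevant.
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranked[:k])
    )
    return 100.0 * hits / len(ranked_ids)
```

Under this definition, Recall@K can only stay the same or grow as K increases, which matches the pattern in the table.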

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/show-and-tell-a-neural-image-caption/image-retrieval-with-multi-modal-query-on-mit)](https://paperswithcode.com/sota/image-retrieval-with-multi-modal-query-on-mit?p=show-and-tell-a-neural-image-caption)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/show-and-tell-a-neural-image-caption/image-retrieval-with-multi-modal-query-on)](https://paperswithcode.com/sota/image-retrieval-with-multi-modal-query-on?p=show-and-tell-a-neural-image-caption)`

Show and Tell: A Neural Image Caption Generator

CVPR 2015 · Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
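
The abstract reports BLEU-1 scores without defining the metric. As a minimal illustration, assuming the standard BLEU formulation (clipped unigram precision multiplied by a brevity penalty), a sentence-level BLEU-1 could look like the sketch below. The paper's figures are corpus-level scores from its own evaluation setup, so this helper (the name `bleu1` is hypothetical) is not expected to reproduce them.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: List[str], references: List[List[str]]) -> float:
    """Sentence-level BLEU-1 on a 0-100 scale (illustrative sketch only)."""
    if not candidate or not references:
        return 0.0

    # For each token, the largest count seen in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        for tok, count in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], count)

    # Clipped precision: a candidate token is credited at most as many
    # times as it occurs in the most generous reference.
    cand_counts = Counter(candidate)
    clipped = sum(min(c, max_ref_counts[tok]) for tok, c in cand_counts.items())
    precision = clipped / len(candidate)

    # Brevity penalty against the reference length closest to the
    # candidate length (ties go to the shorter reference).
    cand_len = len(candidate)
    ref_len = min(
        (len(r) for r in references),
        key=lambda n: (abs(n - cand_len), n),
    )
    penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

    return 100.0 * penalty * precision
```

The clipping step is what stops a degenerate caption such as "the the the" from scoring perfectly against a reference that contains "the" only once.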

Code

Repository	Stars
karpathy/neuraltalk	5,381
yashk2810/Image-Captioning	325
jazzsaxmafia/show_and_tell.tensorfl…	290
oarriaga/neural_image_captioning	175
HughChi/Image-Caption	144

See all 76 implementations

Tasks

Image Captioning

Image Retrieval with Multi-Modal Query

Sentence

Text Generation

Text-to-Image Generation

Translation

Datasets

MS COCO

Flickr30k

MIT-States

Results from the Paper

Ranked #3 on Image Retrieval with Multi-Modal Query on MIT-States
