VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

18 Jul 2017 · Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
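
The abstract does not spell out the "simple change" to the loss. As a rough illustration, the sketch below contrasts a conventional sum-of-hinges triplet ranking loss with a max-of-hinges variant that penalizes only the hardest negative in the batch, which is how the hard-negative idea is usually described for VSE++. The function names, the margin of 0.2, and the toy similarity matrix are assumptions made for this example and are not taken from the official code.

```python
def hinge(x):
    """Standard hinge: max(0, x)."""
    return max(0.0, x)


def sum_of_hinges(scores, margin=0.2):
    """Ranking loss that sums violations over ALL negatives in the batch.

    scores[i][j] is the similarity between image i and caption j;
    matching pairs sit on the diagonal. Requires at least two pairs.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        for j in range(n):
            if j == i:
                continue
            total += hinge(margin + scores[i][j] - pos)  # negative captions for image i
            total += hinge(margin + scores[j][i] - pos)  # negative images for caption i
    return total


def max_of_hinges(scores, margin=0.2):
    """Hard-negative variant: only the highest-scoring negative counts."""
    n = len(scores)
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        hardest_caption = max(scores[i][j] for j in range(n) if j != i)
        hardest_image = max(scores[j][i] for j in range(n) if j != i)
        total += hinge(margin + hardest_caption - pos)
        total += hinge(margin + hardest_image - pos)
    return total


# Toy 3x3 similarity matrix (illustrative values only).
toy_scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
print("sum of hinges:", sum_of_hinges(toy_scores))
print("max of hinges:", max_of_hinges(toy_scores))
```

Because the max-of-hinges loss keeps only the largest violation per direction, it can never exceed the sum-of-hinges loss on the same batch; the gradient is concentrated on the single negative that is currently ranked closest to the positive.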

Code

fartashf/vsepp (official, 486 stars)

cshizhe/hgr_v2t (205 stars)

Cadene/recipe1m.bootstrap.pytorch

mitjanikolaus/compositional-image-c…

salanueva/UniVSE

See all 10 implementations

Tasks

Cross-Modal Retrieval

Image Retrieval

Retrieval

Structured Prediction

Visual Reasoning

Datasets

MS COCO

Flickr30k

Results from the Paper

Ranked #23 on Cross-Modal Retrieval on Flickr30k

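Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Cross-Modal Retrieval	Flickr30k	VSE++ (ResNet)	Image-to-text R@1	52.9	# 23
Cross-Modal Retrieval	Flickr30k	VSE++ (ResNet)	Image-to-text R@5	80.5	# 22
Cross-Modal Retrieval	Flickr30k	VSE++ (ResNet)	Image-to-text R@10	87.2	# 22
Cross-Modal Retrieval	Flickr30k	VSE++ (ResNet)	Text-to-image R@1	39.6	# 23
Cross-Modal Retrieval	Flickr30k	VSE++ (ResNet)	Text-to-image R@5	70.1	# 22
Cross-Modal Retrieval	Flickr30k	VSE++ (ResNet)	Text-to-image R@10	79.5	# 23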
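
The R@K columns report recall at K, which this page does not define. Under the usual retrieval convention, it is the percentage of queries for which at least one correct item appears among the top K ranked results. The sketch below shows that computation on made-up rankings; `recall_at_k`, the ranked lists, and the relevance sets are hypothetical and do not reproduce the evaluation script behind the numbers above.

```python
def recall_at_k(ranked_lists, relevant_sets, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_lists[q]  : candidate ids for query q, best match first
    relevant_sets[q] : set of ids counted as correct for query q
    """
    hits = 0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_lists)


# Three toy queries over four candidates (illustrative only).
rankings = [
    [3, 0, 1, 2],
    [1, 2, 0, 3],
    [0, 3, 2, 1],
]
relevant = [{0}, {1}, {2}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.1f}")
```

Using a set of relevant ids per query lets the same function cover image-to-text retrieval, where an image may have several correct captions, and text-to-image retrieval, where each caption has one correct image.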
