Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

8 Dec 2022 · Mustafa Shukor, Nicolas Thome, Matthieu Cord

Vision-Language Pretraining (VLP) and foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with structured input data, such as cooking applications, remains little investigated. In this work, we propose to leverage these techniques for structured-text-based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g., CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. We conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. Finally, we validate the generalization of the approach to other tasks (i.e., Food Recognition) and to domains with structured text, such as the Medical domain, on the ROCO dataset. The code is available here: https://github.com/mshukor/VLPCook
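To make the pretraining objective concrete, below is a minimal, illustrative sketch (not the authors' code) of the standard symmetric image-text contrastive loss used in VLP-style pretraining, here applied to image embeddings and structured-recipe-text embeddings. The encoders producing the embeddings and the temperature value are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss.

    image_emb, text_emb: (batch, dim) embeddings of images and of the
    corresponding structured recipe texts; row i of each is a matched pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched (image, recipe) pairs lie on the diagonal of the logits matrix.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```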


Results from the Paper


Task                  | Dataset   | Model          | Metric Name       | Metric Value | Global Rank
--------------------- | --------- | -------------- | ----------------- | ------------ | -----------
Cross-Modal Retrieval | Recipe1M  | VLPCook (R1M+) | Image-to-text R@1 | 74.9         | #1
Cross-Modal Retrieval | Recipe1M  | VLPCook (R1M+) | Text-to-image R@1 | 75.6         | #1
Cross-Modal Retrieval | Recipe1M  | VLPCook        | Image-to-text R@1 | 73.6         | #2
Cross-Modal Retrieval | Recipe1M  | VLPCook        | Text-to-image R@1 | 74.7         | #2
Cross-Modal Retrieval | Recipe1M+ | VLPCook        | Image-to-text R@1 | 45.2         | #1
Cross-Modal Retrieval | Recipe1M+ | VLPCook        | Text-to-image R@1 | 47.3         | #1
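For reference, the Recall@K metric reported above can be computed as sketched below. This is an illustrative sketch only (the evaluation protocol for Recipe1M typically samples candidate pools, e.g. 1k or 10k test subsets); the function name and arguments are assumptions, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 1) -> float:
    """Image-to-text Recall@K: fraction of images whose matching recipe
    (assumed to share the same index) appears in the top-k retrieved texts."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                    # (n_images, k)
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```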
