Image Retrieval

666 papers with code • 54 benchmarks • 75 datasets

Image Retrieval is a fundamental, long-standing computer vision task: given a query, find the most similar images in a large database. It is often treated as a form of fine-grained, instance-level classification. Alongside classification and detection, it is integral to image recognition, and it also holds substantial business value by helping users discover images that match their interests or requirements, guided by visual similarity or other criteria.
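As a rough illustration of "finding images similar to a query", the sketch below ranks database items by cosine similarity between descriptor vectors. It assumes each image has already been mapped to a fixed-length descriptor by some model; the toy 3-D vectors and the function names `l2_normalize` and `retrieve` are hypothetical and chosen only for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve(query, database, k=5):
    """Return indices and scores of the k database rows most similar to the query."""
    q = l2_normalize(query[None, :])[0]
    db = l2_normalize(database)
    scores = db @ q                    # cosine similarity to every database item
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]      # highest similarity first
    return top, scores[top]

# Toy database of 4 "images", each a hypothetical 3-D descriptor.
database = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.8, 0.2, 0.0])
indices, scores = retrieve(query, database, k=2)
print(indices, scores)
```

Exhaustive search like this scales linearly with database size; large systems typically replace it with an approximate nearest-neighbour index.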

(Image credit: DELF)

Benchmarks

These leaderboards are used to track progress in Image Retrieval.

Dataset	Best Model
ROxford (Medium)	Hypergraph propagation+Community selection
RParis (Medium)	Hypergraph propagation
ROxford (Hard)	SuperGlobal
RParis (Hard)	SuperGlobal
CREPE (Compositional REPresentation Evaluation)	ViT-L-14 (LAION400M)
Flickr30K 1K test	X-VLM (base)
Fashion IQ	SPRC
SOP	Unicom+ViT-L@336px
Oxf5k	Offline Diffusion
Flickr30k-CN	InternVL-G-FT
CIRR	SPRC
iNaturalist	Unicom+ViT-L@336px
Oxf105k	Offline Diffusion
MUGE Retrieval	CN-CLIP (ViT-H/14)
COCO-CN	CN-CLIP (ViT-H/14)
CUB-200-2011	CGD (MG/SG)
CARS196	CGD (MG/SG)
Par6k	Offline Diffusion
Par106k	Offline Diffusion
In-Shop	CGD (SG/GS)
Flickr30k	BLIP-2 ViT-G (zero-shot, 1K test set)
MS COCO	BLIP-2 ViT-G (fine-tuned)
AmsterTime	DINOv2 distilled (ViT-L/14 frozen)
PhotoChat	PaCE
ConQA Descriptive	CLIP
ConQA Conceptual	CLIP
DeepFashion - Consumer-to-shop	CTL Model (ResNet50-IBN-A, 320x320)
Exact Street2Shop	CTL Model (ResNet50-IBN-A, 320x320)
LaSCo	CASE
DeepPatent	SwinV2
24/7 Tokyo	HED-N-GAN
street2shop - topwear	Ranknet
INRIA Holidays	MultiGrain R50 @ 800
Paris6k	IME layer
Oxford5k	GNN-Reranking
AIC-ICC	ERNIE-ViL2.0
WIT	WIT-ALL
CBVS	UniCLP
NUS-WIDE	LESA
DeepFashion	STIR
Google Landmarks Dataset v2 (retrieval, testing)	ResNet101+ArcFace GLDv2-train-clean
Google Landmarks Dataset v2 (retrieval, validation)	ResNet101+ArcFace GLDv2-train-clean
INSTRE	IME layer
CIFAR-10	Custom: 3 conv + 2 fcn
ImageCoDe	ContextualCLIP
PKU-Reid	IHDA
PKU SketchRe-ID Dataset	IHDA
FETA Car-Manuals	FETA's CLIP-MIL (Many-Shot Image-to-text)
FooDI-ML (Global)	ADAPT-I2T
FooDI-ML (Spain)	ADAPT-I2T
Localized Narratives	OPT
ICFG-PEDES	SSAN
RUC-CAS-WenLan	CMCL
ROxford Medium without fine-tuning	HesAff–rSIFT–VLAD

Libraries

Use these libraries to find Image Retrieval models and implementations.

Library	Papers	GitHub stars
huggingface/transformers	4	124,984
OML-Team/open-metric-learning	4	762
kornia/kornia	2	9,377
salesforce/lavis	2	8,724

See all 10 libraries.

Subtasks

Medical Image Retrieval

Multi-Label Image Retrieval

Face Image Retrieval

Video-to-Shop

Image Instance Retrieval

Semi-Supervised Sketch Based Image Retrieval

Chat-based Image Retrieval

Most implemented papers

Emerging Properties in Self-Supervised Vision Transformers

facebookresearch/dino • ICCV 2021

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).

VGGFace2: A dataset for recognising faces across pose and age

deepinsight/insightface • 23 Oct 2017

The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise.

NetVLAD: CNN architecture for weakly supervised place recognition

Relja/netvlad • CVPR 2016

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph.

Fine-tuning CNN Image Retrieval with No Human Annotation

filipradenovic/cnnimageretrieval-pytorch • 3 Nov 2017

We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval.

Large-Scale Image Retrieval with Attentive Deep Local Features

tensorflow/models • ICCV 2017

We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature).

Circle Loss: A Unified Perspective of Pair Similarity Optimization

layumi/Person_reID_baseline_pytorch • CVPR 2020

This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity $s_p$ and minimize the between-class similarity $s_n$.
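To make the roles of $s_p$ and $s_n$ concrete, here is a minimal NumPy sketch of a loss in the form usually written for Circle Loss: each similarity is re-weighted by its distance from an optimum, so $s_p$ is pushed up and $s_n$ is pushed down. The `margin` and `gamma` values and the function names are illustrative assumptions, not settings taken from this page.

```python
import numpy as np

def _logsumexp(x):
    """Numerically stable log(sum(exp(x))) for a 1-D array."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def circle_loss(s_p, s_n, margin=0.25, gamma=64.0):
    """Circle-style loss for one anchor.

    s_p: within-class similarities, s_n: between-class similarities.
    margin and gamma are illustrative hyperparameters (assumed values).
    """
    s_p = np.asarray(s_p, dtype=float)
    s_n = np.asarray(s_n, dtype=float)
    o_p, o_n = 1.0 + margin, -margin          # optima for s_p and s_n
    delta_p, delta_n = 1.0 - margin, margin   # decision margins
    alpha_p = np.clip(o_p - s_p, 0.0, None)   # larger weight when s_p is far below its optimum
    alpha_n = np.clip(s_n - o_n, 0.0, None)   # larger weight when s_n is far above its optimum
    logit_p = -gamma * alpha_p * (s_p - delta_p)
    logit_n = gamma * alpha_n * (s_n - delta_n)
    # log(1 + sum_j exp(logit_n) * sum_i exp(logit_p)), computed stably
    return np.logaddexp(0.0, _logsumexp(logit_n) + _logsumexp(logit_p))

well_separated = circle_loss([0.90, 0.95], [0.10, 0.05])
poorly_separated = circle_loss([0.20, 0.30], [0.80, 0.70])
print(well_separated, poorly_separated)
```

The loss is small when within-class similarities are high and between-class similarities are low, and grows quickly when the two are confused.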

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis • 30 Jan 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task • NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.

DINOv2: Learning Robust Visual Features without Supervision

facebookresearch/dinov2 • 14 Apr 2023

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision.

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

fartashf/vsepp • 18 Jul 2017

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval.
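The "hard negatives" in the title can be illustrated with a ranking loss that, for each image and each caption in a batch, penalizes only the most confusing mismatched partner. The sketch below is a simplified NumPy version of that general idea under stated assumptions: embeddings are already L2-normalized, row i of each matrix belongs to the same image-caption pair, and the `margin` value and function name are hypothetical.

```python
import numpy as np

def max_hinge_loss(img_emb, txt_emb, margin=0.2):
    """Hinge ranking loss using the hardest in-batch negative per query.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matching pair.
    margin is an illustrative value (assumption).
    """
    scores = img_emb @ txt_emb.T                 # scores[i, j] = sim(image i, caption j)
    pos = np.diag(scores)                        # similarity of the true pairs
    # Image i as query: violation for each wrong caption j.
    cost_txt = np.clip(margin + scores - pos[:, None], 0.0, None)
    # Caption j as query: violation for each wrong image i.
    cost_img = np.clip(margin + scores - pos[None, :], 0.0, None)
    np.fill_diagonal(cost_txt, 0.0)              # true pairs are not negatives
    np.fill_diagonal(cost_img, 0.0)
    # Keep only the hardest negative for each query, in both directions.
    return cost_txt.max(axis=1).sum() + cost_img.max(axis=0).sum()

eye = np.eye(3)
aligned = max_hinge_loss(eye, eye)                       # every pair matches perfectly
shuffled = max_hinge_loss(eye, np.roll(eye, 1, axis=0))  # every caption paired with the wrong image
print(aligned, shuffled)
```

Taking the maximum rather than the sum over negatives concentrates the gradient on the single most confusing example for each query.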
