TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Cross-Modal Retrieval	RSICD	GeoRSCLIP-FT	Mean Recall	38.87%	# 1
Cross-Modal Retrieval	RSICD	GeoRSCLIP-FT	Image-to-text R@1	21.13%	# 1
Cross-Modal Retrieval	RSICD	GeoRSCLIP-FT	text-to-image R@1	15.59%	# 1
Text Retrieval	RSICD	GeoRSCLIP-FT	Recall@1	15.59%	# 1
Image-to-Text Retrieval	RSICD	GeoRSCLIP-FT	Image to Text Recall@1	22.14%	# 1
Cross-Modal Retrieval	RSITMD	GeoRSCLIP-FT	Mean Recall	51.81%	# 1
Cross-Modal Retrieval	RSITMD	GeoRSCLIP-FT	Image-to-text R@1	32.30%	# 1
Cross-Modal Retrieval	RSITMD	GeoRSCLIP-FT	text-to-image R@1	25.04%	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rs5m-a-large-scale-vision-language-dataset/cross-modal-retrieval-on-rsicd)](https://paperswithcode.com/sota/cross-modal-retrieval-on-rsicd?p=rs5m-a-large-scale-vision-language-dataset)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rs5m-a-large-scale-vision-language-dataset/text-retrieval-on-rsicd)](https://paperswithcode.com/sota/text-retrieval-on-rsicd?p=rs5m-a-large-scale-vision-language-dataset)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rs5m-a-large-scale-vision-language-dataset/image-to-text-retrieval-on-rsicd)](https://paperswithcode.com/sota/image-to-text-retrieval-on-rsicd?p=rs5m-a-large-scale-vision-language-dataset)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rs5m-a-large-scale-vision-language-dataset/cross-modal-retrieval-on-rsitmd)](https://paperswithcode.com/sota/cross-modal-retrieval-on-rsitmd?p=rs5m-a-large-scale-vision-language-dataset)`

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

20 Jun 2023 · Zilun Zhang, Tiancheng Zhao, Yulong Guo, Jianwei Yin

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform the domain-specific transfer for accomplishing domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and captioning label-only RS datasets with a pre-trained VLM. These constitute the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$ in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal Text-Image Retrieval (RSCTIR) and $4\%\sim5\%$ in Semantic Localization (SeLo) tasks. Dataset and models have been released at https://github.com/om-ai-lab/RS5M.
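
The retrieval figures reported for RSICD and RSITMD are Recall@K values and a mean recall. As a rough illustration of how such numbers are typically computed from an image-text similarity matrix, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names, the toy similarity matrix, and the convention that mean recall averages R@K over both retrieval directions and over a fixed set of K values (often 1, 5, and 10 in remote sensing retrieval work) are assumptions made for this example.

```python
from typing import List, Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    similarity[q][c] is the score between query q and candidate c;
    relevant[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


def mean_recall(i2t: Sequence[Sequence[float]],
                t2i: Sequence[Sequence[float]],
                i2t_gold: Sequence[Set[int]],
                t2i_gold: Sequence[Set[int]],
                ks: Sequence[int] = (1, 5, 10)) -> float:
    """Average of R@K over both directions (assumed convention, see lead-in)."""
    values: List[float] = [recall_at_k(i2t, i2t_gold, k) for k in ks]
    values += [recall_at_k(t2i, t2i_gold, k) for k in ks]
    return sum(values) / len(values)


# Hypothetical toy data: 3 images (rows) scored against 3 captions (columns),
# where caption i is the only correct match for image i.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
sim_t = [list(col) for col in zip(*sim)]  # text-to-image scores (transpose)
gold = [{0}, {1}, {2}]

print(round(recall_at_k(sim, gold, 1), 2))    # image-to-text R@1
print(round(recall_at_k(sim_t, gold, 1), 2))  # text-to-image R@1
print(round(mean_recall(sim, sim_t, gold, gold, ks=(1, 2)), 2))
```

In a real evaluation the similarity matrix would come from the cosine similarity between image and text embeddings produced by the model, and each image usually has several reference captions, so the relevance sets contain more than one index.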

Code

om-ai-lab/rs5m official

155

Tasks

Cross-Modal Retrieval

Image Retrieval

Image-to-Text Retrieval

Language Modelling

Retrieval

Text Retrieval

Zero-Shot Learning

Datasets

EuroSAT

LAION-400M

RESISC45

fMoW

CC12M

WIT

BigEarthNet

RSICD

RedCaps

Million-AID

RSITMD

Results from the Paper

Ranked #1 on Cross-Modal Retrieval on RSITMD (using extra training data)

Methods

CLIP • Diffusion
