791 dataset results for Images AND English

The Attribution, Relation, and Order (ARO) benchmark systematically evaluates the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.
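The idea behind an order-sensitivity test can be sketched in a few lines. This is a minimal illustration, not ARO's actual evaluation code: it uses one simple perturbation (shuffling all words), and the names `shuffle_words`, `is_order_sensitive`, and `score_fn` are hypothetical. `score_fn(image, text)` stands in for any image-text matching score a model might produce.

```python
import random


def shuffle_words(caption, seed=0):
    """Return the caption with its words in a different order (when possible)."""
    words = caption.split()
    rng = random.Random(seed)
    shuffled = words[:]
    # Reshuffle until the order actually changes; skipped if all words are identical.
    while shuffled == words and len(set(words)) > 1:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def is_order_sensitive(score_fn, image, caption, seed=0):
    """True if the scorer prefers the original caption over its shuffled version."""
    return score_fn(image, caption) > score_fn(image, shuffle_words(caption, seed))


# Toy scorer (assumption): a bag-of-words score ignores word order entirely,
# so it cannot prefer the original caption over the shuffled one.
def bag_of_words_score(image, text):
    return len(set(text.split()))


order_blind = not is_order_sensitive(
    bag_of_words_score, None, "the horse is eating the grass"
)
```

A scorer that behaves like a bag of words fails this check by construction, which is the failure mode an order test is meant to expose.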

21 PAPERS • NO BENCHMARKS YET

IGLUE (Image-Grounded Language Understanding Evaluation)

The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

21 PAPERS • 13 BENCHMARKS

REALY (Region-aware benchmark based on the LYHM)

The REALY benchmark aims to introduce a region-aware evaluation pipeline to measure the fine-grained normalized mean square error (NMSE) of 3D face reconstruction methods from under-controlled image sets.

21 PAPERS • 2 BENCHMARKS

TextOCR

TextOCR is a dataset to benchmark text recognition on arbitrarily shaped scene text. It requires models to perform text recognition on arbitrarily shaped scene text present in natural images. TextOCR provides ~1M high-quality word annotations on TextVQA images, allowing application of end-to-end reasoning to downstream tasks such as visual question answering or image captioning.

21 PAPERS • NO BENCHMARKS YET

CholecT50 (Cholecystectomy Action Triplet)

CholecT50 is a dataset of endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with triplet information in the form of <instrument, verb, target>. The dataset is a collection of 50 videos consisting of 45 videos from the Cholec80 dataset and 5 videos from an in-house dataset of the same surgical procedure.
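An annotation in the <instrument, verb, target> form can be held in a small record type. The following is a minimal sketch, not the dataset's official loader: the class `ActionTriplet`, the helper `triplets_with_instrument`, and the example labels are illustrative assumptions.

```python
from typing import NamedTuple


class ActionTriplet(NamedTuple):
    """One <instrument, verb, target> annotation (hypothetical container)."""
    instrument: str
    verb: str
    target: str


def triplets_with_instrument(triplets, instrument):
    """Return the triplets whose instrument matches the given name."""
    return [t for t in triplets if t.instrument == instrument]


# Illustrative labels only; consult the dataset for the real vocabulary.
frame_annotations = [
    ActionTriplet("grasper", "retract", "gallbladder"),
    ActionTriplet("hook", "dissect", "gallbladder"),
]

grasper_actions = triplets_with_instrument(frame_annotations, "grasper")
```

Keeping the three fields separate makes it easy to evaluate instrument, verb, and target recognition individually as well as jointly.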

20 PAPERS • 7 BENCHMARKS

KITTI-STEP

The Segmenting and Tracking Every Pixel (STEP) benchmark consists of 21 training sequences and 29 test sequences. It is based on the KITTI Tracking Evaluation and the Multi-Object Tracking and Segmentation (MOTS) benchmark. This benchmark extends the annotations to the Segmenting and Tracking Every Pixel (STEP) task. [Copy-pasted from http://www.cvlibs.net/datasets/kitti/eval_step.php]

20 PAPERS • 2 BENCHMARKS

LSUI (Large Scale Underwater Image Dataset)

We released a large-scale underwater image (LSUI) dataset including 5004 image pairs, which involve richer underwater scenes (lighting conditions, water types and target categories) and better visual quality reference images than the existing ones.

20 PAPERS • 1 BENCHMARK

gRefCOCO

gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset that contains multi-target, no-target, and single-target expressions.

20 PAPERS • 2 BENCHMARKS

George Washington

The George Washington dataset contains 20 pages of letters written by George Washington and his associates in 1755 and is therefore categorized as a historical collection. The images are annotated at word level and contain approximately 5,000 words.

19 PAPERS • NO BENCHMARKS YET

HRSOD (High-Resolution Salient Object Detection)

Several datasets exist for saliency detection, but none of them is specifically designed for high-resolution salient object detection. The High-Resolution Salient Object Detection (HRSOD) dataset contains 1610 training images and 400 test images. The 2010 images in total were collected from Flickr under Creative Commons licenses. Pixel-level ground truths were manually annotated by 40 subjects. The shortest edge of each image in HRSOD is more than 1200 pixels.

19 PAPERS • 1 BENCHMARK

InfiMM-Eval (Complex Open-ended Reasoning Evaluation for Multi-Modal Language Models)

Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Although many benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate the CORE-MM benchmark dataset, specifically designed for MLLMs with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. The CORE-MM benchmark consists of 279 manually curated reasoning questions, associate

19 PAPERS • 1 BENCHMARK

PSG Dataset

PSG dataset has 48749 images with 133 object classes (80 objects and 53 stuff) and 56 predicate classes. It annotates inter-segment relations based on COCO panoptic segmentation.

19 PAPERS • 1 BENCHMARK

Geometry3K

A new large-scale geometry problem-solving dataset with 3,002 multi-choice geometry problems and dense annotations in formal language for the diagrams and text: 27,213 annotated diagram logic forms (literals) and 6,293 annotated text logic forms (literals).

18 PAPERS • 1 BENCHMARK

HPS (Human POSEitioning System Dataset)

HPS Dataset is a collection of 3D humans interacting with large 3D scenes (300-1000 $m^2$, up to 2500 $m^2$). The dataset contains images captured from a head-mounted camera coupled with the reference 3D pose and location of the person in a pre-scanned 3D scene. 7 people in 8 large scenes are captured performing activities such as exercising, reading, eating, lecturing, using a computer, making coffee, dancing. The dataset provides more than 300K synchronized RGB images coupled with the reference 3D pose and location.

18 PAPERS • NO BENCHMARKS YET

KITTI-C

Robo3D: The KITTI-C Benchmark. KITTI-C is an evaluation benchmark heading toward robust and reliable 3D object detection in autonomous driving. With it, we probe the robustness of 3D detectors under out-of-distribution (OoD) scenarios against corruptions that occur in the real-world environment. Specifically, we consider natural corruptions that happen in the following cases:

18 PAPERS • 2 BENCHMARKS

CLEVR-Humans

We collect a new dataset of human-posed free-form natural language questions about CLEVR images. Many of these questions have out-of-vocabulary words and require reasoning skills that are absent from our model’s repertoire.

17 PAPERS • 1 BENCHMARK

InfoSeek (Visual Information Seeking)

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training.

17 PAPERS • 2 BENCHMARKS

OpenImages-v6

OpenImages V6 is a large-scale dataset consisting of 9 million training images, 41,620 validation samples, and 125,456 test samples. It is a partially annotated dataset with 9,600 trainable classes.

17 PAPERS • 3 BENCHMARKS

PKLot (A Robust Dataset for Parking Lot Classification)

The PKLot dataset contains 12,417 images of parking lots and 695,899 images of parking spaces segmented from them, which were manually checked and labeled. All images were acquired at the parking lots of the Federal University of Parana (UFPR) and the Pontifical Catholic University of Parana (PUCPR), both located in Curitiba, Brazil.

17 PAPERS • 1 BENCHMARK

PhotoChat

PhotoChat is the first dataset that casts light on photo-sharing behavior in online messaging. PhotoChat contains 12k dialogues, each of which is paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task that retrieves the most relevant photo according to the dialogue context.

17 PAPERS • 2 BENCHMARKS

Violin (VIdeO-and-Language INference)

Video-and-Language Inference is the task of joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. The Violin dataset is a dataset for this task which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels.
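The figures quoted for Violin imply a fixed number of hypotheses per clip and a rough average clip length. The sketch below only does that arithmetic on the numbers stated above; the variable names are illustrative, and because the description says the clips span "over" 582 hours, the average duration should be read as a lower-bound estimate.

```python
# Figures taken from the dataset description above.
num_pairs = 95_322    # video-hypothesis pairs
num_clips = 15_887    # video clips
total_hours = 582     # stated as "over 582 hours"

# Average number of hypotheses attached to each clip.
hypotheses_per_clip = num_pairs / num_clips

# Approximate average clip length in seconds (lower-bound estimate).
avg_clip_seconds = total_hours * 3600 / num_clips

print(f"{hypotheses_per_clip:.1f} hypotheses per clip")
print(f"about {avg_clip_seconds:.0f} seconds per clip")
```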

17 PAPERS • NO BENCHMARKS YET

CLEVR-Ref+

CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.

16 PAPERS • 2 BENCHMARKS

MagicBrush

MagicBrush is a manually annotated, instruction-guided image editing dataset covering diverse scenarios: single-turn, multi-turn, mask-provided, and mask-free editing. MagicBrush comprises 10K (source image, instruction, target image) triples, which is sufficient to train large-scale image editing models.

16 PAPERS • NO BENCHMARKS YET

PACO (Parts and Attributes of Common Objects)

Parts and Attributes of Common Objects (PACO) is a detection dataset that goes beyond traditional object boxes and masks and provides richer annotations such as part masks and attributes. It spans 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. The dataset contains 641K part masks annotated across 260K object boxes, with half of them exhaustively annotated with attributes as well.

16 PAPERS • NO BENCHMARKS YET

RealSRSet

20 real low-resolution images selected from existing datasets or downloaded from the internet.

16 PAPERS • NO BENCHMARKS YET

SeaDronesSee (SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water)

SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. This data set provides three sets of tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own data set and leaderboard.

16 PAPERS • 3 BENCHMARKS

VALSE (VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena)

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

16 PAPERS • 12 BENCHMARKS

Casual Conversations

The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones and ambient lighting conditions.

15 PAPERS • NO BENCHMARKS YET

Dress Code

Dress Code is a new dataset for image-based virtual try-on composed of image pairs coming from different catalogs of YOOX NET-A-PORTER. The dataset contains more than 50k high-resolution model-clothing image pairs divided into three different categories (i.e. dresses, upper-body clothes, lower-body clothes).

15 PAPERS • NO BENCHMARKS YET

EasyCom

The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR)-motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.

15 PAPERS • 4 BENCHMARKS

VQA-E

VQA-E is a dataset for Visual Question Answering with Explanation, where the models are required to generate an explanation with the predicted answer. The VQA-E dataset is automatically derived from the VQA v2 dataset by synthesizing a textual explanation for each image-question-answer triple.

15 PAPERS • NO BENCHMARKS YET

XQLFW (Cross-Quality Labeled Faces in the Wild)

An evaluation protocol for face verification focusing on a large intra-pair image quality difference.

15 PAPERS • 1 BENCHMARK

e-SNLI-VE

e-SNLI-VE is a large VL (vision-language) dataset with NLEs (natural language explanations) with over 430k instances for which the explanations rely on the image content. It has been built by merging the explanations from e-SNLI and the image-sentence pairs from SNLI-VE.

15 PAPERS • 2 BENCHMARKS

Animal Kingdom

Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage used in the dataset records different times of the day in an extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.

14 PAPERS • 2 BENCHMARKS

FoodSeg103

FoodSeg103 is a new food image dataset containing 7,118 images. Images are annotated with 104 ingredient classes and each image has an average of 6 ingredient labels and pixel-wise masks. It's provided as a large-scale benchmark for food image segmentation.

14 PAPERS • 1 BENCHMARK

IDRiD (Indian Diabetic Retinopathy Image Dataset)

The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a pixel level. This dataset also provides information on the disease severity of diabetic retinopathy and diabetic macular edema for each image. This dataset is well suited to the development and evaluation of image analysis algorithms for early detection of diabetic retinopathy.

14 PAPERS • 3 BENCHMARKS

LIVECell (Label-free In Vitro image Examples of Cells)

The LIVECell (Label-free In Vitro image Examples of Cells) dataset is a large-scale microscopic image dataset for instance-segmentation of individual cells in 2D cell cultures.

14 PAPERS • 1 BENCHMARK

PubTables-1M (PubMed Tables One Million)

The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:

14 PAPERS • NO BENCHMARKS YET

SQA3D (Situated Question Answering in 3D Scenes)

SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene in which it is situated from a first-person perspective and answer questions. The questions are designed to be situated, embodied and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video and BEV picture.

14 PAPERS • 2 BENCHMARKS

CholecT45

CholecT45 is a subset of CholecT50 consisting of 45 videos from the Cholec80 dataset. It is the first public release of part of the CholecT50 dataset. CholecT50 is a dataset of 50 endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with 100 triplet classes in the form of <instrument, verb, target>.

13 PAPERS • 2 BENCHMARKS

GRIT (General Robust Image Task Benchmark)

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:

13 PAPERS • 8 BENCHMARKS

InterHuman

InterHuman is a multimodal dataset. It consists of about 107M frames for diverse two-person interactions, with accurate skeletal motions and 16,756 natural language descriptions.

13 PAPERS • 1 BENCHMARK

OVAD benchmark (Open-Vocabulary Attribute Detection)

Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute dete

13 PAPERS • 2 BENCHMARKS

University-1652

Contains data from three platforms, i.e., synthetic drones, satellites and ground cameras of 1,652 university buildings around the world. University-1652 is a drone-based geo-localization dataset and enables two new tasks, i.e., drone-view target localization and drone navigation.

13 PAPERS • 2 BENCHMARKS

COCO-MLT

COCO-MLT is created from MS COCO-2017 and contains 1,909 images from 80 classes. The maximum number of training images per class is 1,128 and the minimum is 6. We use the COCO2017 test set of 5,000 images for evaluation. The ratio of head, medium, and tail classes is 22:33:25 in COCO-MLT.
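The long-tail figures quoted for COCO-MLT can be checked with a short calculation. This sketch uses only the numbers stated above; the variable names, and the choice to summarise imbalance as the ratio of the largest to the smallest class, are illustrative assumptions.

```python
# Head / medium / tail class counts from the stated 22:33:25 ratio.
head, medium, tail = 22, 33, 25
num_classes = head + medium + tail  # should match the 80 classes

# Per-class training counts quoted in the description.
max_per_class, min_per_class = 1128, 6

# A common way to summarise long-tail severity: largest class / smallest class.
imbalance_ratio = max_per_class / min_per_class

print(f"{num_classes} classes, imbalance ratio {imbalance_ratio:.0f}:1")
```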

12 PAPERS • 2 BENCHMARKS

Chaoyang

The Chaoyang dataset contains 1,111 normal, 842 serrated, 1,404 adenocarcinoma, and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma, and 273 adenoma samples for testing. This noisy dataset was constructed in a real-world scenario.
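The per-class counts above can be tallied into split totals and class proportions. This is a small sketch over the stated numbers only; the dictionary layout and variable names are illustrative, not part of any official Chaoyang tooling.

```python
# Per-class sample counts as stated in the description.
train_counts = {"normal": 1111, "serrated": 842, "adenocarcinoma": 1404, "adenoma": 664}
test_counts = {"normal": 705, "serrated": 321, "adenocarcinoma": 840, "adenoma": 273}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Share of each class within the training split, useful for spotting imbalance.
train_share = {name: count / train_total for name, count in train_counts.items()}

print(f"train: {train_total}, test: {test_total}, all: {train_total + test_total}")
for name, share in train_share.items():
    print(f"{name}: {share:.1%} of training samples")
```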

12 PAPERS • 2 BENCHMARKS

ELPV (A dataset of functional and defective solar cells extracted from EL images of solar modules)

The dataset contains 2,624 samples of $300\times300$ pixel, 8-bit grayscale images of functional and defective solar cells with varying degrees of degradation, extracted from 44 different solar modules. The defects in the annotated images are either of intrinsic or extrinsic type and are known to reduce the power efficiency of solar modules.

12 PAPERS • NO BENCHMARKS YET

Endomapper

The Endomapper dataset is the first collection of complete endoscopy sequences acquired during regular medical practice, including slow and careful screening explorations, making secondary use of medical data. Its original purpose is to facilitate the development and evaluation of VSLAM (Visual Simultaneous Localization and Mapping) methods in real endoscopy data. The first release of the dataset is composed of 50 sequences with a total of more than 13 hours of video. It is also the first endoscopic dataset that includes both the computed geometric and photometric endoscope calibration as well as the original calibration videos. Metadata and annotations associated with the dataset range from anatomical landmark and procedure description labeling to tool segmentation masks, COLMAP 3D reconstructions, simulated sequences with ground truth, and metadata related to special cases, such as sequences from the same patient. This information will improve the research in endoscopic VSLAM, a

12 PAPERS • NO BENCHMARKS YET
