🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task

Filter by Language

2661 dataset results for Images

A-OKVQA is crowdsourced visual question answering dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer.

104 PAPERS • 1 BENCHMARK

CC12M (Conceptual 12M)

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training.

104 PAPERS • NO BENCHMARKS YET

FreiHAND

FreiHAND is a 3D hand pose dataset which records different hand actions performed by 32 people. For each hand image, MANO-based 3D hand pose annotations are provided. It currently contains 32,560 unique training samples and 3960 unique samples for evaluation. The training samples are recorded with a green screen background allowing for background removal. In addition, it applies three different post processing strategies to training samples for data augmentation. However, these post processing strategies are not applied to evaluation samples.

104 PAPERS • 1 BENCHMARK

CSIQ

CSIQ (Categorical Subjective Image Quality)

The CSIQ database consists of 30 original images, each is distorted using six different types of distortions at four to five different levels of distortion. CSIQ images are subjectively rated base on a linear displacement of the images across four calibrated LCD monitors placed side by side with equal viewing distance to the observer. The database contains 5000 subjective ratings from 35 different observers, and ratings are reported in the form of DMOS.

102 PAPERS • 1 BENCHMARK

Penn Action

The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence.

102 PAPERS • 4 BENCHMARKS

SegTrack-v2

SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video.

102 PAPERS • 4 BENCHMARKS

Stylized ImageNet

The Stylized-ImageNet dataset is created by removing local texture cues in ImageNet while retaining global shape information on natural images via AdaIN style transfer. This nudges CNNs towards learning more about shapes and less about local textures.

101 PAPERS • 1 BENCHMARK

PubLayNet

PubLayNet is a dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated.

100 PAPERS • 1 BENCHMARK

Tiny Images

The image dataset TinyImages contains 80 million images of size 32×32 collected from the Internet, crawling the words in WordNet.

100 PAPERS • NO BENCHMARKS YET

WFLW (Wider Facial Landmarks in the Wild)

The Wider Facial Landmarks in the Wild or WFLW database contains 10000 faces (7500 for training and 2500 for testing) with 98 annotated landmarks. This database also features rich attribute annotations in terms of occlusion, head pose, make-up, illumination, blur and expressions.

100 PAPERS • 4 BENCHMARKS

HO-3D

A hand-object interaction dataset with 3D pose annotations of hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from YCB dataset. The annotations are automatically obtained using an optimization algorithm. The hand pose annotations for the test set are withheld and the accuracy of the algorithms on the test set can be evaluated with standard metrics using the CodaLab challenge submission(see project page). The object pose annotations for the test and train set are provided along with the dataset.

99 PAPERS • 2 BENCHMARKS

FRGC (Face Recognition Grand Challenge)

The data for FRGC consists of 50,000 recordings divided into training and validation partitions. The training partition is designed for training algorithms and the validation partition is for assessing performance of an approach in a laboratory setting. The validation partition consists of data from 4,003 subject sessions. A subject session is the set of all images of a person taken each time a person's biometric data is collected and consists of four controlled still images, two uncontrolled still images, and one three-dimensional image. The controlled images were taken in a studio setting, are full frontal facial images taken under two lighting conditions and with two facial expressions (smiling and neutral). The uncontrolled images were taken in varying illumination conditions; e.g., hallways, atriums, or outside. Each set of uncontrolled images contains two expressions, smiling and neutral. The 3D image was taken under controlled illumination conditions. The 3D images consist of bo

98 PAPERS • 1 BENCHMARK

VIST (Visual Storytelling)

The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on Flickr. The albums included 10 to 50 images and all the images in an album are taken in a 48-hour span. The stories were created by workers on Amazon Mechanical Turk, where the workers were instructed to choose five images from the album and write a story about them. Every story has five sentences, and every sentence is paired with its appropriate image. The dataset is split into 3 subsets, a training set (80%), a validation set (10%) and a test set (10%). All the words and interpunction signs in the stories are separated by a space character and all the location names are replaced with the word location. All the names of people are replaced with the words male or female depending on the gender of the person.

98 PAPERS • 2 BENCHMARKS

Visual7W

Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers. Each question starts with one of the seven Ws, what, where, when, who, why, how and which. It is collected from 47,300 COCO images and it has 327,929 QA pairs, together with 1,311,756 human-generated multiple-choices and 561,459 object groundings from 36,579 categories.

98 PAPERS • 1 BENCHMARK

BP4D

The BP4D-Spontaneous dataset is a 3D video database of spontaneous facial expressions in a diverse group of young adults. Well-validated emotion inductions were used to elicit expressions of emotion and paralinguistic communication. Frame-level ground-truth for facial actions was obtained using the Facial Action Coding System. Facial features were tracked in both 2D and 3D domains using both person-specific and generic approaches. The database includes forty-one participants (23 women, 18 men). They were 18 – 29 years of age; 11 were Asian, 6 were African-American, 4 were Hispanic, and 20 were Euro-American. An emotion elicitation protocol was designed to elicit emotions of participants effectively. Eight tasks were covered with an interview process and a series of activities to elicit eight emotions. The database is structured by participants. Each participant is associated with 8 tasks. For each task, there are both 3D and 2D videos. As well, the Metadata include manually annotated

96 PAPERS • 3 BENCHMARKS

Cholec80 (Surgical Workflow Dataset)

Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured at 25 fps and downsampled to 1 fps for processing. The whole dataset is labeled with the phase and tool presence annotations. The phases have been defined by a senior surgeon in Strasbourg hospital, France. Since the tools are sometimes hardly visible in the images and thus difficult to be recognized visually, a tool is defined as present in an image if at least half of the tool tip is visible.

96 PAPERS • 2 BENCHMARKS

NCLT (North Campus Long-Term Vision and LiDAR)

The NCLT dataset is a large scale, long-term autonomy dataset for robotics research collected on the University of Michigan’s North Campus. The dataset consists of omnidirectional imagery, 3D lidar, planar lidar, GPS, and proprioceptive sensors for odometry collected using a Segway robot. The dataset was collected to facilitate research focusing on long-term autonomous operation in changing environments. The dataset is comprised of 27 sessions spaced approximately biweekly over the course of 15 months. The sessions repeatedly explore the campus, both indoors and outdoors, on varying trajectories, and at different times of the day across all four seasons. This allows the dataset to capture many challenging elements including: moving obstacles (e.g., pedestrians, bicyclists, and cars), changing lighting, varying viewpoint, seasonal and weather changes (e.g., falling leaves and snow), and long-term structural changes caused by construction projects.

96 PAPERS • NO BENCHMARKS YET

RVL-CDIP

The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. The dataset has 320,000 training, 40,000 validation and 40,000 test images. The images are characterized by low quality, noise, and low resolution, typically 100 dpi.

96 PAPERS • 3 BENCHMARKS

ImageNet-32

Imagenet32 is a huge dataset made up of small images called the down-sampled version of Imagenet. Imagenet32 is composed of 1,281,167 training data and 50,000 test data with 1,000 labels.

93 PAPERS • 4 BENCHMARKS

CAMO (Camouflaged Object)

Camouflaged Object (CAMO) dataset specifically designed for the task of camouflaged object segmentation. We focus on two categories, i.e., naturally camouflaged objects and artificially camouflaged objects, which usually correspond to animals and humans in the real world, respectively. Camouflaged object images consists of 1250 images (1000 images for the training set and 250 images for the testing set). Non-camouflaged object images are collected from the MS-COCO dataset (1000 images for the training set and 250 images for the testing set). CAMO has objectness mask ground-truth.

92 PAPERS • 2 BENCHMARKS

CUHK-SYSU (CUHK-SYSU Person Search Dataset)

The CUKL-SYSY dataset is a large scale benchmark for person search, containing 18,184 images and 8,432 identities. Different from previous re-id benchmarks, matching query persons with manually cropped pedestrians, this dataset is much closer to real application scenarios by searching person from whole images in the gallery.

92 PAPERS • 2 BENCHMARKS

Cambridge Landmarks

Cambridge Landmarks, a large scale outdoor visual relocalisation dataset taken around Cambridge University. Contains original video, with extracted image frames labelled with their 6-DOF camera pose and a visual reconstruction of the scene. If you use this data, please cite our paper: Alex Kendall, Matthew Grimes and Roberto Cipolla "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization." Proceedings of the International Conference on Computer Vision (ICCV), 2015.

92 PAPERS • NO BENCHMARKS YET

ImageCLEF-DA

The ImageCLEF-DA dataset is a benchmark dataset for ImageCLEF 2014 domain adaptation challenge, which contains three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I) and Pascal VOC 2012 (P). For each domain, there are 12 categories and 50 images in each category.

92 PAPERS • 5 BENCHMARKS

InBreast

Rationale and objectives: Computer-aided detection and diagnosis (CAD) systems have been developed in the past two decades to assist radiologists in the detection and diagnosis of lesions seen on breast imaging exams, thus providing a second opinion. Mammographic databases play an important role in the development of algorithms aiming at the detection and diagnosis of mammary lesions. However, available databases often do not take into consideration all the requirements needed for research and study purposes. This article aims to present and detail a new mammographic database.

90 PAPERS • 2 BENCHMARKS

PoseTrack

The PoseTrack dataset is a large-scale benchmark for multi-person pose estimation and tracking in videos. It requires not only pose estimation in single frames, but also temporal tracking across frames. It contains 514 videos including 66,374 frames in total, split into 300, 50 and 208 videos for training, validation and test set respectively. For training videos, 30 frames from the center are annotated. For validation and test videos, besides 30 frames from the center, every fourth frame is also annotated for evaluating long range articulated tracking. The annotations include 15 body keypoints location, a unique person id and a head bounding box for each person instance.

90 PAPERS • 5 BENCHMARKS

FaceWarehouse

FaceWarehouse is a 3D facial expression database that provides the facial geometry of 150 subjects, covering a wide range of ages and ethnic backgrounds.

89 PAPERS • NO BENCHMARKS YET

MM-Vet

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

89 PAPERS • 1 BENCHMARK

SA-1B

SA-1B consists of 11M diverse, high resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks.

89 PAPERS • NO BENCHMARKS YET

SYSU-MM01

The SYSU-MM01 is a dataset collected for the Visible-Infrared Re-identification problem. The images in the dataset were obtained from 491 different persons by recording them using 4 RGB and 2 infrared cameras. Within the dataset, the persons are divided into 3 fixed splits to create training, validation and test sets. In the training set, there are 20284 RGB and 9929 infrared images of 296 persons. The validation set contains 1974 RGB and 1980 infrared images of 99 persons. The testing set consists of the images of 96 persons where 3803 infrared images are used as query and 301 randomly selected RGB images are used as gallery.

89 PAPERS • 2 BENCHMARKS

VGG Face

The VGG Face dataset is face identity recognition dataset that consists of 2,622 identities. It contains over 2.6 million images.

88 PAPERS • NO BENCHMARKS YET

JAFFE (Japanese Female Facial Expression)

The JAFFE dataset consists of 213 images of different facial expressions from 10 different Japanese female subjects. Each subject was asked to do 7 facial expressions (6 basic facial expressions and neutral) and the images were annotated with average semantic ratings on each facial expression by 60 annotators.

87 PAPERS • 4 BENCHMARKS

Raindrop

Raindrop is a set of image pairs, where each pair contains exactly the same background scene, yet one is degraded by raindrops and the other one is free from raindrops. To obtain this, the images are captured through two pieces of exactly the same glass: one sprayed with water, and the other is left clean. The dataset consists of 1,119 pairs of images, with various background scenes and raindrops. They were captured with a Sony A6000 and a Canon EOS 60.

87 PAPERS • 1 BENCHMARK

iSUN

iSUN is a ground truth of gaze traces on images from the SUN dataset. The collection is partitioned into 6,000 images for training, 926 for validation and 2,000 for test.

87 PAPERS • NO BENCHMARKS YET

IDD (Indian Driving Dataset)

IDD is a dataset for road scene understanding in unstructured environments used for semantic segmentation and object detection for autonomous driving. It consists of 10,004 images, finely annotated with 34 classes collected from 182 drive sequences on Indian roads.

86 PAPERS • NO BENCHMARKS YET

Mapillary Vistas Dataset

Mapillary Vistas Dataset is a diverse street-level imagery dataset with pixel‑accurate and instance‑specific human annotations for understanding street scenes around the world.

86 PAPERS • 3 BENCHMARKS

CrowdPose

The CrowdPose dataset contains about 20,000 images and a total of 80,000 human poses with 14 labeled keypoints. The test set includes 8,000 images. The crowded images containing homes are extracted from MSCOCO, MPII and AI Challenger.

85 PAPERS • 2 BENCHMARKS

Kvasir (The Kvasir Dataset)

The KVASIR Dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images obtained from the GI tract via an endoscopy procedure. The dataset is composed of images that are annotated and verified by medical doctors, and captures 8 different classes. The classes are based on three anatomical landmarks (z-line, pylorus, cecum), three pathological findings (esophagitis, polyps, ulcerative colitis) and two other classes (dyed and lifted polyps, dyed resection margins) related to the polyp removal process. Overall, the dataset contains 8,000 endoscopic images, with 1,000 image examples per class.

85 PAPERS • 3 BENCHMARKS

McMaster

The McMaster dataset is a dataset for color demosaicing, which contains 18 cropped images of size 500×500.

84 PAPERS • 6 BENCHMARKS

PadChest

PadChest is a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score.

84 PAPERS • NO BENCHMARKS YET

LUNA16

The LUNA16 (LUng Nodule Analysis) dataset is a dataset for lung segmentation. It consists of 1,186 lung nodules annotated in 888 CT scans.

83 PAPERS • 1 BENCHMARK

MSU-MFSD

The MSU-MFSD dataset contains 280 video recordings of genuine and attack faces. 35 individuals have participated in the development of this database with a total of 280 videos. Two kinds of cameras with different resolutions (720×480 and 640×480) were used to record the videos from the 35 individuals. For the real accesses, each individual has two video recordings captured with the Laptop cameras and Android, respectively. For the video attacks, two types of cameras, the iPhone and Canon cameras were used to capture high definition videos on each of the subject. The videos taken with Canon camera were then replayed on iPad Air screen to generate the HD replay attacks while the videos recorded by the iPhone mobile were replayed itself to generate the mobile replay attacks. Photo attacks were produced by printing the 35 subjects’ photos on A3 papers using HP colour printer. The recording videos with respect to the 35 individuals were divided into training (15 subjects with 120 videos) an

83 PAPERS • 1 BENCHMARK

Omniglot

The Omniglot data set is designed for developing more human-like learning algorithms. It contains 1623 different handwritten characters from 50 different alphabets. Each of the 1623 characters was drawn online via Amazon's Mechanical Turk by 20 different people. Each image is paired with stroke data, a sequences of [x,y,t] coordinates with time (t) in milliseconds.

83 PAPERS • 15 BENCHMARKS

VOC 2012 (The PASCAL Visual Object Classes Challenge 2012)

see detailed use case on code implementation of the paper 'Tell Me Where To Look: Guided Attention Inference Networks'

83 PAPERS • NO BENCHMARKS YET

Kuzushiji-MNIST

Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images). Since MNIST restricts us to 10 classes, the authors chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST. Kuzushiji is a Japanese cursive writing style.

82 PAPERS • 2 BENCHMARKS

SUN360 (Scene UNderstanding 360° panorama)

The goal of the SUN360 panorama database is to provide academic researchers in computer vision, computer graphics and computational photography, cognition and neuroscience, human perception, machine learning and data mining, with a comprehensive collection of annotated panoramas covering 360x180-degree full view for a large variety of environmental scenes, places and the objects within. To build the core of the dataset, the authors download a huge number of high-resolution panorama images from the Internet, and group them into different place categories. Then, they designed a WebGL annotation tool for annotating the polygons and cuboids for objects in the scene.

82 PAPERS • 1 BENCHMARK

Aachen Day-Night

Aachen Day-Night is a dataset designed for benchmarking 6DOF outdoor visual localization in changing conditions. It focuses on localizing high-quality night-time images against a day-time 3D model. There are 14,607 images with changing conditions of weather, season and day-night cycles.

81 PAPERS • 1 BENCHMARK

RAVEN

RAVEN consists of 1,120,000 images and 70,000 RPM (Raven's Progressive Matrices) problems, equally distributed in 7 distinct figure configurations.

81 PAPERS • NO BENCHMARKS YET

xView (mejdi)

xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes. It contains over 1M object instances from 60 different classes.

81 PAPERS • 1 BENCHMARK

Datasets

2661 dataset results for Images