Cityscapes is a large-scale database that focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 finely annotated images and 20000 coarsely annotated ones. Data was captured in 50 cities over several months, in the daytime, and in good weather conditions. It was originally recorded as video, and the frames were manually selected to have the following features: a large number of dynamic objects, varying scene layout, and varying background.
1,935 PAPERS • 28 BENCHMARKS
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odome
1,862 PAPERS • 101 BENCHMARKS
The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.
395 PAPERS • 11 BENCHMARKS
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, which include stuff such as sky, road, and grass, and discrete objects such as person, car, and bed.
375 PAPERS • 10 BENCHMARKS
The SYNTHIA dataset is a synthetic dataset that consists of 9400 multi-viewpoint photo-realistic frames rendered from a virtual city and comes with pixel-level semantic annotations for 13 classes. Each frame has a resolution of 1280 × 960.
298 PAPERS • 9 BENCHMARKS
The GTA5 dataset contains 24966 synthetic images with pixel-level semantic annotation. The images were rendered using the open-world video game Grand Theft Auto 5 and are all taken from the car perspective in the streets of American-style virtual cities. There are 19 semantic classes, which are compatible with those of the Cityscapes dataset.
231 PAPERS • 6 BENCHMARKS
DeepFashion is a dataset containing around 800K diverse fashion images with rich annotations (46 categories, 1,000 descriptive attributes, bounding boxes, and landmark information), ranging from well-posed product images to real-world-like consumer photos.
215 PAPERS • 5 BENCHMARKS
Perceptual Similarity is a dataset of human perceptual similarity judgments.
158 PAPERS • NO BENCHMARKS YET
The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks such as semantic segmentation, object detection, and image captioning. It is constructed by annotating the original COCO dataset, which annotated things while neglecting stuff. The COCO-stuff dataset contains 164k images spanning 172 categories: 80 thing classes, 91 stuff classes, and 1 unlabeled class.
121 PAPERS • 9 BENCHMARKS
Foggy Cityscapes is a synthetic foggy dataset which simulates fog on real scenes. Each foggy image is rendered with a clear image and depth map from Cityscapes. Thus the annotations and data split in Foggy Cityscapes are inherited from Cityscapes.
106 PAPERS • 3 BENCHMARKS
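The Foggy Cityscapes entry above says each foggy image is rendered from a clear image and a depth map. As a rough illustration of how such a rendering can work, the sketch below assumes the standard optical model of fog, in which each pixel is blended toward a constant airlight by a transmittance that decays exponentially with depth. The listing does not state which model the dataset uses, so the function name `add_fog`, the attenuation coefficient `beta`, and the `airlight` value are all hypothetical choices made for this example.

```python
import math


def add_fog(clear, depth, beta=0.01, airlight=1.0):
    """Blend each pixel toward a constant airlight according to its depth.

    clear:    2D list of pixel intensities in [0, 1]
    depth:    2D list of distances in metres, same shape as `clear`
    beta:     attenuation coefficient (hypothetical default)
    airlight: intensity of the fog itself (hypothetical default)
    """
    foggy = []
    for clear_row, depth_row in zip(clear, depth):
        row = []
        for intensity, distance in zip(clear_row, depth_row):
            # Transmittance: fraction of scene light that survives the fog.
            t = math.exp(-beta * distance)
            row.append(intensity * t + airlight * (1.0 - t))
        foggy.append(row)
    return foggy


# A near pixel (0 m) keeps its intensity; a far pixel (1000 m) fades to fog.
example = add_fog([[0.2, 0.2]], [[0.0, 1000.0]])
```

Because only the pixel values change, the label maps of the clear images remain valid for the foggy ones, which is consistent with the annotations and split being inherited from Cityscapes.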
Animal Faces-HQ (AFHQ) is a dataset of animal faces consisting of 15,000 high-quality images at 512 × 512 resolution. The dataset includes three domains (cat, dog, and wildlife), each providing 5000 images. By having multiple (three) domains and diverse images of various breeds (≥ eight) per domain, AFHQ sets a more challenging image-to-image translation problem. All images are vertically and horizontally aligned to have the eyes at the center. Low-quality images were discarded manually.
73 PAPERS • 6 BENCHMARKS
The Radboud Faces Database (RaFD) is a set of pictures of 67 models (adults and children, males and females) displaying 8 emotional expressions.
67 PAPERS • 2 BENCHMARKS
CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image has a segmentation mask of facial attributes corresponding to CelebA.
55 PAPERS • 1 BENCHMARK
Synscapes is a synthetic dataset for street scene parsing created using photorealistic rendering techniques. Its authors report state-of-the-art results for training and validation, as well as new types of analysis.
24 PAPERS • 1 BENCHMARK
UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories — shoes, sandals, slippers, and boots — followed by functional types and individual brands. The shoes are centered on a white background and pictured in the same orientation for convenient analysis.
24 PAPERS • NO BENCHMARKS YET
VIDIT is a reference evaluation benchmark intended to push forward the development of illumination manipulation methods. VIDIT includes 390 different Unreal Engine scenes, each captured with 40 illumination settings, resulting in 15,600 images. The illumination settings are all the combinations of 5 color temperatures (2500K, 3500K, 4500K, 5500K, and 6500K) and 8 light directions (N, NE, E, SE, S, SW, W, NW). The original image resolution is 1024×1024.
17 PAPERS • 1 BENCHMARK
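The VIDIT counts above follow from a simple Cartesian product: 5 color temperatures times 8 light directions gives 40 illumination settings, and 390 scenes times 40 settings gives 15,600 images. The short sketch below only checks that arithmetic; the variable names are illustrative and say nothing about how the dataset's files are actually organized.

```python
from itertools import product

# Values taken from the VIDIT description; names are illustrative only.
COLOR_TEMPERATURES_K = [2500, 3500, 4500, 5500, 6500]
LIGHT_DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
NUM_SCENES = 390

# Every (temperature, direction) pair is one illumination setting.
illumination_settings = list(product(COLOR_TEMPERATURES_K, LIGHT_DIRECTIONS))

# Each scene is captured once per illumination setting.
total_images = NUM_SCENES * len(illumination_settings)

print(len(illumination_settings), total_images)
```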
Enables detailed human body model reconstruction in clothing from a single monocular RGB video, without requiring a pre-scanned template or manually clicked points.
13 PAPERS • NO BENCHMARKS YET
The largest annotated image memorability dataset to date (with 60,000 labeled images from a diverse array of sources).
11 PAPERS • NO BENCHMARKS YET
The selfie dataset contains 46,836 selfie images annotated with 36 different attributes. We only use photos of females as training data and test data. The size of the training dataset is 3400, and that of the test dataset is 100, with an image size of 256 x 256. For the anime dataset, we first retrieved 69,926 animation character images from Anime-Planet. Among those images, 27,023 face images were extracted by using an anime-face detector. After selecting only female character images and manually removing monochrome images, we collected two datasets of female anime face images, with sizes of 3400 and 100 for training and test data respectively, matching the sizes of the selfie dataset. Finally, all anime face images were resized to 256 x 256 by applying a CNN-based image super-resolution algorithm.
8 PAPERS • 1 BENCHMARK
FFHQ-Aging is a dataset of human faces designed for benchmarking age transformation algorithms as well as many other possible vision tasks. This dataset is an extension of the NVIDIA FFHQ dataset. On top of the 70,000 original FFHQ images, it also contains the following information for each image: gender information (male/female with confidence score); age group information (10 classes with confidence score); head pose (pitch, roll, and yaw); glasses type (none, normal, or dark); eye occlusion score (0-100, a different score for each eye); and a full semantic map (19 classes, based on CelebAMask-HQ labels).
4 PAPERS • NO BENCHMARKS YET
Visible-infrared Paired Dataset for Low-light Vision: 30,976 images (15,488 pairs) covering 24 dark scenes and 2 daytime scenes. It supports image-to-image translation (visible to infrared, or infrared to visible), visible and infrared image fusion, low-light pedestrian detection, and infrared pedestrian detection.
3 PAPERS • 4 BENCHMARKS
IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for each subject includes:
2 PAPERS • 2 BENCHMARKS
Breast cancer (BC) has become the greatest threat to women’s health worldwide. Clinically, identification of axillary lymph node (ALN) metastasis and other tumor clinical characteristics such as ER, PR, and so on, are important for evaluating the prognosis and guiding the treatment for BC patients.
1 PAPER • NO BENCHMARKS YET
LISA Gaze is a dataset for driver gaze estimation comprising 11 long drives, driven by 10 subjects in two different cars.
1 PAPER • NO BENCHMARKS YET
Mila Simulated Floods Dataset is a 1.5 square km virtual world built with the Unity3D game engine, including urban, suburban, and rural areas.
1 PAPER • NO BENCHMARKS YET
UDA-CH contains 16 objects that cover a variety of artworks found in a museum, such as sculptures, paintings, and books. Specifically, the dataset was collected inside the cultural site “Galleria Regionale di Palazzo Bellomo” located in Siracusa, Italy.
1 PAPER • NO BENCHMARKS YET