Cityscapes is a large-scale dataset focused on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel-level annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5,000 finely annotated images and 20,000 coarsely annotated ones. Data was captured in 50 cities over several months, at different times of day, and in good weather conditions. It was originally recorded as video, so the frames were manually selected to exhibit a large number of dynamic objects, varying scene layouts, and varying backgrounds.
3,527 PAPERS • 54 BENCHMARKS
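Cityscapes has a built-in loader in torchvision; a minimal sketch, assuming the official leftImg8bit/gtFine packages are unpacked under a local ./cityscapes directory (that path is an assumption about your layout):

```python
# Minimal sketch: loading Cityscapes semantic masks via torchvision.
from torchvision.datasets import Cityscapes

dataset = Cityscapes(
    root="./cityscapes",    # assumed local path
    split="train",          # 'train', 'val', or 'test'
    mode="fine",            # 'fine' (~5k images) or 'coarse' (~20k images)
    target_type="semantic", # dense pixel-level label masks
)
image, mask = dataset[0]    # both returned as PIL images
```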
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their needs. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 acquisitions (140 for training and 112 for testing) – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odometry challenge) with 11 classes.
3,458 PAPERS • 141 BENCHMARKS
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, including stuff classes such as sky, road, and grass, and discrete objects such as person, car, and bed.
1,119 PAPERS • 28 BENCHMARKS
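The scene parsing release (ADEChallengeData2016) stores annotations as PNGs whose pixel values are class indices (0 = unlabeled, 1–150), so per-image class statistics can be computed directly; a small sketch, where the file path follows that release's naming convention but is an assumption about your download:

```python
# Count pixels per semantic class in one ADE20K annotation mask.
import numpy as np
from PIL import Image

mask = np.array(Image.open(
    "ADEChallengeData2016/annotations/training/ADE_train_00000001.png"))
ids, counts = np.unique(mask, return_counts=True)
for class_id, n_pixels in zip(ids, counts):
    print(f"class {class_id}: {n_pixels} pixels")
```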
The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.
888 PAPERS • 14 BENCHMARKS
SYNTHIA is a synthetic dataset consisting of 9,400 multi-viewpoint photo-realistic frames rendered from a virtual city, with pixel-level semantic annotations for 13 classes. Each frame has a resolution of 1280 × 960.
523 PAPERS • 11 BENCHMARKS
The GTA5 dataset contains 24,966 synthetic images with pixel-level semantic annotations. The images were rendered using the open-world video game Grand Theft Auto 5 and are all captured from the car perspective in the streets of American-style virtual cities. There are 19 semantic classes, compatible with those of the Cityscapes dataset.
400 PAPERS • 8 BENCHMARKS
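The 19 shared evaluation classes follow the standard Cityscapes train-ID order; a small reference sketch for mapping prediction indices to names in GTA5-to-Cityscapes experiments:

```python
# The 19 Cityscapes-compatible classes in train-ID order (0-18),
# as commonly used for GTA5 -> Cityscapes domain adaptation.
CLASSES = (
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train",
    "motorcycle", "bicycle",
)
assert len(CLASSES) == 19
print(CLASSES[13])  # train ID 13 -> "car"
```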
Perceptual Similarity is a dataset of human perceptual similarity judgments.
385 PAPERS • NO BENCHMARKS YET
DeepFashion is a dataset containing around 800K diverse fashion images with rich annotations (46 categories, 1,000 descriptive attributes, bounding boxes, and landmark information), ranging from well-posed product images to real-world-like consumer photos.
376 PAPERS • 6 BENCHMARKS
The Common Objects in COntext-stuff (COCO-stuff) dataset is a dataset for scene understanding tasks such as semantic segmentation, object detection, and image captioning. It was constructed by augmenting the original COCO dataset, which annotated things while neglecting stuff. The COCO-stuff dataset contains 164k images spanning 172 categories: 80 thing classes, 91 stuff classes, and 1 unlabeled class.
305 PAPERS • 20 BENCHMARKS
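The stuff annotations ship in standard COCO JSON format, so they can be inspected with pycocotools; a sketch, where the annotation file name is an assumption about a local copy of the official stuff_annotations_trainval2017 release:

```python
# Inspect COCO-stuff categories with the standard COCO API.
from pycocotools.coco import COCO

coco = COCO("annotations/stuff_train2017.json")  # assumed local path
cats = coco.loadCats(coco.getCatIds())
print(len(cats), "categories, e.g.", [c["name"] for c in cats[:5]])
```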
Animal Faces-HQ (AFHQ) is a dataset of animal faces consisting of 15,000 high-quality images at 512 × 512 resolution. The dataset includes three domains – cat, dog, and wildlife – each providing 5,000 images. With multiple (three) domains and diverse images of various breeds (≥ 8) per domain, AFHQ poses a more challenging image-to-image translation problem. All images are vertically and horizontally aligned to place the eyes at the center, and low-quality images were discarded manually.
290 PAPERS • 6 BENCHMARKS
Foggy Cityscapes is a synthetic dataset that simulates fog on real scenes. Each foggy image is rendered from a clear image and its depth map from Cityscapes, so the annotations and data splits of Foggy Cityscapes are inherited from Cityscapes.
230 PAPERS • 6 BENCHMARKS
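The rendering follows the standard optical model of fog, I = J·t + L·(1 − t) with transmittance t = exp(−β·d) for scene depth d; the released dataset uses attenuation coefficients β ∈ {0.005, 0.01, 0.02}. A minimal sketch of that model (input conventions, value ranges, and the atmospheric light value here are assumptions):

```python
# Minimal sketch of the fog optical model behind Foggy Cityscapes:
#   I = J * t + L * (1 - t),  t = exp(-beta * depth)
import numpy as np

def add_fog(image, depth_m, beta=0.01, atmospheric_light=0.9):
    """image: HxWx3 floats in [0, 1]; depth_m: HxW depth in meters."""
    t = np.exp(-beta * depth_m)[..., None]  # per-pixel transmittance
    return image * t + atmospheric_light * (1.0 - t)
```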
CelebAMask-HQ is a large-scale face image dataset with 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image has a segmentation mask of facial attributes corresponding to CelebA.
155 PAPERS • 5 BENCHMARKS
The Radboud Faces Database (RaFD) is a set of pictures of 67 models (adults and children, male and female) displaying 8 emotional expressions.
77 PAPERS • 2 BENCHMARKS
LLVIP is a visible-infrared paired dataset for low-light vision. It contains 30,976 images (15,488 pairs) from 24 dark scenes and 2 daytime scenes, and supports image-to-image translation (visible to infrared, or infrared to visible), visible and infrared image fusion, low-light pedestrian detection, and infrared pedestrian detection. The original image and video pairs (before registration) are also released.
75 PAPERS • 7 BENCHMARKS
Synscapes is a synthetic dataset for street scene parsing created using photorealistic rendering techniques. It delivers state-of-the-art results for training and validation and enables new types of analysis.
44 PAPERS • 1 BENCHMARK
A dataset that enables detailed reconstruction of a human body model in clothing from a single monocular RGB video, without requiring a pre-scanned template or manually clicked points.
34 PAPERS • NO BENCHMARKS YET
UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories — shoes, sandals, slippers, and boots — followed by functional types and individual brands. The shoes are centered on a white background and pictured in the same orientation for convenient analysis.
30 PAPERS • 1 BENCHMARK
The IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for each subject includes T1-, T2-, and PD-weighted images, MRA images, and diffusion-weighted images (15 directions).
23 PAPERS • 4 BENCHMARKS
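The images are distributed as NIfTI volumes, which can be read with nibabel; a sketch, where the file name follows the IXI naming convention but is an assumption about your download:

```python
# Load one IXI MR volume and report its grid and voxel spacing.
import nibabel as nib

vol = nib.load("IXI002-Guys-0828-T1.nii.gz")  # assumed file name
data = vol.get_fdata()                        # 3D intensity array
print(data.shape, vol.header.get_zooms())     # shape and spacing in mm
```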
VIDIT is a reference evaluation benchmark intended to push forward the development of illumination manipulation methods. VIDIT includes 390 different Unreal Engine scenes, each captured with 40 illumination settings, resulting in 15,600 images. The illumination settings are all combinations of 5 color temperatures (2500K, 3500K, 4500K, 5500K, and 6500K) and 8 light directions (N, NE, E, SE, S, SW, W, NW). The original image resolution is 1024×1024.
20 PAPERS • 1 BENCHMARK
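The image count follows directly from the factorial design: 5 temperatures × 8 directions = 40 settings, and 390 scenes × 40 settings = 15,600 images. A quick check:

```python
# Enumerate VIDIT's 40 illumination settings and verify the image count.
from itertools import product

temperatures = [2500, 3500, 4500, 5500, 6500]          # Kelvin
directions = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
settings = list(product(temperatures, directions))
assert len(settings) == 40
print(390 * len(settings))  # -> 15600 images
```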
The largest annotated image memorability dataset to date, with 60,000 labeled images from a diverse array of sources.
17 PAPERS • NO BENCHMARKS YET
The evaluation of human epidermal growth factor receptor 2 (HER2) expression is essential for formulating a precise treatment for breast cancer. Routine evaluation of HER2 is conducted with immunohistochemical techniques (IHC), which are very expensive. Therefore, we propose a breast cancer immunohistochemical (BCI) benchmark that attempts to synthesize IHC data directly from paired hematoxylin and eosin (HE) stained images. The dataset contains 4,870 registered image pairs covering a variety of HER2 expression levels (0, 1+, 2+, 3+).
15 PAPERS • 1 BENCHMARK
The selfie dataset contains 46,836 selfie images annotated with 36 different attributes. We only use photos of females as training and test data: the training set contains 3,400 images and the test set 100 images, at a size of 256 × 256. For the anime dataset, we first retrieved 69,926 animation character images from Anime-Planet. From these, 27,023 face images were extracted using an anime-face detector. After selecting only female character images and manually removing monochrome images, we collected two sets of female anime face images, with 3,400 and 100 images for training and testing respectively – the same numbers as in the selfie dataset. Finally, all anime face images were resized to 256 × 256 using a CNN-based image super-resolution algorithm.
10 PAPERS • 1 BENCHMARK
SEN12MS-CR-TS is a multi-modal, multi-temporal dataset for cloud removal. It contains time series of paired and co-registered Sentinel-1 and cloudy as well as cloud-free Sentinel-2 data from the European Space Agency's Copernicus mission. Each time series contains 30 cloudy and clear observations regularly sampled throughout the year 2018. The dataset is readily pre-processed and backward-compatible with SEN12MS-CR.
7 PAPERS • 1 BENCHMARK
Breast cancer (BC) has become the greatest threat to women's health worldwide. Clinically, identification of axillary lymph node (ALN) metastasis and of other clinical tumor characteristics, such as ER and PR status, is important for evaluating prognosis and guiding treatment for BC patients.
6 PAPERS • NO BENCHMARKS YET
FFHQ-Aging is a dataset of human faces designed for benchmarking age transformation algorithms, as well as many other possible vision tasks. It is an extension of the NVIDIA FFHQ dataset: on top of the 70,000 original FFHQ images, it also contains the following information for each image (see the sketch after this list):

* Gender information (male/female, with confidence score)
* Age group information (10 classes, with confidence score)
* Head pose (pitch, roll & yaw)
* Glasses type (none, normal, or dark)
* Eye occlusion score (0-100, a separate score for each eye)
* Full semantic map (19 classes, based on CelebAMask-HQ labels)
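A hypothetical per-image record illustrating the fields above; the key names and example values are illustrative, not the dataset's actual schema:

```python
# Illustrative (hypothetical) FFHQ-Aging label record for one image.
record = {
    "image_id": 12345,
    "gender": {"label": "female", "confidence": 0.98},
    "age_group": {"label": "30-39", "confidence": 0.87},  # one of 10 classes
    "head_pose": {"pitch": -2.1, "roll": 0.4, "yaw": 11.8},
    "glasses": "none",                                    # none/normal/dark
    "eye_occlusion": {"left": 5, "right": 12},            # 0-100 per eye
    "semantic_map": "labels/12345.png",                   # 19-class mask
}
```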
This is the dataset for the VCIP 2020 Grand Challenge on NIR Image Colorization. See https://jchenhkg.github.io/projects/NIR2RGB_VCIP_Challenge/ for a detailed description. If you find this dataset helpful, please cite our paper:

@inproceedings{yang2023cooperative,
  title={Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation},
  author={Yang, Xingxing and Chen, Jie and Yang, Zaifeng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={2409--2417},
  year={2023}
}
3 PAPERS • 1 BENCHMARK
The Mila Simulated Floods Dataset is a 1.5 km² virtual world built with the Unity3D game engine, including urban, suburban, and rural areas.
2 PAPERS • 1 BENCHMARK
A collection of experimental and synthetic (simulated) optoacoustic (OA) raw-signal and reconstructed-image datasets, rendered with different experimental parameters and tomographic acquisition geometries.
2 PAPERS • NO BENCHMARKS YET
This dataset contains synthetic images extracted from the CARLA simulator, along with rich information extracted from the deferred rendering pipeline of Unreal Engine 4. Its main purpose is training the state-of-the-art image-to-image translation model proposed by Intel Labs in "Enhancing Photorealism Enhancement" (EPE). Translation results targeting the characteristics of Cityscapes, KITTI, and Mapillary Vistas are also provided. Computer-vision models trained on these data are expected to perform better when deployed in the real world.
1 PAPER • NO BENCHMARKS YET
LISA Gaze is a dataset for driver gaze estimation comprising 11 long drives by 10 subjects in two different cars.
UDA-CH contains 16 objects covering a variety of artworks that can be found in a museum, such as sculptures, paintings, and books. Specifically, the dataset was collected inside the cultural site "Galleria Regionale di Palazzo Bellomo" in Siracusa, Italy.
1 PAPER • 1 BENCHMARK