The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.
12,551 PAPERS • 49 BENCHMARKS
CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes like hair color, gender and age.
2,879 PAPERS • 18 BENCHMARKS
Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc. The images were crawled from Flickr, thus inheriting all the biases of that website, and automatically aligned and cropped using dlib. Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Amazon Mechanical Turk was used to remove the occasional statues, paintings, or photos of photos.
1,068 PAPERS • 16 BENCHMARKS
The Places dataset is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories with more than 5,000 images per category.
969 PAPERS • 6 BENCHMARKS
The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.
735 PAPERS • 12 BENCHMARKS
ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China under varying weather conditions. Pixel-wise semantic annotation of the recorded data is provided in 2D, with point-wise semantic annotation in 3D for 28 classes. In addition, the dataset contains lane marking annotations in 2D.
64 PAPERS • 5 BENCHMARKS
The Places365 dataset is a scene recognition dataset. It is composed of 10 million images comprising 434 scene classes. There are two versions of the dataset: Places365-Standard with 1.8 million train and 36000 validation images from K=365 scene classes, and Places365-Challenge-2016, in which the size of the training set is increased up to 6.2 million extra images, including 69 new scene classes (leading to a total of 8 million train images from 434 scene classes).
51 PAPERS • 8 BENCHMARKS
Fashion-Gen consists of 293,008 high definition (1360 x 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles.
29 PAPERS • NO BENCHMARKS YET
CASIA V2 is a dataset for forgery classification. It contains 4795 images, 1701 authentic and 3274 forged.
11 PAPERS • NO BENCHMARKS YET
The Free-Form Video Inpainting dataset is a dataset used for training and evaluation video inpainting models. It consists of 1940 videos from the YouTube-VOS dataset and 12,600 videos from the YouTube-BoundingBoxes.
7 PAPERS • NO BENCHMARKS YET
A diverse dataset of human faces, including unconventional poses, occluded faces, and a vast variability in backgrounds.
6 PAPERS • NO BENCHMARKS YET
PAL4Inpaint is a dataset consisting of 4,795 inpainting results with per-pixel perceptual artifacts annotations designed for image inpainting tasks.
2 PAPERS • NO BENCHMARKS YET
The Inpainting dataset consists of synchronized Labeled image and LiDAR scanned point clouds. It's captured by HESAI Pandora All-in-One Sensing Kit. It is collected under various lighting conditions and traffic densities in Beijing, China.
1 PAPER • 1 BENCHMARK
Synthetic training set: This set is constructed in the following two steps and will be used for estimation/training purposes. i) 84,000 275 pixel x 400 pixel ground-truth fingerprint images without any noise or scratches, but with random transformations (at most five pixels translation and +/-10 degrees rotation) were generated by using the software Anguli: Synthetic Fingerprint Generator. ii) 84,000 275 pixel x 400 pixel degraded fingerprint images were generated by applying random artifacts (blur, brightness, contrast, elastic transformation, occlusion, scratch, resolution, rotation) and backgrounds to the ground-truth fingerprint images. In total, it contains 168,000 fingerprint images (84,000 fingerprints, and two impressions - one ground-truth and one degraded - per fingerprint).
1 PAPER • NO BENCHMARKS YET
QST contains 1,167 video clips that are cut out from 216 time-lapse 4K videos collected from YouTube, which can be used for a variety of tasks, such as (high-resolution) video generation, (high-resolution) video prediction, (high-resolution) image generation, texture generation, image inpainting, image/video super-resolution, image/video colorization, image/video animating, etc. Each short clip contains multiple frames (from a minimum of 58 frames to a maximum of 1,200 frames, a total of 285,446 frames), and the resolution of each frame is more than 1,024 x 1,024. Specifically, QST consists of a training set (containing 1000 clips, totally 244,930 frames), a validation set (containing 100 clips, totally 23,200 frames), and a testing set (containing 67 clips, totally 17,316 frames). Click here (Key: qst1) to download the QST dataset.
1 PAPER • NO BENCHMARKS YET
Inpainting networks are typically benchmarked on samples from Places2 dataset. However, this dataset does not have high resolution images for evaluation purposes. Instead, we will use images from the Unsplash-Lite Dataset, which contains 25k high resolution nature-themed photos. We randomly sampled 1000 images from the dataset. Each image is resized and cropped to 1024x1024, and a set of masks is generated with thin, medium, and thick brush strokes, using the methodology described in LaMa.
1 PAPER • NO BENCHMARKS YET