The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.
7,661 PAPERS • 80 BENCHMARKS
The Places dataset is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories with more than 5,000 images per category.
544 PAPERS • 3 BENCHMARKS
Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc. The images were crawled from Flickr, thus inheriting all the biases of that website, and automatically aligned and cropped using dlib. Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Amazon Mechanical Turk was used to remove the occasional statues, paintings, or photos of photos.
417 PAPERS • 12 BENCHMARKS
The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.
393 PAPERS • 11 BENCHMARKS
ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China under varying weather conditions. Pixel-wise semantic annotation of the recorded data is provided in 2D, with point-wise semantic annotation in 3D for 28 classes. In addition, the dataset contains lane marking annotations in 2D.
43 PAPERS • 5 BENCHMARKS
Fashion-Gen consists of 293,008 high definition (1360 x 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles.
13 PAPERS • NO BENCHMARKS YET
The Places365 dataset is a scene recognition dataset. It is composed of 10 million images comprising 434 scene classes. There are two versions of the dataset: Places365-Standard with 1.8 million train and 36000 validation images from K=365 scene classes, and Places365-Challenge-2016, in which the size of the training set is increased up to 6.2 million extra images, including 69 new scene classes (leading to a total of 8 million train images from 434 scene classes).
13 PAPERS • 5 BENCHMARKS
CASIA V2 is a dataset for forgery classification. It contains 4795 images, 1701 authentic and 3274 forged.
6 PAPERS • NO BENCHMARKS YET
The Free-Form Video Inpainting dataset is a dataset used for training and evaluation video inpainting models. It consists of 1940 videos from the YouTube-VOS dataset and 12,600 videos from the YouTube-BoundingBoxes.
5 PAPERS • NO BENCHMARKS YET
A diverse dataset of human faces, including unconventional poses, occluded faces, and a vast variability in backgrounds.
4 PAPERS • NO BENCHMARKS YET
The Inpainting dataset consists of synchronized Labeled image and LiDAR scanned point clouds. It's captured by HESAI Pandora All-in-One Sensing Kit. It is collected under various lighting conditions and traffic densities in Beijing, China.
1 PAPER • 1 BENCHMARK
QST contains 1,167 video clips that are cut out from 216 time-lapse 4K videos collected from YouTube, which can be used for a variety of tasks, such as (high-resolution) video generation, (high-resolution) video prediction, (high-resolution) image generation, texture generation, image inpainting, image/video super-resolution, image/video colorization, image/video animating, etc. Each short clip contains multiple frames (from a minimum of 58 frames to a maximum of 1,200 frames, a total of 285,446 frames), and the resolution of each frame is more than 1,024 x 1,024. Specifically, QST consists of a training set (containing 1000 clips, totally 244,930 frames), a validation set (containing 100 clips, totally 23,200 frames), and a testing set (containing 67 clips, totally 17,316 frames). Click here (Key: qst1) to download the QST dataset.
1 PAPER • NO BENCHMARKS YET