🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

67 dataset results for Image Generation

CIFAR-10 (Canadian Institute for Advanced Research, 10 classes)

The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6000 images per class with 5000 training and 1000 testing images per class.

14,073 PAPERS • 98 BENCHMARKS

ImageNet

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010 the dataset is used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.

13,419 PAPERS • 40 BENCHMARKS

CIFAR-100

The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60000 32x32 color images. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. There are 600 images per class. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 500 training images and 100 testing images per class.

7,645 PAPERS • 52 BENCHMARKS

MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students) which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.

6,975 PAPERS • 52 BENCHMARKS

Cityscapes

Cityscapes is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset consists of around 5000 fine annotated images and 20000 coarse annotated ones. Data was captured in 50 cities during several months, daytimes, and good weather conditions. It was originally recorded as video so the frames were manually selected to have the following features: large number of dynamic objects, varying scene layout, and varying background.

3,319 PAPERS • 54 BENCHMARKS

CelebA (CelebFaces Attributes Dataset)

CelebFaces Attributes dataset contains 202,599 face images of the size 178×218 from 10,177 celebrities, each annotated with 40 binary labels indicating facial attributes like hair color, gender and age.

3,081 PAPERS • 20 BENCHMARKS

Fashion-MNIST

Fashion-MNIST is a dataset comprising of 28×28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST shares the same image size, data format and the structure of training and testing splits with the original MNIST.

2,779 PAPERS • 17 BENCHMARKS

CUB-200-2011 (Caltech-UCSD Birds-200-2011)

The Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset is the most widely-used dataset for fine-grained visual categorization task. It contains 11,788 images of 200 subcategories belonging to birds, 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes and 1 bounding box. The textual information comes from Reed et al.. They expand the CUB-200-2011 dataset by collecting fine-grained natural language descriptions. Ten single-sentence descriptions are collected for each image. The natural language descriptions are collected through the Amazon Mechanical Turk (AMT) platform, and are required at least 10 words, without any information of subcategories and actions.

1,956 PAPERS • 44 BENCHMARKS

FFHQ (Flickr-Faces-HQ)

Flickr-Faces-HQ (FFHQ) consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc. The images were crawled from Flickr, thus inheriting all the biases of that website, and automatically aligned and cropped using dlib. Only images under permissive licenses were collected. Various automatic filters were used to prune the set, and finally Amazon Mechanical Turk was used to remove the occasional statues, paintings, or photos of photos.

1,220 PAPERS • 16 BENCHMARKS

Oxford 102 Flower (102 Category Flower Dataset)

Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers chosen to be flower commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images.

1,039 PAPERS • 14 BENCHMARKS

STL-10 (Self-Taught Learning 10)

The STL-10 is an image dataset derived from ImageNet and popularly used to evaluate algorithms of unsupervised feature learning or self-taught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes (such as birds, cats, trucks), among which 5,000 images are partitioned for training while the remaining 8,000 images for testing. All the images are color images with 96×96 pixels in size.

957 PAPERS • 17 BENCHMARKS

CelebA-HQ

The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution.

808 PAPERS • 13 BENCHMARKS

LSUN (Large-scale Scene UNderstanding Challenge)

The Large-scale Scene Understanding (LSUN) challenge aims to provide a different benchmark for large-scale scene classification and understanding. The LSUN classification dataset contains 10 scene categories, such as dining room, bedroom, chicken, outdoor church, and so on. For training data, each category contains a huge number of images, ranging from around 120,000 to 3,000,000. The validation data includes 300 images, and the test data has 1000 images for each category.

796 PAPERS • 10 BENCHMARKS

Stanford Cars

The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. The data is divided into almost a 50-50 train/test split with 8,144 training images and 8,041 testing images. Categories are typically at the level of Make, Model, Year. The images are 360×240.

644 PAPERS • 10 BENCHMARKS

CLEVR (Compositional Language and Elementary Visual Reasoning)

CLEVR (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It contains images of 3D-rendered objects; each image comes with a number of highly compositional questions that fall into different categories. Those categories fall into 5 classes of tasks: Exist, Count, Compare Integer, Query Attribute and Compare Attribute. The CLEVR dataset consists of: a training set of 70k images and 700k questions, a validation set of 15k images and 150k questions, a test set of 15k images and 150k questions about objects, answers, scene graphs and functional programs for all train and validation images and questions. Each object present in the scene, aside of position, is characterized by a set of four attributes: 2 sizes: large, small, 3 shapes: square, cylinder, sphere, 2 material types: rubber, metal, 8 color types: gray, blue, brown, yellow, red, green, purple, cyan, resulting in 96 unique combinations.

577 PAPERS • 2 BENCHMARKS

iNaturalist

The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories. Those categories belong to 13 super-categories including Plantae (Plant), Insecta (Insect), Aves (Bird), Mammalia (Mammal), and so on. The iNat dataset is highly imbalanced with dramatically different number of images per category. For example, the largest super-category “Plantae (Plant)” has 196,613 images from 2,101 categories; whereas the smallest super-category “Protozoa” only has 381 images from 4 categories.

495 PAPERS • 12 BENCHMARKS

Perceptual Similarity

Perceptual Similarity is a dataset of human perceptual similarity judgments.

331 PAPERS • NO BENCHMARKS YET

FaceForensics++

FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures. The data has been sourced from 977 youtube videos and all videos contain a trackable mostly frontal face without occlusions which enables automated tampering methods to generate realistic forgeries.

311 PAPERS • 2 BENCHMARKS

Replica

The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, planar segmentation as well as semantic class and instance segmentation.

279 PAPERS • 3 BENCHMARKS

AFHQ (Animal Faces-HQ)

Animal FacesHQ (AFHQ) is a dataset of animal faces consisting of 15,000 high-quality images at 512 × 512 resolution. The dataset includes three domains of cat, dog, and wildlife, each providing 5000 images. By having multiple (three) domains and diverse images of various breeds (≥ eight) per each domain, AFHQ sets a more challenging image-to-image translation problem. All images are vertically and horizontally aligned to have the eyes at the center. The low-quality images were discarded by human effort.

264 PAPERS • 6 BENCHMARKS

EMNIST (Extended MNIST)

EMNIST (extended MNIST) has 4 times more data than MNIST. It is a set of handwritten digits with a 28 x 28 format.

234 PAPERS • 9 BENCHMARKS

DensePose (DensePose-COCO)

DensePose-COCO is a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images and train DensePose-RCNN, to densely regress part-specific UV coordinates within every human region at multiple frames per second.

230 PAPERS • NO BENCHMARKS YET

VizDoom

ViZDoom is an AI research platform based on the classical First Person Shooter game Doom. The most popular game mode is probably the so-called Death Match, where several players join in a maze and fight against each other. After a fixed time, the match ends and all the players are ranked by the FRAG scores defined as kills minus suicides. During the game, each player can access various observations, including the first-person view screen pixels, the corresponding depth-map and segmentation-map (pixel-wise object labels), the bird-view maze map, etc. The valid actions include almost all the keyboard-stroke and mouse-control a human player can take, accounting for moving, turning, jumping, shooting, changing weapon, etc. ViZDoom can run a game either synchronously or asynchronously, indicating whether the game core waits until all players’ actions are collected or runs in a constant frame rate without waiting.

149 PAPERS • 3 BENCHMARKS

CelebAMask-HQ

CelebAMask-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image has segmentation mask of facial attributes corresponding to CelebA.

141 PAPERS • 4 BENCHMARKS

Structured3D

Structured3D is a large-scale photo-realistic dataset containing 3.5K house designs (a) created by professional designers with a variety of ground truth 3D structure annotations (b) and generate photo-realistic 2D images (c). The dataset consists of rendering images and corresponding ground truth annotations (e.g., semantic, albedo, depth, surface normal, layout) under different lighting and furniture configurations.

65 PAPERS • 1 BENCHMARK

MetFaces

MetFaces is an image dataset of human faces extracted from works of art. The dataset consists of 1336 high-quality PNG images at 1024×1024 resolution. The images were downloaded via the Metropolitan Museum of Art Collection API, and automatically aligned and cropped using dlib. Various automatic filters were used to prune the set.

64 PAPERS • 2 BENCHMARKS

Recipe1M+

Recipe1M+ is a dataset which contains one million structured cooking recipes with 13M associated images.

62 PAPERS • 3 BENCHMARKS

LLVIP (A Visible-infrared Paired Dataset for Low-light Vision)

Visible-infrared Paired Dataset for Low-light Vision 30976 images (15488 pairs) 24 dark scenes, 2 daytime scenes Support for image-to-image translation (visible to infrared, or infrared to visible), visible and infrared image fusion, low-light pedestrian detection, and infrared pedestrian detection (The original image and video pairs (before registration) of LLVIP are also released!)

49 PAPERS • 6 BENCHMARKS

Stanford Dogs

The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into 12,000 images for training and 8,580 images for testing.

46 PAPERS • 5 BENCHMARKS

Stacked MNIST

The Stacked MNIST dataset is derived from the standard MNIST dataset with an increased number of discrete modes. 240,000 RGB images in the size of 32×32 are synthesized by stacking three random digit images from MNIST along the color channel, resulting in 1,000 explicit modes in a uniform distribution corresponding to the number of possible triples of digits.

43 PAPERS • 1 BENCHMARK

GVGAI (General Video Game AI)

The General Video Game AI (GVGAI) framework is widely used in research which features a corpus of over 100 single-player games and 60 two-player games. These are fairly small games, each focusing on specific mechanics or skills the players should be able to demonstrate, including clones of classic arcade games such as Space Invaders, puzzle games like Sokoban, adventure games like Zelda or game-theory problems such as the Iterative Prisoners Dilemma. All games are real-time and require players to make decisions in only 40ms at every game tick, although not all games explicitly reward or require fast reactions; in fact, some of the best game-playing approaches add up the time in the beginning of the game to run Breadth-First Search in puzzle games in order to find an accurate solution. However, given the large variety of games (many of which are stochastic and difficult to predict accurately), scoring systems and termination conditions, all unknown to the players, highly-adaptive genera

34 PAPERS • NO BENCHMARKS YET

VLN-CE (Vision-and-Language Navigation in Continuous Environments)

Vision and Language Navigation in Continuous Environments (VLN-CE) is an instruction-guided navigation task with crowdsourced instructions, realistic environments, and unconstrained agent navigation. The dataset consists of 4475 trajectories converted from Room-to-Room train and validation splits. For each trajectory, multiple natural language instructions from Room-to-Room and a pre-computed shortest path are provided following the waypoints via low-level actions.

34 PAPERS • 1 BENCHMARK

ARKitScenes

ARKitScenes is an RGB-D dataset captured with the widely available Apple LiDAR scanner. Along with the per-frame raw data (Wide Camera RGB, Ultra Wide camera RGB, LiDar scanner depth, IMU) the authors also provide the estimated ARKit camera pose and ARKit scene reconstruction for each iPad Pro sequence. In addition to the raw and processed data from the mobile device, ARKit.

31 PAPERS • 1 BENCHMARK

Fashion-Gen

Fashion-Gen consists of 293,008 high definition (1360 x 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles.

30 PAPERS • NO BENCHMARKS YET

UT Zappos50K

UT Zappos50K is a large shoe dataset consisting of 50,025 catalog images collected from Zappos.com. The images are divided into 4 major categories — shoes, sandals, slippers, and boots — followed by functional types and individual brands. The shoes are centered on a white background and pictured in the same orientation for convenient analysis.

30 PAPERS • 1 BENCHMARK

Multi-Modal CelebA-HQ

Multi-Modal-CelebA-HQ is a large-scale face image dataset that has 30,000 high-resolution face images selected from the CelebA dataset by following CelebA-HQ. Each image has high-quality segmentation mask, sketch, descriptive text, and image with transparent background.

27 PAPERS • 3 BENCHMARKS

Multi-dSprites

27 PAPERS • 1 BENCHMARK

LHQ (Landscapes High-Quality)

A dataset of 90,000 high-resolution nature landscape images, crawled from Unsplash and Flickr and preprocessed with Mask R-CNN and Inception V3.

24 PAPERS • 4 BENCHMARKS

ShapeStacks

A simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives richly annotated regarding semantics and structural stability.

21 PAPERS • 2 BENCHMARKS

MMAct

MMAct is a large-scale dataset for multi/cross modal action understanding. This dataset has been recorded from 20 distinct subjects with seven different types of modalities: RGB videos, keypoints, acceleration, gyroscope, orientation, Wi-Fi and pressure signal. The dataset consists of more than 36k video clips for 37 action classes covering a wide range of daily life activities such as desktop-related and check-in-based ones in four different distinct scenarios.

18 PAPERS • 1 BENCHMARK

ActivityNet Entities

ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data, and importantly, evaluate how grounded or "true" such model are to the video they describe.

15 PAPERS • NO BENCHMARKS YET

BCI (Breast Cancer Immunohistochemical Image Generation)

The evaluation of human epidermal growth factor receptor 2 (HER2) expression is essential to formulate a precise treatment for breast cancer. The routine evaluation of HER2 is conducted with immunohistochemical techniques (IHC), which is very expensive. Therefore, we propose a breast cancer immunohistochemical (BCI) benchmark attempting to synthesize IHC data directly with the paired hematoxylin and eosin (HE) stained images. The dataset contains 4870 registered image pairs, covering a variety of HER2 expression levels (0, 1+, 2+, 3+).

10 PAPERS • 1 BENCHMARK

Binarized MNIST

A binarized version of MNIST.

10 PAPERS • 1 BENCHMARK

StyleGAN-Human

A large-scale human image dataset with over 230K samples capturing diverse poses and textures.

10 PAPERS • NO BENCHMARKS YET

ArtBench-10 (32x32)

We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced while most previous artwork datasets suffer from the long tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions (32×32, 256×256, and original image size), formatted in a way that is easy to be incorporated by popular machine learning frameworks.

7 PAPERS • 1 BENCHMARK

KMNIST

7 PAPERS • 2 BENCHMARKS

YouTube Driving

YouTube Driving Dataset contains a massive amount of real-world driving frames with various conditions, from different weather, different regions, to diverse scene types

7 PAPERS • 1 BENCHMARK

ChineseFoodNet

ChineseFoodNet aims to automatically recognizing pictured Chinese dishes. Most of the existing food image datasets collected food images either from recipe pictures or selfie. In the dataset, images of each food category of the dataset consists of not only web recipe and menu pictures but photos taken from real dishes, recipe and menu as well. ChineseFoodNet contains over 180,000 food photos of 208 categories, with each category covering a large variations in presentations of same Chinese food.

6 PAPERS • NO BENCHMARKS YET

Datasets

67 dataset results for Image Generation