The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students) which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
4,176 PAPERS • 37 BENCHMARKS
The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
517 PAPERS • 11 BENCHMARKS
The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding images captured by a high-speed motion capture system. There are 4 high-resolution progressive scan cameras to acquire video data at 50 Hz. The dataset contains activities by 11 professional actors in 17 scenarios: discussion, smoking, taking photo, talking on the phone, etc., as well as provides accurate 3D joint positions and high-resolution videos.
330 PAPERS • 10 BENCHMARKS
The efforts to create a non-trivial and publicly available dataset for action recognition was initiated at the KTH Royal Institute of Technology in 2004. The KTH dataset is one of the most standard datasets, which contains six actions: walk, jog, run, box, hand-wave, and hand clap. To account for performance nuance, each action is performed by 25 different individuals, and the setting is systematically altered for each action per actor. Setting variations include: outdoor (s1), outdoor with scale variation (s2), outdoor with different clothes (s3), and indoor (s4). These variations test the ability of each algorithm to identify actions independent of the background, appearance of the actors, and the scale of the actors.
179 PAPERS • 1 BENCHMARK
The Moving MNIST dataset contains 10,000 video sequences, each consisting of 20 frames. In each video sequence, two digits move independently around the frame, which has a spatial resolution of 64×64 pixels. The digits frequently intersect with each other and bounce off the edges of the frame
119 PAPERS • 1 BENCHMARK
The YouTube-8M dataset is a large scale video dataset, which includes more than 7 million videos with 4716 classes labeled by the annotation system. The dataset consists of three parts: training set, validate set, and test set. In the training set, each class contains at least 100 training videos. Features of these videos are extracted by the state-of-the-art popular pre-trained models and released for public use. Each video contains audio and visual modality. Based on the visual information, videos are divided into 24 topics, such as sports, game, arts & entertainment, etc
96 PAPERS • 2 BENCHMARKS
The Kinetics-600 is a large-scale action recognition dataset which consists of around 480K videos from 600 action categories. The 480K videos are divided into 390K, 30K, 60K for training, validation and test sets, respectively. Each video in the dataset is a 10-second clip of action moment annotated from raw YouTube video. It is an extensions of the Kinetics-400 dataset.
45 PAPERS • 6 BENCHMARKS
The Sprites dataset contains 60 pixel color images of animated characters (sprites). There are 672 sprites, 500 for training, 100 for testing and 72 for validation. Each sprite has 20 animations and 178 images, so the full dataset has 120K images in total. There are many changes in the appearance of the sprites, they differ in their body shape, gender, hair, armor, arm type, greaves, and weapon.
41 PAPERS • 3 BENCHMARKS
Benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
12 PAPERS • 2 BENCHMARKS
The Robotic Pushing Dataset is a dataset for video prediction for real-world interactive agents which consists of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action.
11 PAPERS • NO BENCHMARKS YET
A satellite-based dataset called "CloudCast". It consists of 70080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The spatial resolution of the dataset is 928 × 1530 pixels (3 × 3 km per pixel) with 15-min intervals between frames for the period January 1, 2017, to December 31, 2018. All frames are centered and projected over Europe.
1 PAPER • NO BENCHMARKS YET
Satellite images are snapshots of the Earth surface. We propose to forecast them. We frame Earth surface forecasting as the task of predicting satellite imagery conditioned on future weather. EarthNet2021 is a large dataset suitable for training deep neural networks on the task. It contains Sentinel~2 satellite imagery at $20$~m resolution, matching topography and mesoscale ($1.28$~km) meteorological variables packaged into $32000$ samples. Additionally we frame EarthNet2021 as a challenge allowing for model intercomparison. Resulting forecasts will greatly improve ($>\times50$) over the spatial resolution found in numerical models. This allows localized impacts from extreme weather to be predicted, thus supporting downstream applications such as crop yield prediction, forest health assessments or biodiversity monitoring. Find data, code, and how to participate at www.earthnet.tech.
1 PAPER • NO BENCHMARKS YET
Indian Institute of Science VIdeo Naturalness Evaluation (IISc VINE) is a database consisting of 300 videos, obtained by applying different prediction models on different datasets, and accompanying human opinion scores.
1 PAPER • NO BENCHMARKS YET