The ImageNet dataset contains 14,197,122 images annotated according to the WordNet hierarchy. Since 2010 the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with a width of 50 pixels and a height of 30 pixels”. The ImageNet project does not own the copyright of the images; therefore, only thumbnails and URLs of images are provided.
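An illustrative sketch of the two annotation categories described above, using hypothetical Python data classes (the field names are assumptions, not an official ILSVRC file format):

```python
from dataclasses import dataclass

@dataclass
class ImageLevelAnnotation:
    # (1) image-level: binary presence/absence of an object class
    image_id: str
    class_name: str   # e.g. "car"
    present: bool     # True: "there are cars in this image"

@dataclass
class ObjectLevelAnnotation:
    # (2) object-level: tight bounding box plus class label
    image_id: str
    class_name: str   # e.g. "screwdriver"
    x_center: int     # pixels
    y_center: int
    width: int
    height: int

# The screwdriver example from the description above:
box = ObjectLevelAnnotation("img_0001", "screwdriver",
                            x_center=20, y_center=25, width=50, height=30)
print(box)
```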
9,882 PAPERS • 96 BENCHMARKS
The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set constructed from NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.
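A minimal sketch of the centering step described above (not the original NIST/MNIST code): a 20x20 size-normalized digit is placed into a 28x28 field so that its pixel center of mass lands at the center of the field.

```python
import numpy as np

def center_in_28x28(digit: np.ndarray) -> np.ndarray:
    """digit: float array of shape (20, 20); larger values mean more ink."""
    h, w = digit.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = digit.sum()
    cy = (ys * digit).sum() / total            # center of mass, row
    cx = (xs * digit).sum() / total            # center of mass, column
    # Offset that moves the center of mass to (13.5, 13.5), the center of
    # the 28x28 field; clipped so the 20x20 patch stays inside the field.
    top = int(np.clip(np.round(13.5 - cy), 0, 28 - h))
    left = int(np.clip(np.round(13.5 - cx), 0, 28 - w))
    out = np.zeros((28, 28), dtype=digit.dtype)
    out[top:top + h, left:left + w] = digit
    return out

print(center_in_28x28(np.random.rand(20, 20)).shape)   # (28, 28)
```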
5,861 PAPERS • 49 BENCHMARKS
The CIFAR-100 dataset (Canadian Institute for Advanced Research, 100 classes) is a subset of the Tiny Images dataset and consists of 60,000 32x32 color images. The 100 classes in CIFAR-100 are grouped into 20 superclasses. There are 600 images per class. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 500 training images and 100 testing images per class.
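A minimal loading sketch, assuming the Python version of the CIFAR-100 archive has been downloaded and extracted to ./cifar-100-python/ (the path is an assumption); it reads both the "fine" and "coarse" labels described above.

```python
import pickle
import numpy as np

with open("cifar-100-python/train", "rb") as f:
    batch = pickle.load(f, encoding="bytes")

images = batch[b"data"].reshape(-1, 3, 32, 32)   # 32x32 color images, channels first
fine = np.array(batch[b"fine_labels"])           # 100 class labels
coarse = np.array(batch[b"coarse_labels"])       # 20 superclass labels
print(images.shape, fine.max() + 1, coarse.max() + 1)   # (50000, 3, 32, 32) 100 20
```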
5,205 PAPERS • 39 BENCHMARKS
Oxford 102 Flower is an image classification dataset consisting of 102 flower categories. The flowers were chosen to be species commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images.
618 PAPERS • 14 BENCHMARKS
The Stanford Cars dataset consists of 196 classes of cars with a total of 16,185 images, taken from the rear. The data is divided into almost a 50-50 train/test split with 8,144 training images and 8,041 testing images. Categories are typically at the level of Make, Model, Year. The images are 360×240.
386 PAPERS • 8 BENCHMARKS
The Sketch dataset contains over 20,000 sketches evenly distributed over 250 object categories.
176 PAPERS • 1 BENCHMARK
Permuted MNIST is an MNIST variant that consists of 70,000 images of handwritten digits from 0 to 9, where 60,000 images are used for training and 10,000 images for testing. It differs from the original MNIST in that each of the ten tasks is a multi-class classification problem over a different fixed random permutation of the input pixels.
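A minimal sketch of how such tasks are typically constructed (dummy data stands in for MNIST; the exact number of tasks and whether the first task keeps the identity permutation vary between papers):

```python
import numpy as np

def make_permuted_tasks(images: np.ndarray, num_tasks: int = 10, seed: int = 0):
    """images: (N, 28, 28) array; returns a list of (N, 784) arrays, one per task."""
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)        # flatten to 784 pixels
    tasks = []
    for _ in range(num_tasks):
        perm = rng.permutation(flat.shape[1])     # one fixed permutation per task
        tasks.append(flat[:, perm])               # applied identically to every image
    return tasks

dummy = np.random.rand(100, 28, 28)               # stand-in for MNIST images
tasks = make_permuted_tasks(dummy)
print(len(tasks), tasks[0].shape)                 # 10 (100, 784)
```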
96 PAPERS • 2 BENCHMARKS
CORe50 is a dataset designed for assessing Continual Learning techniques in an Object Recognition context.
77 PAPERS • NO BENCHMARKS YET
WikiArt contains paintings from 195 different artists. The dataset has 42,129 images for training and 10,628 images for testing.
46 PAPERS • 2 BENCHMARKS
With social media becoming increasingly popular as a place where news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, this is the first large-scale dataset for QA over social media data. To make sure the tweets are meaningful and contain interesting information, tweets used by journalists to write news articles were gathered. Human annotators were then asked to write questions and answers about these tweets. Unlike other QA datasets such as SQuAD, in which the answers are extractive, the answers here are allowed to be abstractive. The task requires a model to read a short tweet and a question and output a text phrase (which does not need to appear in the tweet) as the answer.
12 PAPERS • 1 BENCHMARK
A set of 19 ASC datasets (reviews of 19 products) producing a sequence of 19 tasks. Each dataset represents a task. The datasets come from 4 sources: (1) HL5Domains (Hu and Liu, 2004) with reviews of 5 products; (2) Liu3Domains (Liu et al., 2015) with reviews of 3 products; (3) Ding9Domains (Ding et al., 2008) with reviews of 9 products; and (4) SemEval14 with reviews of 2 products (SemEval 2014 Task 4, laptop and restaurant). For (1), (2) and (3), about 10% of the original data is split off as validation data and another 10% as test data. For (4), 150 examples from the training set are used for validation. To be consistent with existing research (Tang et al., 2016), examples with the conflicting polarity (both positive and negative sentiments expressed about an aspect term) are not used. Statistics and details of the 19 datasets are given at https://github.com/ZixuanKe/PyContinual.
11 PAPERS • 1 BENCHMARK
This dataset has 20 classes, and each class has about 1,000 documents. The data split for train/validation/test is 1,600/200/200. We created 10 tasks with 2 classes per task. Since this is topic-based text classification data, the classes are very different and share little knowledge. As mentioned above, this application (and dataset) is mainly used to show a CL model's ability to overcome forgetting. Detailed statistics are available at https://github.com/ZixuanKe/PyContinual.
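A minimal sketch of the task construction described above: the 20 topic classes are partitioned into 10 sequential tasks of 2 classes each (class ids 0–19 are used here purely for illustration).

```python
def make_tasks(class_ids, classes_per_task=2):
    """Partition a list of class ids into consecutive tasks."""
    return [class_ids[i:i + classes_per_task]
            for i in range(0, len(class_ids), classes_per_task)]

tasks = make_tasks(list(range(20)))
print(len(tasks))   # 10 tasks
print(tasks[0])     # [0, 1] -- the two classes of the first task
```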
10 PAPERS • 1 BENCHMARK
The goal of this challenge is to solve ten image classification problems simultaneously, representative of very different visual domains. The data for each domain is obtained from an existing image classification benchmark.
8 PAPERS • 1 BENCHMARK
Continual World is a benchmark consisting of realistic and meaningfully diverse robotic tasks built on top of Meta-World as a testbed.
6 PAPERS • NO BENCHMARKS YET
A set of 10 DSC datasets (reviews of 10 products) used to produce sequences of tasks. The products are Sports, Toys, Tools, Video, Pet, Musical, Movies, Garden, Offices, and Kindle. Each task has 2,500 positive and 2,500 negative training reviews, 250 positive and 250 negative validation reviews, and 250 positive and 250 negative test reviews. Detailed statistics are available at https://github.com/ZixuanKe/PyContinual.
6 PAPERS • 1 BENCHMARK
F-CelebA - This dataset is adapted from federated learning. Federated learning is an emerging machine learning paradigm with an emphasis on data privacy: the idea is to train through model aggregation rather than conventional data aggregation, keeping local data on the local device. This dataset naturally consists of similar tasks, and each of the 10 tasks contains images of a celebrity labeled by whether he/she is smiling or not. For more details see https://github.com/ZixuanKe/CAT.
ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs, and the corresponding scene locations. ROAD comprises videos originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event.
HASY is a dataset of single symbols similar to MNIST. It contains 168,233 instances of 369 classes. HASY contains two challenges: A classification challenge with 10 pre-defined folds for 10-fold cross-validation and a verification challenge.
3 PAPERS • NO BENCHMARKS YET
The (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) Object Recognition Dataset (OpenLORIS-Object) is designed to accelerate lifelong/continual/incremental learning research and applications, currently focusing on improving the continual learning capability for common objects in the home scenario.
2 PAPERS • NO BENCHMARKS YET
BeGin provides 23 benchmark scenarios for graphs from 14 real-world datasets, covering 12 combinations of incremental settings and problem levels. In addition, BeGin provides various basic evaluation metrics for measuring performance and final evaluation metrics designed for continual learning.
1 PAPER • NO BENCHMARKS YET
Provides two large-scale multi-step benchmarks for biometric identification, where the visual appearance of different classes is highly relevant.
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset designed especially for object classification on mobile robots operating in a changing environment (such as a household), where it is important to learn new, previously unseen objects on the fly. The dataset can also be used for other learning use cases, such as instance segmentation or depth estimation, or wherever household objects or continual learning are of interest.
1 PAPER • 2 BENCHMARKS
TemporalWiki is a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM's ability to retain previous knowledge and acquire updated/new knowledge at each point in time.
Wild-Time is a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.