The ImageNet dataset contains 14,197,122 annotated images organized according to the WordNet hierarchy. Since 2010 the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with a width of 50 pixels and a height of 30 pixels”. The ImageNet project does not own the copyright of the images; therefore, only thumbnails and URLs of images are provided.
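As a rough illustration, the two annotation categories can be pictured as simple records. The field names below are hypothetical, not the official ILSVRC annotation format:

```python
# Hypothetical records for the two ILSVRC annotation categories;
# not the official annotation file format.
image_level = {
    "image": "ILSVRC2012_val_00000001.JPEG",
    "present": ["car"],    # object classes present in the image
    "absent": ["tiger"],   # object classes confirmed absent
}

object_level = {
    "image": "ILSVRC2012_val_00000002.JPEG",
    "class": "screwdriver",
    # tight bounding box given as center (x, y) plus width/height,
    # matching the screwdriver example above
    "bbox": {"cx": 20, "cy": 25, "w": 50, "h": 30},
}
```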
13,520 PAPERS • 41 BENCHMARKS
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
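COCO annotations are typically browsed through the official pycocotools package; a minimal sketch, assuming the val2017 annotation file has been downloaded to a local path:

```python
from pycocotools.coco import COCO

# Load the instance annotations (the path is an assumption; adjust locally).
coco = COCO("annotations/instances_val2017.json")

img_ids = coco.getImgIds()
img = coco.loadImgs(img_ids[0])[0]       # dict with 'file_name', 'height', 'width', ...

ann_ids = coco.getAnnIds(imgIds=img["id"])
for ann in coco.loadAnns(ann_ids):
    cat = coco.loadCats(ann["category_id"])[0]["name"]
    print(cat, ann["bbox"])              # bbox is [x, y, width, height]
```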
10,222 PAPERS • 93 BENCHMARKS
The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients, with both frontal and lateral views available. The associated task is automated chest X-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.
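A minimal sketch of one common way to handle the uncertainty labels, following the U-Ones/U-Zeros policies described in the CheXpert paper; treating unmentioned findings (NaN) as negative is an assumption here, not a dataset requirement:

```python
import numpy as np

def map_uncertain(labels: np.ndarray, policy: str = "U-Ones") -> np.ndarray:
    """Map CheXpert labels: 1.0 positive, 0.0 negative, -1.0 uncertain, NaN unmentioned."""
    mapped = labels.copy()
    mapped[np.isnan(mapped)] = 0.0  # assumption: unmentioned -> negative
    mapped[mapped == -1.0] = 1.0 if policy == "U-Ones" else 0.0  # U-Zeros otherwise
    return mapped
```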
514 PAPERS • 1 BENCHMARK
ImageNet-C is an open-source dataset that consists of algorithmically generated corruptions (e.g., blur, noise) applied to the ImageNet test set.
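The benchmark defines 15 corruption types, each at 5 severity levels; the official corruption functions come from the authors' released code. Below are illustrative stand-ins for two of them (the severity-to-noise mapping is made up):

```python
import numpy as np
from PIL import Image, ImageFilter

def gaussian_noise(img: Image.Image, severity: int = 3) -> Image.Image:
    # Hypothetical severity scale; the official benchmark uses its own constants.
    scale = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    arr = np.asarray(img).astype(np.float32) / 255.0
    arr = np.clip(arr + np.random.normal(0.0, scale, arr.shape), 0.0, 1.0)
    return Image.fromarray((arr * 255).astype(np.uint8))

def gaussian_blur(img: Image.Image, severity: int = 3) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius=severity))
```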
513 PAPERS • 3 BENCHMARKS
Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, and curved text instances. The training split and testing split have 1,255 images and 300 images, respectively.
143 PAPERS • 2 BENCHMARKS
JFT-300M is an internal Google dataset used for training image classification models. Images are labeled using an algorithm that combines a complex mixture of raw web signals, connections between web pages, and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize the label precision of the selected images.
119 PAPERS • 1 BENCHMARK
The MSRA-TD500 dataset is a text detection dataset that contains 300 training images and 200 test images. Text regions are arbitrarily oriented and annotated at the sentence level. Unlike most other text detection datasets, it contains both English and Chinese text.
The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities, such as making a sandwich, tea, or coffee. Each activity is performed by four different people, for a total of 28 videos. Each video is approximately one minute long and contains about 20 fine-grained action instances, such as “take bread” or “pour ketchup”.
105 PAPERS • 2 BENCHMARKS
The CrowdPose dataset contains about 20,000 images and a total of 80,000 human poses with 14 labeled keypoints. The test set includes 8,000 images. The crowded images containing humans are extracted from the MSCOCO, MPII and AI Challenger datasets.
85 PAPERS • 2 BENCHMARKS
The KVASIR Dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images obtained from the GI tract via an endoscopy procedure. The dataset is composed of images that are annotated and verified by medical doctors, and captures 8 different classes. The classes are based on three anatomical landmarks (z-line, pylorus, cecum), three pathological findings (esophagitis, polyps, ulcerative colitis) and two other classes (dyed and lifted polyps, dyed resection margins) related to the polyp removal process. Overall, the dataset contains 8,000 endoscopic images, with 1,000 image examples per class.
85 PAPERS • 3 BENCHMARKS
The LUNA16 (LUng Nodule Analysis 2016) dataset is a benchmark for automated lung nodule detection. It consists of 1,186 lung nodules annotated in 888 CT scans.
84 PAPERS • 1 BENCHMARK
MPIIGaze is a dataset for appearance-based gaze estimation in the wild. It contains 213,659 images collected from 15 participants during natural everyday laptop use over more than three months. It has a large variability in appearance and illumination.
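Appearance-based gaze estimation is commonly evaluated by the mean angular error between predicted and ground-truth 3D gaze direction vectors; a minimal sketch:

```python
import numpy as np

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Angle in degrees between two 3D gaze direction vectors."""
    cos = np.dot(pred, gt) / (np.linalg.norm(pred) * np.linalg.norm(gt))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```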
76 PAPERS • 1 BENCHMARK
MuPoTs-3D (Multi-person Pose estimation Test Set in 3D) is a dataset for pose estimation composed of more than 8,000 frames from 20 real-world scenes with up to three subjects. The poses are annotated with a 14-point skeleton model.
66 PAPERS • 4 BENCHMARKS
The LIP (Look Into Person) dataset is a large-scale dataset focusing on semantic understanding of a person. It contains 50,000 images with elaborate pixel-wise annotations of 19 semantic human part labels and 2D human poses with 16 key points. The images are collected from real-world scenarios, and the subjects appear in challenging poses and viewpoints, with heavy occlusion, varied appearance, and low resolution.
59 PAPERS • 1 BENCHMARK
PA-100K is a recently proposed large-scale pedestrian attribute dataset, with 100,000 images in total collected from outdoor surveillance cameras. It is split into 80,000 images for the training set, 10,000 for the validation set, and 10,000 for the test set. Each image is labeled with 26 binary attributes. The images are blurry due to their relatively low resolution, and the positive ratio of each binary attribute is low.
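The low positive ratio can be checked directly from the label matrix; a sketch under the assumption that the 26 binary attributes for the training split have been loaded into an (N, 26) array (loading is dataset-specific and omitted):

```python
import numpy as np

labels = np.zeros((80_000, 26), dtype=np.int64)  # placeholder for the real label matrix
positive_ratio = labels.mean(axis=0)             # per-attribute fraction of positive labels
print(positive_ratio.round(3))                   # typically well below 0.5 for most attributes
```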
46 PAPERS • 1 BENCHMARK
CityFlow is a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km. The dataset contains more than 200K annotated bounding boxes covering a wide range of scenes, viewing angles, vehicle models, and urban traffic flow conditions.
43 PAPERS • 2 BENCHMARKS
The SCUT-CTW1500 dataset contains 1,500 images: 1,000 for training and 500 for testing. In particular, it provides 10,751 cropped text instance images, including 3,530 with curved text. The images are manually harvested from the Internet, from image libraries such as Google Open-Image, or from phone cameras. The dataset also contains many horizontal and multi-oriented text instances.
41 PAPERS • 3 BENCHMARKS
CVC-ClinicDB is an open-access dataset of 612 images with a resolution of 384×288 pixels, extracted from 31 colonoscopy sequences. It is used for medical image segmentation, in particular polyp detection in colonoscopy videos.
39 PAPERS • 1 BENCHMARK
LiTS17 is a liver tumor segmentation benchmark. The data and segmentations are provided by various clinical sites around the world. The training set contains 130 CT scans and the test set contains 70 CT scans.
39 PAPERS • 3 BENCHMARKS
SLAKE is an English-Chinese bilingual dataset consisting of 642 images and 14,028 question-answer pairs for training and testing Med-VQA systems.
31 PAPERS • 1 BENCHMARK
MathVista is a consolidated benchmark for mathematical reasoning in visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of the visual perception and mathematical reasoning challenges within the benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
28 PAPERS • NO BENCHMARKS YET
The ISIC 2018 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. This Task 1 dataset corresponds to the lesion segmentation challenge and includes 2,594 images.
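Lesion segmentation on this benchmark is scored with overlap metrics (the challenge used a Jaccard-based score); a minimal sketch of Dice and Jaccard on binary masks:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

def jaccard(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))
```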
23 PAPERS • 1 BENCHMARK
The MS-CXR dataset provides 1,162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. The dataset complements the existing MIMIC-CXR v.2 dataset and comprises: (1) reviewed and edited bounding boxes and phrases (1,026 bounding box/sentence pairs); and (2) manual bounding box labels created from scratch (136 bounding box/sentence pairs).
22 PAPERS • NO BENCHMARKS YET
The Chinese City Parking Dataset (CCPD) is a dataset for license plate detection and recognition. It contains over 250k unique car images, with license plate location annotations.
21 PAPERS • NO BENCHMARKS YET
The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
21 PAPERS • 13 BENCHMARKS
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
20 PAPERS • 3 BENCHMARKS
Multicultural Reasoning over Vision and Language (MaRVL) is a dataset based on an ImageNet-style hierarchy representative of many languages and cultures (Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish). The selection of both concepts and images is entirely driven by native speakers, who then provide statements about pairs of images. The task consists in discriminating whether each grounded statement is true or false.
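An illustrative MaRVL-style example record; the field names below are hypothetical placeholders, not the released file schema:

```python
example = {
    "language": "sw",                       # e.g., Swahili
    "left_image": "concept_12/image_3.jpg",
    "right_image": "concept_12/image_7.jpg",
    "statement": "...",                     # grounded statement elicited from a native speaker
    "label": False,                         # is the statement true of the image pair?
}
```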
18 PAPERS • 3 BENCHMARKS
CASIA-HWDB is a dataset for handwritten Chinese character recognition. It contains 300 files (240 in the HWDB1.1 training set and 60 in the HWDB1.1 test set). Each file contains about 3,000 isolated gray-scale Chinese character images written by one writer, as well as their corresponding labels.
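The isolated-character data is distributed in binary .gnt files; a minimal reader sketch based on the commonly documented record layout (verify against the official format description before relying on it):

```python
import struct

def read_gnt(path):
    """Yield (label, width, height, bitmap) records from a CASIA-HWDB .gnt file.

    Assumed layout per record: uint32 sample size, 2-byte GB2312 tag code,
    uint16 width, uint16 height, then width*height grayscale bytes.
    """
    with open(path, "rb") as f:
        while header := f.read(10):
            _size, tag, w, h = struct.unpack("<I2sHH", header)
            bitmap = f.read(w * h)  # one grayscale byte per pixel
            label = tag.decode("gb2312", errors="replace").rstrip("\x00")
            yield label, w, h, bitmap
```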
17 PAPERS • NO BENCHMARKS YET
EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper head with complex layouts and background, including a total of 15,771 Chinese handwritten or printed text instances.
15 PAPERS • 2 BENCHMARKS
The Chaoyang dataset contains 1,111 normal, 842 serrated, 1,404 adenocarcinoma, and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma, and 273 adenoma samples for testing. This noisy-label dataset is constructed from a real-world clinical scenario.
12 PAPERS • 2 BENCHMARKS
Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors, one can craft a dataset tailored to answer specific questions about a given algorithm.
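Not the Synbols API itself, but a generic sketch of the idea: fix a distribution over latent factors, sample attribute sets from it, and render each sample into an image (rendering omitted):

```python
import random

# Hypothetical latent factors; Synbols exposes its own set (font, resolution,
# background, etc.) through its generator.
latent_factors = {
    "char": lambda: random.choice("abcdefghij"),
    "font_size": lambda: random.randint(12, 48),
    "rotation_deg": lambda: random.uniform(-30.0, 30.0),
    "background": lambda: random.choice(["solid", "gradient", "natural"]),
}

def sample_latents() -> dict:
    return {name: draw() for name, draw in latent_factors.items()}

dataset_spec = [sample_latents() for _ in range(1_000)]  # one latent record per image
```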
11 PAPERS • NO BENCHMARKS YET
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
10 PAPERS • NO BENCHMARKS YET
High-Resolution Ship Collections 2016 (HRSC2016) is a dataset of high-resolution remote sensing images used for ship detection research. All of the images in HRSC2016 were collected from Google Earth.
OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attributes of traffic elements and the topology relationships among detected objects.
8 PAPERS • 1 BENCHMARK
Wukong is a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods, facilitating vision-language pre-training (VLP) research. The dataset contains 100 million Chinese image-text pairs collected from the web, using a base query list filtered according to the frequency of Chinese words and phrases.
8 PAPERS • NO BENCHMARKS YET
ChineseFoodNet is a dataset for automatically recognizing pictured Chinese dishes. Most existing food image datasets collect food images from either recipe pictures or selfies; in this dataset, the images of each food category consist not only of web recipe and menu pictures but also of photos taken of real dishes, recipes, and menus. ChineseFoodNet contains over 180,000 food photos across 208 categories, with each category covering large variations in the presentation of the same Chinese food.
6 PAPERS • NO BENCHMARKS YET
The Chinese Academy of Sciences Micro-Expression dataset (CASME II) consists of 255 videos elicited from 26 participants. The videos are recorded using a Point Grey GRAS-03K2C camera with a frame rate of 200 fps. The average video length is 0.34 s, equivalent to 68 frames. Each video's emotion label is annotated by two coders, with a reliability of 0.846.
5 PAPERS • 1 BENCHMARK
Chinese Text in the Wild is a dataset of Chinese text containing about 1 million Chinese character instances, covering 3,850 unique characters, annotated by experts in over 30,000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc.
5 PAPERS • NO BENCHMARKS YET
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near-infrared) and reflectivity imagery for multi-modal autonomous driving applications. Compared to existing autonomous driving datasets, DurLAR offers several novel features.
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these state-of-the-art VLMs can be a double-edged sword: they may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. Conversely, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which are then translated into confident mistakes by the LLMs.
MMEA-UMVM is a proposed dataset for further analysis of visual modality incompleteness, used to benchmark the latest multi-modal entity alignment (MMEA) models.
5 PAPERS • 7 BENCHMARKS
A dataset enlarged to study how image backgrounds affect computer vision models, covering the following topics: blurred backgrounds, segmented backgrounds, AI-generated backgrounds, annotation-tool bias, background color, background-dependent factors, latent-space distance of the foreground, and random backgrounds from real environments.
The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images present great variations in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination and background.
4 PAPERS • 1 BENCHMARK
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
A transformation of the ImageNet-1K classification dataset for Chinese models, with labels and prompts translated into Chinese.
3 PAPERS • 1 BENCHMARK
A large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images). MMChat contains image-grounded dialogues collected from real conversations on social media. 100K dialogues from MMChat are manually annotated with dialogue quality and whether the dialogues are related to the given image. The rule-filtered raw dialogues used to create MMChat are also provided (Rule Filtered Raw MMChat), containing 4.257M dialogue sessions and 4.874M images, along with a version of MMChat filtered based on LCCC (LCCC Filtered MMChat), which contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).
3 PAPERS • NO BENCHMARKS YET