The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
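COCO annotations ship as JSON, with person keypoints stored as flat `[x, y, visibility]` triplets. A minimal sketch of that layout using only the standard library — the file names, ids, and coordinate values below are illustrative, not real COCO data:

```python
import json

# An illustrative fragment in the COCO keypoint-annotation layout.
# Real files (e.g. person_keypoints_train2017.json) are far larger.
coco_like = {
    "images": [{"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}],
    "annotations": [{
        "id": 10, "image_id": 1, "category_id": 1,
        "bbox": [100.0, 120.0, 80.0, 200.0],  # [x, y, width, height]
        # keypoints: flat [x1, y1, v1, x2, y2, v2, ...];
        # v=0 not labeled, v=1 labeled but occluded, v=2 labeled and visible
        "keypoints": [110, 130, 2, 0, 0, 0, 115, 135, 1],
        "num_keypoints": 2,
    }],
    "categories": [{"id": 1, "name": "person"}],
}

def visible_keypoints(ann):
    """Return (x, y) pairs whose visibility flag is 2 (labeled and visible)."""
    kps = ann["keypoints"]
    return [(kps[i], kps[i + 1]) for i in range(0, len(kps), 3) if kps[i + 2] == 2]

ann = coco_like["annotations"][0]
print(visible_keypoints(ann))  # [(110, 130)]
```

The same triplet layout is reused by many downstream datasets that adopt the COCO format, which is why parsers like this transfer across them.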
11,378 PAPERS • 96 BENCHMARKS
The MPII Human Pose Dataset for single-person pose estimation is composed of about 25K images, of which 15K are training samples, 3K are validation samples, and 7K are testing samples (whose labels are withheld by the authors). The images are taken from YouTube videos covering 410 different human activities, and the poses are manually annotated with up to 16 body joints.
487 PAPERS • 4 BENCHMARKS
The Pascal3D+ multi-view dataset consists of images in the wild, i.e., images of object categories exhibiting high variability, captured under uncontrolled settings, in cluttered scenes and under many different poses. Pascal3D+ contains 12 categories of rigid objects selected from the PASCAL VOC 2012 dataset. These objects are annotated with pose information (azimuth, elevation and distance to camera). Pascal3D+ also adds pose annotated images of these 12 categories from the ImageNet dataset.
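Pascal3D+'s pose annotations (azimuth, elevation, distance) describe the camera viewpoint in spherical coordinates around the object. A small sketch of converting them to a Cartesian camera position — the axis convention here is a common one assumed for illustration, not Pascal3D+'s official toolkit code:

```python
import math

def camera_position(azimuth_deg, elevation_deg, distance):
    """Spherical viewpoint (azimuth, elevation, distance) -> Cartesian camera position.

    Assumed convention (illustrative): azimuth rotates about the vertical z-axis,
    elevation lifts the camera above the horizontal plane, and distance is the
    range from the object centre.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.sin(az)
    y = -distance * math.cos(el) * math.cos(az)
    z = distance * math.sin(el)
    return (x, y, z)

# A camera directly overhead: elevation 90 degrees, azimuth irrelevant.
print(camera_position(0.0, 90.0, 2.0))
```

With this parameterization, sweeping azimuth at fixed elevation traces the ring of viewpoints the multi-view annotations span.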
236 PAPERS • 1 BENCHMARK
This dataset focuses on heavily occluded humans, with comprehensive annotations including bounding boxes, human pose, and instance masks. It contains 13,360 elaborately annotated human instances within 5,081 images. With an average MaxIoU of 0.573 per person, OCHuman is the most complex and challenging dataset related to humans.
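The MaxIoU statistic quantifies occlusion: for each person, the highest IoU between their bounding box and any other person's box in the same image (this reading of the metric is an assumption for illustration, matching how OCHuman characterizes crowding). A self-contained sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def max_iou_per_person(boxes):
    """For each person box, the highest IoU with any *other* person box."""
    return [max((iou(b, o) for j, o in enumerate(boxes) if j != i), default=0.0)
            for i, b in enumerate(boxes)]

# Two overlapping people and one isolated person (toy boxes, not OCHuman data).
boxes = [[0, 0, 10, 10], [5, 0, 15, 10], [40, 40, 50, 50]]
print(max_iou_per_person(boxes))  # roughly [0.333, 0.333, 0.0]
```

An average MaxIoU of 0.573 means a typical OCHuman person overlaps more than half of another person's box, which is why pose and segmentation methods tuned on sparser datasets degrade here.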
59 PAPERS • 5 BENCHMARKS
KeypointNet is a large-scale and diverse 3D keypoint dataset containing 83,231 keypoints and 8,329 3D models from 16 object categories, built from ShapeNet models by aggregating numerous human annotations.
24 PAPERS • NO BENCHMARKS YET
ApolloCar3D is a dataset that contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. This dataset is over 20 times larger than PASCAL3D+ and KITTI, the previous state of the art.
17 PAPERS • 14 BENCHMARKS
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for measuring the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT aims to encourage the research community to pursue research toward general-purpose, robust vision systems.
13 PAPERS • 8 BENCHMARKS
AwA Pose is a large-scale animal keypoint dataset with ground-truth annotations for keypoint detection of quadruped animals from images.
4 PAPERS • NO BENCHMARKS YET
The researchers collected 3,500 images of Tilapia fish, with each image containing three fish in a small bowl. These images were manually annotated using Roboflow, a tool for creating and managing annotated datasets. Four keypoints were labeled on each fish: mouth, peduncle, belly, and back. While the primary goal was to measure fish length using the mouth and peduncle points, the additional keypoints (belly and back) were included to support potential future research, such as using girth to determine fish weight. This dataset was used to train a YOLOv8 model for keypoint detection, achieving high accuracy in identifying these crucial points on the Tilapia fish.
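Measuring fish length from the mouth and peduncle keypoints reduces to a scaled Euclidean distance between two predicted points. A minimal sketch — the function name, coordinates, and the `cm_per_pixel` calibration value are illustrative assumptions, not values from the paper:

```python
import math

def fish_length_cm(mouth_xy, peduncle_xy, cm_per_pixel):
    """Length estimate: pixel distance from mouth to peduncle, scaled to cm.

    cm_per_pixel would come from a calibration reference in the bowl;
    the value used below is illustrative only.
    """
    dx = mouth_xy[0] - peduncle_xy[0]
    dy = mouth_xy[1] - peduncle_xy[1]
    return math.hypot(dx, dy) * cm_per_pixel

# Toy keypoints for one fish: mouth at (120, 80), peduncle at (420, 80).
print(fish_length_cm((120, 80), (420, 80), cm_per_pixel=0.05))
```

The same pattern would extend to a girth estimate from the belly and back keypoints, which is presumably why those extra points were annotated.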
1 PAPER • NO BENCHMARKS YET
A multimodal dataset of radio galaxies and their corresponding infrared hosts.
Automating the creation of catalogues for radio galaxies in next-generation deep surveys necessitates the identification of components within extended sources and their respective infrared hosts. We present RadioGalaxyNET, a multimodal dataset, tailored for machine learning tasks to streamline the automated detection and localization of multi-component extended radio galaxies and their associated infrared hosts. The dataset encompasses 4,155 instances of galaxies across 2,800 images, incorporating both radio and infrared channels. Each instance furnishes details about the extended radio galaxy class, a bounding box covering all components, a pixel-level segmentation mask, and the keypoint position of the corresponding infrared host galaxy. RadioGalaxyNET is the first dataset to include images from the highly sensitive Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope, corresponding infrared images, and instance-level annotations for galaxy detection.
1 PAPER • 1 BENCHMARK
TAMPAR is a real-world dataset of parcel photos for tampering detection, with annotations in COCO format. For details, see the paper; for visual samples, see the project page.
The ViCoS Towel Dataset is a state-of-the-art benchmark for grasp point localization on cloth objects, specifically towels. Designed to advance research in robotic grasping and perception for textile objects, this dataset includes a collection of 8,000 high-resolution RGB-D images (1920×1080) captured with a Kinect V2 under a variety of conditions. Each image provides detailed depth information, making it ideal for training deep learning models and conducting thorough benchmarking.
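Because each image pairs RGB with aligned depth, a grasp point predicted in pixel coordinates can be lifted to a 3-D target via the standard pinhole back-projection. A sketch under assumed intrinsics — the `FX, FY, CX, CY` values are rough, commonly cited Kinect V2 figures for illustration, not calibrated constants from the dataset:

```python
def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) plus depth -> 3-D camera coordinates.

    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Illustrative intrinsics for a 1920x1080 Kinect V2 colour stream (assumed values).
FX, FY, CX, CY = 1060.0, 1060.0, 960.0, 540.0

# A grasp point detected at the principal point, 1.2 m from the camera.
print(deproject(960, 540, 1.2, FX, FY, CX, CY))  # (0.0, 0.0, 1.2)
```

In a robotic grasping pipeline, the resulting camera-frame point would then be transformed into the robot base frame via the hand-eye calibration.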