🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

261 dataset results for Object Detection

MS COCO (Microsoft Common Objects in Context)

The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

10,147 PAPERS • 92 BENCHMARKS

KITTI

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odome

3,219 PAPERS • 141 BENCHMARKS

nuScenes

The nuScenes dataset is a large-scale autonomous driving dataset. The dataset has 3D bounding boxes for 1000 scenes collected in Boston and Singapore. Each scene is 20 seconds long and annotated at 2Hz. This results in a total of 28130 samples for training, 6019 samples for validation and 6008 samples for testing. The dataset has the full autonomous vehicle data suite: 32-beam LiDAR, 6 cameras and radars with complete 360° coverage. The 3D object detection challenge evaluates the performance on 10 classes: cars, trucks, buses, trailers, construction vehicles, pedestrians, motorcycles, bicycles, traffic cones and barriers.

1,549 PAPERS • 20 BENCHMARKS

Visual Genome

Visual Genome contains Visual Question Answering data in a multi-choice setting. It consists of 101,174 images from MSCOCO with 1.7 million QA pairs, 17 questions per image on average. Compared to the Visual Question Answering dataset, Visual Genome represents a more balanced distribution over 6 question types: What, Where, When, Who, Why and How. The Visual Genome dataset also presents 108K images with densely annotated objects, attributes and relationships.

1,140 PAPERS • 19 BENCHMARKS

LVIS

LVIS is a dataset for long tail instance segmentation. It has annotations for over 1000 object categories in 164k images.

434 PAPERS • 14 BENCHMARKS

SUN RGB-D

The SUN RGBD dataset contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. As many as 700 object categories are labeled. The training and testing sets contain 5285 and 5050 images, respectively.

422 PAPERS • 13 BENCHMARKS

Waymo Open Dataset

The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions.

373 PAPERS • 12 BENCHMARKS

BDD100K

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in thi

359 PAPERS • 16 BENCHMARKS

MVTecAD (MVTEC ANOMALY DETECTION DATASET)

MVTec AD is a dataset for benchmarking anomaly detection methods with a focus on industrial inspection. It contains over 5000 high-resolution images divided into fifteen different object and texture categories. Each category comprises a set of defect-free training images and a test set of images with various kinds of defects as well as images without defects.

287 PAPERS • 4 BENCHMARKS

Manga109

Manga109 has been compiled by the Aizawa Yamasaki Matsui Laboratory, Department of Information and Communication Engineering, the Graduate School of Information Science and Technology, the University of Tokyo. The compilation is intended for use in academic research on the media processing of Japanese manga. Manga109 is composed of 109 manga volumes drawn by professional manga artists in Japan. These manga were commercially made available to the public between the 1970s and 2010s, and encompass a wide range of target readerships and genres (see the table in Explore for further details.) Most of the manga in the compilation are available at the manga library “Manga Library Z” (formerly the “Zeppan Manga Toshokan” library of out-of-print manga).

249 PAPERS • 12 BENCHMARKS

PASCAL3D+

The Pascal3D+ multi-view dataset consists of images in the wild, i.e., images of object categories exhibiting high variability, captured under uncontrolled settings, in cluttered scenes and under many different poses. Pascal3D+ contains 12 categories of rigid objects selected from the PASCAL VOC 2012 dataset. These objects are annotated with pose information (azimuth, elevation and distance to camera). Pascal3D+ also adds pose annotated images of these 12 categories from the ImageNet dataset.

231 PAPERS • 1 BENCHMARK

Foggy Cityscapes

Foggy Cityscapes is a synthetic foggy dataset which simulates fog on real scenes. Each foggy image is rendered with a clear image and depth map from Cityscapes. Thus the annotations and data split in Foggy Cityscapes are inherited from Cityscapes.

207 PAPERS • 6 BENCHMARKS

LabelMe

LabelMe database is a large collection of images with ground truth labels for object detection and recognition. The annotations come from two different sources, including the LabelMe online annotation tool.

167 PAPERS • 1 BENCHMARK

CrowdHuman

CrowdHuman is a large and rich-annotated human detection dataset, which contains 15,000, 4,370 and 5,000 images collected from the Internet for training, validation and testing respectively. The number is more than 10× boosted compared with previous challenging pedestrian detection dataset like CityPersons. The total number of persons is also noticeably larger than the others with ∼340k person and ∼99k ignore region annotations in the CrowdHuman training subset.

143 PAPERS • 2 BENCHMARKS

NoCaps

The nocaps benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets.

141 PAPERS • 13 BENCHMARKS

Kvasir-SEG

Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist.

140 PAPERS • 3 BENCHMARKS

PASCAL VOC (PASCAL Visual Object Classes Challenge)

The PASCAL Visual Object Classes (VOC) 2012 dataset contains 20 object categories including vehicles, household, animals, and other: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Each image in this dataset has pixel-level segmentation annotations, bounding box annotations, and object class annotations. This dataset has been widely used as a benchmark for object detection, semantic segmentation, and classification tasks. The PASCAL VOC dataset is split into three subsets: 1,464 images for training, 1,449 images for validation and a private testing set.

138 PAPERS • 22 BENCHMARKS

Objects365

Objects365 is a large-scale object detection dataset, Objects365, which has 365 object categories over 600K training images. More than 10 million, high-quality bounding boxes are manually labeled through a three-step, carefully designed annotation pipeline. It is the largest object detection dataset (with full annotation) so far and establishes a more challenging benchmark for the community.

134 PAPERS • 3 BENCHMARKS

PASCAL VOC 2007

PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are:

119 PAPERS • 14 BENCHMARKS

CityPersons

The CityPersons dataset is a subset of Cityscapes which only consists of person annotations. There are 2975 images for training, 500 and 1575 images for validation and testing. The average of the number of pedestrians in an image is 7. The visible-region and full-body annotations are provided.

118 PAPERS • 2 BENCHMARKS

fMoW (Functional Map of the World)

Functional Map of the World (fMoW) is a dataset that aims to inspire the development of machine learning models capable of predicting the functional purpose of buildings and land use from temporal sequences of satellite images and a rich set of metadata features.

107 PAPERS • NO BENCHMARKS YET

PubLayNet

PubLayNet is a dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated.

100 PAPERS • 1 BENCHMARK

Kvasir (The Kvasir Dataset)

The KVASIR Dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images obtained from the GI tract via an endoscopy procedure. The dataset is composed of images that are annotated and verified by medical doctors, and captures 8 different classes. The classes are based on three anatomical landmarks (z-line, pylorus, cecum), three pathological findings (esophagitis, polyps, ulcerative colitis) and two other classes (dyed and lifted polyps, dyed resection margins) related to the polyp removal process. Overall, the dataset contains 8,000 endoscopic images, with 1,000 image examples per class.

85 PAPERS • 3 BENCHMARKS

ABC Dataset

The ABC Dataset is a collection of one million Computer-Aided Design (CAD) models for research of geometric deep learning methods and applications. Each model is a collection of explicitly parametrized curves and surfaces, providing ground truth for differential quantities, patch segmentation, geometric feature detection, and shape reconstruction. Sampling the parametric descriptions of surfaces and curves allows generating data in different formats and resolutions, enabling fair comparisons for a wide range of geometric learning algorithms.

83 PAPERS • NO BENCHMARKS YET

UAVDT (Unmanned Aerial Vehicle Benchmark Object Detection and Tracking)

UAVDT is a large scale challenging UAV Detection and Tracking benchmark (i.e., about 80, 000 representative frames from 10 hours raw videos) for 3 important fundamental tasks, i.e., object DETection (DET), Single Object Tracking (SOT) and Multiple Object Tracking (MOT).

81 PAPERS • 1 BENCHMARK

xView (mejdi)

xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes. It contains over 1M object instances from 60 different classes.

81 PAPERS • 1 BENCHMARK

11k Hands

A large dataset of human hand images (dorsal and palmar sides) with detailed ground-truth information for gender recognition and biometric identification.

76 PAPERS • NO BENCHMARKS YET

ApolloScape

ApolloScape is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China under varying weather conditions. Pixel-wise semantic annotation of the recorded data is provided in 2D, with point-wise semantic annotation in 3D for 28 classes. In addition, the dataset contains lane marking annotations in 2D.

66 PAPERS • 5 BENCHMARKS

FSOD (Few-Shot Object Detection Dataset)

Few-Shot Object Detection Dataset (FSOD) is a high-diverse dataset specifically designed for few-shot object detection and intrinsically designed to evaluate thegenerality of a model on novel categories.

62 PAPERS • NO BENCHMARKS YET

VisDrone

VisDrone is a large-scale benchmark with carefully annotated ground-truth for various important computer vision tasks, to make vision meet drones. The VisDrone2019 dataset is collected by the AISKYEYE team at Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark dataset consists of 288 video clips formed by 261,908 frames and 10,209 static images, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities separated by thousands of kilometers in China), environment (urban and country), objects (pedestrian, vehicles, bicycles, etc.), and density (sparse and crowded scenes). Note that, the dataset was collected using various drone platforms (i.e., drones with different models), in different scenarios, and under various weather and lighting conditions. These frames are manually annotated with more than 2.6 million bounding boxes of targets of frequent interests, such as pedestrians, cars, bicycl

61 PAPERS • 2 BENCHMARKS

ETH (ETH Pedestrian)

ETH is a dataset for pedestrian detection. The testing set contains 1,804 images in three video clips. The dataset is captured from a stereo rig mounted on car, with a resolution of 640 x 480 (bayered), and a framerate of 13--14 FPS.

59 PAPERS • 5 BENCHMARKS

SPair-71k

SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger in number and contains more accurate and richer annotations.

58 PAPERS • 2 BENCHMARKS

iSAID

iSAID contains 655,451 object instances for 15 categories across 2,806 high-resolution images. The images of iSAID is the same as the DOTA-v1.0 dataset, which are manily collected from the Google Earth, some are taken by satellite JL-1, the others are taken by satellite GF-2 of the China Centre for Resources Satellite Data and Application.

57 PAPERS • 3 BENCHMARKS

MSCOCO

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

55 PAPERS • NO BENCHMARKS YET

PASCAL-Part

PASCAL-Part is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL object detection task by providing segmentation masks for each body part of the object. For categories that do not have a consistent set of parts (e.g., boat), it provides the silhouette annotation.

55 PAPERS • 4 BENCHMARKS

UIEB

UIEB (Underwater Image Enhancement Benchmark Dataset)

Includes 950 real-world underwater images, 890 of which have the corresponding reference images.

54 PAPERS • 1 BENCHMARK

DAQUAR

DAQUAR (DAtaset for QUestion Answering on Real-world images) is a dataset of human question answer pairs about images.

51 PAPERS • NO BENCHMARKS YET

UA-DETRAC

Consists of 100 challenging video sequences captured from real-world traffic scenes (over 140,000 frames with rich annotations, including occlusion, weather, vehicle category, truncation, and vehicle bounding boxes) for object detection, object tracking and MOT system.

50 PAPERS • 1 BENCHMARK

LLVIP (A Visible-infrared Paired Dataset for Low-light Vision)

Visible-infrared Paired Dataset for Low-light Vision 30976 images (15488 pairs) 24 dark scenes, 2 daytime scenes Support for image-to-image translation (visible to infrared, or infrared to visible), visible and infrared image fusion, low-light pedestrian detection, and infrared pedestrian detection (The original image and video pairs (before registration) of LLVIP are also released!)

49 PAPERS • 6 BENCHMARKS

WoodScape

Fisheye cameras are commonly employed for obtaining a large field of view in surveillance, augmented reality and in particular automotive applications. In spite of its prevalence, there are few public datasets for detailed evaluation of computer vision algorithms on fisheye images. WoodScape is an extensive fisheye automotive dataset named after Robert Wood who invented the fisheye camera in 1906. WoodScape comprises of four surround view cameras and nine tasks including segmentation, depth estimation, 3D bounding box detection and soiling detection. Semantic annotation of 40 classes at the instance level is provided for over 10,000 images and annotation for other tasks are provided for over 100,000 images.

49 PAPERS • 1 BENCHMARK

Synscapes

Synscapes is a synthetic dataset for street scene parsing created using photorealistic rendering techniques, and show state-of-the-art results for training and validation as well as new types of analysis.

42 PAPERS • 1 BENCHMARK

COCO-O

COCO-O(ut-of-distribution) contains 6 domains (sketch, cartoon, painting, weather, handmake, tattoo) of COCO objects which are hard to be detected by most existing detectors. The dataset has a total of 6,782 images and 26,624 labelled bounding boxes.

41 PAPERS • 1 BENCHMARK

ExDark (Exclusively Dark Image Dataset)

The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e 10 different conditions) with 12 object classes (similar to PASCAL VOC) annotated on both image class level and local object bounding boxes.

41 PAPERS • 1 BENCHMARK

A2D (Actor-Action Dataset)

A2D (Actor-Action Dataset) is a dataset for simultaneously inferring actors and actions in videos. A2D has seven actor classes (adult, baby, ball, bird, car, cat, and dog) and eight action classes (climb, crawl, eat, fly, jump, roll, run, and walk) not including the no-action class, which we also consider. The A2D has 3,782 videos with at least 99 instances per valid actor-action tuple and videos are labeled with both pixel-level actors and actions for sampled frames. The A2D dataset serves as a large-scale testbed for various vision problems: video-level single- and multiple-label actor-action recognition, instance-level object segmentation/co-segmentation, as well as pixel-level actor-action semantic segmentation to name a few.

40 PAPERS • 1 BENCHMARK

Clipart1k

In Clipart1k, the target domain classes to be detected are the same as those in the source domain. All the images for a clipart domain were collected from one dataset (i.e., CMPlaces) and two image search engines (i.e., Openclipart2 and Pixabay3). Search queries used are 205 scene classes (e.g., pasture) used in CMPlaces to collect various objects and scenes with complex backgrounds.

39 PAPERS • NO BENCHMARKS YET

ScanRefer Dataset

Contains 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.

37 PAPERS • NO BENCHMARKS YET

A*3D

The A*3D dataset is a step forward to make autonomous driving safer for pedestrians and the public in the real world. Characteristics: * 230K human-labeled 3D object annotations in 39,179 LiDAR point cloud frames and corresponding frontal-facing RGB images. * Captured at different times (day, night) and weathers (sun, cloud, rain).

34 PAPERS • NO BENCHMARKS YET

Watercolor2k

Watercolor2k is a dataset used for cross-domain object detection which contains 2k watercolor images with image and instance-level annotations.

34 PAPERS • 4 BENCHMARKS