Human-Parts is a dataset for human body, face, and hand detection comprising ~15k images and ~106k annotations, with multiple annotations per image.
2 PAPERS • NO BENCHMARKS YET
The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, amounting to around 1 million images and video frames in total.
11 PAPERS • 2 BENCHMARKS
IIIT-AR-13K was created by manually annotating bounding boxes of graphical page objects in publicly available annual reports. It contains a total of 13k annotated page images with objects in five popular categories: table, figure, natural image, logo, and signature. It is the largest manually annotated dataset for graphical object detection.
6 PAPERS • NO BENCHMARKS YET
IP102 contains more than 75,000 images belonging to 102 categories, which exhibit a natural long-tailed distribution.
17 PAPERS • 1 BENCHMARK
This dataset is an extremely challenging set of over 5,000 original Indian food images captured and crowdsourced from over 800 urban and rural areas; each image is manually reviewed and verified by computer vision professionals at ****DC Labs.
0 PAPER • NO BENCHMARKS YET
InteriorNet is an RGB-D dataset for large-scale interior scene understanding and mapping. The dataset contains 20M images created by a rendering pipeline.
26 PAPERS • NO BENCHMARKS YET
JTA is a dataset for people tracking in urban scenarios, created by exploiting a photorealistic videogame. It is currently the largest dataset of human body parts for people tracking in urban scenarios, with about 500,000 frames and almost 10 million body poses.
32 PAPERS • 1 BENCHMARK
Kitchen Scenes is a multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments, including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled, and objects are annotated with bounding boxes in both the images and the 3D point cloud.
5 PAPERS • 1 BENCHMARK
The Kvasir-Capsule dataset is the largest publicly released video capsule endoscopy (VCE) dataset. In total, it contains 47,238 labeled images and 117 videos capturing anatomical landmarks and pathological and normal findings, amounting to more than 4,741,621 images and video frames altogether.
Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist.
147 PAPERS • 3 BENCHMARKS
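Polyp segmentation datasets like Kvasir-SEG are commonly evaluated with overlap metrics such as the Dice coefficient between a predicted mask and the ground-truth mask. A minimal sketch using NumPy, with toy 4×4 masks standing in for real annotations:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection) / (pred.sum() + gt.sum() + eps)

# Toy masks: the prediction overlaps the ground truth on 2 of 4 foreground pixels.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1            # 4 foreground pixels
pred = np.zeros((4, 4), dtype=np.uint8)
pred[2:4, 1:3] = 1          # 4 foreground pixels, 2 overlapping
print(round(dice_coefficient(pred, gt), 3))  # → 0.5
```

In practice the masks would be loaded from the dataset's image/mask file pairs rather than constructed by hand.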
A large-scale logo image database for logo detection and brand recognition from real-world product images.
3 PAPERS • NO BENCHMARKS YET
LVIS is a dataset for long tail instance segmentation. It has annotations for over 1000 object categories in 164k images.
448 PAPERS • 14 BENCHMARKS
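The long tail in LVIS is typically characterized by binning categories by the number of training images they appear in; the LVIS paper uses rare (≤10 images), common (11–100), and frequent (>100) bins. A sketch with hypothetical per-category counts (the category names and numbers below are illustrative, not real LVIS statistics):

```python
def frequency_group(image_count: int) -> str:
    """Assign a category to an LVIS-style frequency bin by training-image count."""
    if image_count <= 10:
        return "rare"
    if image_count <= 100:
        return "common"
    return "frequent"

# Hypothetical per-category image counts.
counts = {"unicycle": 4, "birdbath": 37, "person": 5000}
groups = {cat: frequency_group(n) for cat, n in counts.items()}
print(groups)  # {'unicycle': 'rare', 'birdbath': 'common', 'person': 'frequent'}
```

Per-bin average precision (APr, APc, APf) is then reported separately to expose how detectors degrade on the tail.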
A small-scale training set containing only 4K images.
1 PAPER • NO BENCHMARKS YET
METU-ALET is an image dataset for the detection of tools in the wild. It has annotations for tools belonging to categories such as farming, gardening, office, stonemasonry, vehicle, woodworking, and workshop. The images contain a total of 22,841 bounding boxes across 49 different tool categories.
MJU-Waste is an RGBD waste object segmentation dataset that is made public to facilitate future research in this area.
4 PAPERS • 1 BENCHMARK
A large-scale video dataset for MOR in aerial videos.
60 PAPERS • NO BENCHMARKS YET
This dataset is an extremely challenging set of over 7,000 original mask images captured and crowdsourced from over 1,200 urban and rural areas; each image is manually reviewed and verified by computer vision professionals at ****DC Labs.
MinneApple is a benchmark dataset for apple detection and segmentation. The fruits are labelled using polygonal masks for each object instance to aid in precise object detection, localization, and segmentation. Additionally, the dataset also contains data for patch-based counting of clustered fruits. The dataset contains over 41, 000 annotated object instances in 1000 images.
15 PAPERS • NO BENCHMARKS YET
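Instance annotations stored as polygons, as in MinneApple, can be turned into per-instance areas (e.g. to filter tiny fruit masks) with the shoelace formula. A self-contained sketch, using a hand-made polygon rather than real MinneApple annotations:

```python
def polygon_area(points) -> float:
    """Area of a simple polygon given as a list of (x, y) vertices (shoelace formula)."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# A 2x3 axis-aligned rectangle has area 6.
print(polygon_area([(0, 0), (2, 0), (2, 3), (0, 3)]))  # → 6.0
```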
This dataset is collected by DataCluster Labs, India. To download the full dataset or to submit a request for new data collection, please email: sales@datacluster.ai. This dataset is an extremely challenging set of over 3,000 original mobile phone images captured and crowdsourced from over 1,000 urban and rural areas; each image is manually reviewed and verified by computer vision professionals at ****DC Labs.
ModaNet is a street fashion images dataset consisting of annotations related to RGB images. ModaNet provides multiple polygon annotations for each image. Each polygon is associated with a label from 13 meta fashion categories. The annotations are based on images in the PaperDoll image set, which has only a few hundred images annotated by the superpixel-based tool.
19 PAPERS • 1 BENCHMARK
A dataset of all Moroccan money
The nocaps benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets.
144 PAPERS • 13 BENCHMARKS
Objects365 is a large-scale object detection dataset with 365 object categories across over 600K training images. More than 10 million high-quality bounding boxes were manually labeled through a carefully designed three-step annotation pipeline. It is the largest fully annotated object detection dataset so far and establishes a more challenging benchmark for the community.
137 PAPERS • 3 BENCHMARKS
A realistic, diverse, and challenging dataset for object detection on images. The data was recorded at a beer tent in Germany and consists of 15 different categories of food and drink items.
2 PAPERS • 1 BENCHMARK
This dataset is an extremely challenging set of over 2,000 original oximeter images captured and crowdsourced from over 300 urban and rural areas; each image is manually reviewed and verified by computer vision professionals at Datacluster Labs.
PASCAL VOC 2007 is a dataset for image recognition. The twenty object classes that have been selected are: person; bird, cat, cow, dog, horse, sheep; aeroplane, bicycle, boat, bus, car, motorbike, train; bottle, chair, dining table, potted plant, sofa, tv/monitor.
119 PAPERS • 14 BENCHMARKS
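VOC-style detection evaluation matches predicted boxes to ground truth by intersection-over-union; the VOC protocol counts a detection as correct when IoU ≥ 0.5. A minimal IoU sketch for boxes in (xmin, ymin, xmax, ymax) format:

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    iw = max(0.0, ix2 - ix1)          # clamp to zero when boxes don't overlap
    ih = max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping in a 5x5 region: 25 / (100 + 100 - 25) ≈ 0.143.
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))  # → 0.143
```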
This dataset is a collection of spoken language instructions for a robotic system to pick and place common objects. Text instructions and corresponding object images are provided. The dataset consists of situations where the robot is instructed by the operator to pick up a specific object and move it to another location: for example, "Move the blue and white tissue box to the top right bin." This dataset consists of RGBD images, bounding box annotations, destination box annotations, and text instructions.
7 PAPERS • NO BENCHMARKS YET
An object detection dataset featuring people walking on grass, captured aboard a UAV. The dataset includes precise metadata about altitude, viewing angle, and more.
The PS-Battles dataset is gathered from a large community of image manipulation enthusiasts and provides a basis for media derivation and manipulation detection in the visual domain. The dataset consists of 102,028 images grouped into 11,142 subsets, each containing the original image as well as a varying number of manipulated derivatives.
A dataset of pedestrian traffic lights containing over 5000 photos taken at hundreds of intersections in Shanghai.
PlantDoc is a dataset for visual plant disease detection. The dataset contains 2,598 data points in total across 13 plant species and up to 17 classes of diseases, involving approximately 300 human hours of effort in annotating internet scraped images.
12 PAPERS • 1 BENCHMARK
Consists of over 50,000 frames and includes high-definition images with full-resolution depth information, semantic segmentation (images), point-wise segmentation (point clouds), and detailed annotations for all vehicles and people.
13 PAPERS • NO BENCHMARKS YET
PubLayNet is a dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated.
104 PAPERS • 1 BENCHMARK
RADIATE (RAdar Dataset In Adverse weaThEr) is a new automotive dataset created by Heriot-Watt University that includes radar, lidar, stereo camera, and GPS/IMU data. The data is collected in different weather scenarios (sunny, overcast, night, fog, rain, and snow) to help the research community develop new methods of vehicle perception. The radar images are annotated in 7 different scenarios: Sunny (Parked), Sunny/Overcast (Urban), Overcast (Motorway), Night (Motorway), Rain (Suburban), Fog (Suburban), and Snow (Suburban). The dataset contains 8 different types of objects (car, van, truck, bus, motorbike, bicycle, pedestrian, and group of pedestrians).
19 PAPERS • 2 BENCHMARKS
The RIT-18 dataset was built for the semantic segmentation of remote sensing imagery. It was collected with the Tetracam Micro-MCA6 multispectral imaging sensor flown on-board a DJI-1000 octocopter.
8 PAPERS • NO BENCHMARKS YET
RPC is a large-scale retail product checkout dataset and collects 200 retail SKUs. The collected SKUs can be divided into 17 meta categories, i.e., puffed food, dried fruit, dried food, instant drink, instant noodles, dessert, drink, alcohol, milk, canned food, chocolate, gum, candy, seasoner, personal hygiene, tissue, stationery.
ReDWeb-S is a large-scale, challenging dataset for salient object detection. It contains 3,179 images in total, with various real-world scenes and high-quality depth maps. The dataset is split into a training set of 2,179 RGB-D image pairs and a testing set with the remaining 1,000 pairs.
A dataset to encourage the community to adapt oriented bounding box (OBB) detectors for more complex environments.
S2TLD is a traffic light dataset containing 5,786 images of approximately 1,080 × 1,920 pixels and 720 × 1,280 pixels. It covers 5 categories (red, yellow, green, off, and wait-on) across 14,130 instances. The scenes cover a decent variety of typical road conditions:
* Busy inner-city street scenes
* Dense stop-and-go traffic
* Strong changes in illumination/exposure
* Flickering/fluctuating traffic lights
* Multiple visible traffic lights
* Image parts that can be confused with traffic lights (e.g. large round tail lights)
SESYD ("Systems Evaluation SYnthetic Documents") is a database of synthetic documents with ground truth. It targets two main research problems in the document image analysis field: (i) symbol recognition and spotting in line-drawing images (floorplans and electrical diagrams), and (ii) character segmentation and recognition in geographical maps. The database is composed of eleven collections for performance evaluation, containing 284k images, 190k symbols, and 284k characters. Published in 2010, SESYD is today a key database in the document image analysis field, cited by around one hundred research papers.
SIDOD is a new, publicly-available image dataset generated by the NVIDIA Deep Learning Data Synthesizer intended for use in object detection, pose estimation, and tracking applications. This dataset contains 144k stereo image pairs that synthetically combine 18 camera viewpoints of three photorealistic virtual environments with up to 10 objects (chosen randomly from the 21 object models of the YCB dataset) and flying distractors.
The SIXray dataset was constructed by the Pattern Recognition and Intelligent System Development Laboratory, University of Chinese Academy of Sciences. It contains 1,059,231 X-ray images collected from several subway stations. There are six common categories of prohibited items: gun, knife, wrench, pliers, scissors, and hammer. It has three subsets called SIXray10, SIXray100, and SIXray1000. Image-level annotations are provided by human security inspectors for the whole dataset. In addition, the images in the test set are annotated with a bounding box for each prohibited item to evaluate object localization performance.
31 PAPERS • 1 BENCHMARK
SKU110K-R is a dataset relabeled with oriented bounding boxes based on SKU110K. It is focused on evaluating oriented and densely packed object detection.
8 PAPERS • 1 BENCHMARK
SPair-71k contains 70,958 image pairs with diverse variations in viewpoint and scale. Compared to previous datasets, it is significantly larger in number and contains more accurate and richer annotations.
58 PAPERS • 2 BENCHMARKS
Specially designed to evaluate active learning for video object detection in road scenes.
4 PAPERS • NO BENCHMARKS YET
A salient object subitizing image dataset of about 14K everyday images which are annotated using an online crowdsourcing marketplace.
Contains 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.
38 PAPERS • NO BENCHMARKS YET