A large-scale collection of dashcam videos recorded by vehicles on DiDi's platform. D2-City contains more than 10,000 video clips that reflect the diversity and complexity of real-world traffic scenarios in China.
1 PAPER • NO BENCHMARKS YET
An object detection dataset created with the DeepGTAV engine, which is based on the video game GTAV. One of the three datasets proposed in the paper; it is modeled on the Cattle dataset with almost the same classes.
An object detection dataset created with the DeepGTAV engine, which is based on the video game GTAV. One of the three datasets proposed in the paper; it is modeled on the SeaDronesSee dataset with almost the same classes.
An object detection dataset created with the DeepGTAV engine, which is based on the video game GTAV. One of the three datasets proposed in the paper; it is modeled on the VisDrone dataset with almost the same classes.
1 PAPER • 1 BENCHMARK
About the Dataset: 4 classes of drinking waste: Aluminium Cans, Glass Bottles, PET (plastic) Bottles and HDPE (plastic) Milk Bottles.
rawimgs - images of the 4 waste classes
YOLO_imgs - images of the 4 waste classes with corresponding txt files (annotations for the YOLO framework)
labels.txt - labels of the classes
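The txt files in YOLO_imgs follow the YOLO annotation convention. As a minimal sketch (assuming the standard one-object-per-line layout `class cx cy w h`, with center coordinates and sizes normalized to [0, 1] — the description does not spell this out), a label line can be converted to pixel coordinates like this:

```python
def yolo_to_pixel_bbox(line, img_w, img_h):
    """Convert one YOLO label line 'cls cx cy w h' (normalized center/size)
    into a class id and a (x_min, y_min, width, height) pixel box."""
    cls, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h      # center in pixels
    w, h = float(w) * img_w, float(h) * img_h          # size in pixels
    x_min = cx - w / 2
    y_min = cy - h / 2
    return int(cls), (round(x_min), round(y_min), round(w), round(h))
```

For a 400×200 image, `yolo_to_pixel_bbox("0 0.5 0.5 0.25 0.5", 400, 200)` yields class 0 with the box `(150, 50, 100, 100)`.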
For the Drone-vs-Bird Detection Challenge 2021, 77 different video sequences have been made available as training data. These video sequences originate from the previous installment of the challenge and were collected using MPEG4-coded static cameras by the SafeShore project, by the Fraunhofer IOSB research institute, and by the ALADDIN2 project. On average, the video sequences consist of 1,384 frames, and each frame contains 1.12 annotated drones. The video sequences are recorded with both static and moving cameras, and the resolution varies between 720×576 and 3840×2160 pixels. In total, 8 different types of drones exist in the dataset, i.e. 3 fixed-wing and 5 rotary-wing ones. For each video, a separate annotation file is provided, which contains the frame number and the bounding box (expressed as [topx topy width height]) for the frames in which drones enter the scene.
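A minimal sketch of reading such an annotation line, assuming each line lists the frame number followed by a single `[topx topy width height]` box (the exact per-line layout, including how multiple drones per frame are encoded, is an assumption here, not specified by the description):

```python
def parse_annotation_line(line):
    """Parse an assumed 'frame topx topy width height' annotation line
    into a frame index and a (topx, topy, width, height) bounding box."""
    parts = line.split()
    frame = int(parts[0])
    topx, topy, width, height = map(int, parts[1:5])
    return frame, (topx, topy, width, height)
```

For example, `parse_annotation_line("17 100 50 40 30")` returns `(17, (100, 50, 40, 30))`.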
The EXPO-HD Dataset is a dataset of Expo whiteboard markers for the purpose of instance segmentation. The dataset contains two subsets, both of which include instance segmentation labels.
This is a detailed description of the dataset, a data sheet for the dataset as proposed by Gebru et al.
Fine-Grained Vehicle Detection (FGVD) is a dataset for fine-grained vehicle detection captured from a moving camera mounted on a car. The FGVD dataset is challenging as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions.
FSVOD-500 is a large-scale video dataset comprising 500 classes with class-balanced videos in each category for few-shot learning. FSVOD-500 is the first benchmark specially designed for few-shot video object detection, evaluating the performance of a given model on novel classes.
FractureAtlas is a musculoskeletal bone fracture dataset with annotations for deep learning tasks such as classification, localization, and segmentation. It contains a total of 4,083 X-ray images with annotations in COCO, VGG, YOLO, and Pascal VOC formats. The dataset is made freely available under a CC-BY 4.0 license: the data may be copied, shared, or redistributed in any medium or format, and may be adapted, remixed, transformed, and built upon. Note that using the dataset correctly requires knowledge of medicine and radiology to interpret the results and draw conclusions, and the possibility of labeling errors should be taken into account.
The GDIT Aerial Airport dataset consists of aerial images containing instances of parked airplanes. All plane types have been grouped into a single classification named "airplane".
data/images:
Understanding comprehensive assembly knowledge from videos is critical for a futuristic ultra-intelligent industry. To enable technological breakthroughs, we present HA-ViD, an assembly video dataset that features representative industrial assembly scenarios, a natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and fine-grained action annotations covering subject, action verb, manipulated object, target object, and tool. We provide 3,222 multi-view and multi-modality videos, 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps needed to comprehend knowledge of assembly progress and process efficiency.
The Hands Guns and Phones (HGP) dataset contains 2,199 images (1,989 for training and 210 for testing) of people using guns or phones in real-world scenarios (people making phone reviews, shooting drills, or making calls). Every image in this dataset is labeled with bounding boxes for hands, phones, and guns. All the aforementioned images were collected from YouTube videos and have different sizes.
The Instance Segmentation task, an extension of the well-known Object Detection task, is of great help in many areas, such as precision agriculture: the ability to automatically identify plant organs and the diseases possibly associated with them makes it possible to effectively scale and automate crop monitoring and disease control.
1 PAPER • 2 BENCHMARKS
This data set contains 775 video sequences, captured in the wildlife park Lindenthal (Cologne, Germany) as part of the AMMOD project, using an Intel RealSense D435 stereo camera. In addition to color and infrared images, the D435 is able to infer the distance (or “depth”) to objects in the scene using stereo vision. Observed animals include various birds (at daytime) and mammals such as deer, goats, sheep, donkeys, and foxes (primarily at nighttime). A subset of 412 images is annotated with a total of 1038 individual animal annotations, including instance masks, bounding boxes, class labels, and corresponding track IDs to identify the same individual over the entire video.
Loucount is a retail object detection and counting dataset with rich annotations collected in retail stores; it consists of 50,394 images with more than 1.9 million object instances across 140 categories.
A small-scale training set, which only contains 4K images.
METU-ALET is an image dataset for the detection of tools in the wild. The dataset has annotations for tools belonging to categories such as farming, gardening, office, stonemasonry, vehicle, woodworking, and workshop. The images in the dataset contain a total of 22,841 bounding boxes over 49 different tool categories.
The Minor Irrigation Structures Check-Dam Dataset is a public dataset annotated by domain experts, using images from Google Static Maps, for instance segmentation and object detection tasks.
MlGesture is a dataset for hand gesture recognition tasks, recorded in a car with 5 different sensor types at two different viewpoints. The dataset contains over 1300 hand gesture videos from 24 participants and features 9 different hand gesture symbols. One sensor cluster with five different cameras is mounted in front of the driver in the center of the dashboard. A second sensor cluster is mounted on the ceiling looking straight down.
The Marine Microalgae Detection in Microscopy Images dataset contains a total of 937 images, and all the objects in these images were annotated, for a total of 4,201 annotated objects. The training set contains 537 images and the testing set contains 430 images.
MuCeD is a dataset that has been carefully curated and validated by expert pathologists from the All India Institute of Medical Sciences (AIIMS), Delhi, India. The H&E-stained histopathology images of the human duodenum in MuCeD were captured through an Olympus BX50 microscope at 20x zoom using a DP26 camera, with each image being 1920x2148 in dimension. The dataset has 55 images, with bounding boxes for 2,090 IELs and 6,518 ENs annotated using the LabelMe software and further validated by multiple pathologists. These cells are selected from the epithelial area, a region of interest that has been explicitly segmented by experts. The epithelial area denotes the area of continuous villi and is used for cell detection, whereas the rest of the area is masked out. Further, each image is sliced into 9 subimages and each subimage is re-scaled to 640x640 before being given as input to object detection models. We divide the 55 images into five folds of 11 images each and report 5-fold cross-validation numbers.
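The slicing step above can be sketched as follows. `tile_boxes` is a hypothetical helper (not part of the dataset's tooling) that computes the 3x3 crop boxes for the 9 subimages; each crop would then be rescaled to 640x640 with an image library of choice:

```python
def tile_boxes(width, height, rows=3, cols=3):
    """Split an image of size (width, height) into rows*cols crop boxes
    given as (left, upper, right, lower), suitable for e.g. PIL's crop()."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            upper = r * height // rows
            right = (c + 1) * width // cols
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes
```

For the 1920x2148 MuCeD images, `tile_boxes(1920, 2148)` yields 9 boxes of 640x716 pixels each, starting with `(0, 0, 640, 716)`.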
Natural Adversarial Objects (NAO) is a new dataset to evaluate the robustness of object detection models. NAO contains 7,934 images and 9,943 objects that are unmodified and representative of real-world scenarios, but cause state-of-the-art detection models to misclassify with high confidence.
This is a high-quality large-scale Night Object Detection (NOD) dataset of outdoor images targeting low-light object detection. The dataset contains more than 7K images and 46K annotated objects (with bounding boxes) that belong to classes: person, bicycle, and car. The photos were taken on the streets at evening hours, and thus all images present low-light conditions to a varying degree of severity.
An object-centric version of Stylized COCO to benchmark texture bias and out-of-distribution robustness of vision models. See the ECCV 22 paper and supplementary material for details.
The PESMOD (PExels Small Moving Object Detection) dataset consists of high-resolution aerial images in which moving objects are labelled manually. It was created from videos selected from the Pexels website. The aim of this dataset is to provide a different and challenging dataset for evaluating moving object detection methods. Each moving object is labelled in every frame in PASCAL VOC format in an XML file. The dataset consists of 8 different video sequences.
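Since the labels follow the PASCAL VOC XML format, bounding boxes can be read with Python's standard library alone. A minimal sketch, relying only on the standard VOC field names (`object`, `name`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`):

```python
import xml.etree.ElementTree as ET

def voc_boxes(xml_text):
    """Extract (name, xmin, ymin, xmax, ymax) tuples from a
    PASCAL VOC annotation given as an XML string."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes
```

To read a file instead of a string, pass `ET.parse(path).getroot()` through the same loop.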
An object detection dataset featuring people walking on grass, captured aboard a UAV. The dataset includes precise metadata about altitude, viewing angle, and more.
A multimodal dataset of radio galaxies and their corresponding infrared hosts.
A dataset to encourage the community to adapt oriented bounding box (OBB) detectors for more complex environments.
The SeaDronesSee-Object Detection v2 (S-ODv2) dataset contains 14,227 RGB images (training: 8,930; validation: 1,547; testing: 3,750). The images are captured from various altitudes, ranging from 5 to 260 meters, and viewing angles (gimbal pitch) from 0° to 90°, with the respective metadata for altitude, viewing angle, and other properties provided for almost all frames.
SIDOD is a new, publicly-available image dataset generated by the NVIDIA Deep Learning Data Synthesizer intended for use in object detection, pose estimation, and tracking applications. This dataset contains 144k stereo image pairs that synthetically combine 18 camera viewpoints of three photorealistic virtual environments with up to 10 objects (chosen randomly from the 21 object models of the YCB dataset) and flying distractors.
This is a dataset to benchmark real-time embedded object detection models for RoboCup SSL (Small Size League).
STN PLAD is a high-resolution and real-world image dataset of multiple high-voltage power line components. It has 2,409 annotated objects divided into five classes: transmission tower, insulator, spacer, tower plate, and Stockbridge damper, which vary in size (resolution), orientation, illumination, angulation, and background.
A real-world image dataset that contains more than 900 images generated from 26 street cameras, with 7 object categories annotated with detailed bounding boxes. The data distribution is non-IID and unbalanced, reflecting characteristic real-world federated learning scenarios.
TAMPAR is a real-world dataset of parcel photos for tampering detection, with annotations in COCO format. For details, see the paper; for visual samples, see the project page. Features are:
The UAVVaste dataset consists, to date, of 772 images and 3,716 annotations. The main motivation for creating the dataset was the lack of domain-specific data: the datasets widely used for object detection benchmarking do not cover this domain. The dataset is made publicly available and is intended to be expanded.
UDA-CH contains 16 objects that cover a variety of artworks which can be found in a museum like sculptures, paintings and books. Specifically, the dataset has been collected inside the cultural site “Galleria Regionale di Palazzo Bellomo” located in Siracusa, Italy.
This is the first general Underwater Image Instance Segmentation (UIIS) dataset, containing 4,628 images across 7 categories with pixel-level annotations for the underwater instance segmentation task.
The Universal-Scale object detection Benchmark (USB) is a benchmark for object detection that has variations in object scales and image domains by incorporating COCO with the recently proposed Waymo Open Dataset and Manga109-s dataset. To enable fair comparison, USB establishes different protocols by defining multiple thresholds for training epochs and evaluation image resolutions.
Collects 60 reference sequences and 540 impaired sequences.
VizWiz-FewShot is a few-shot localization dataset originating from photographers who authentically were trying to learn about the visual content in the images they took. It includes nearly 10,000 segmentations of 100 categories in over 4,500 images that were taken by people with visual impairments.
The YouTube Gun Detection Dataset (YouTube-GDD) is collected from 343 high-definition YouTube videos and contains 5,000 well-chosen images, in which 16,064 instances of guns and 9,046 instances of persons are annotated. Compared to other datasets, YouTube-GDD is "dynamic", containing rich contextual information.
To encourage reproducible research, a labeled MultiRAW dataset containing >7k RAW images acquired using multiple camera sensors is made publicly accessible for RAW-domain processing.
This dataset is an extremely challenging set of over 8,000 original Fire and Smoke images captured and crowdsourced from over 1,200 urban and rural areas, where each image is manually reviewed and verified by computer vision professionals at Datacluster Labs.
0 PAPER • NO BENCHMARKS YET