DDAD is an autonomous driving benchmark from the Toyota Research Institute (TRI) for long-range (up to 250 m), dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360-degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting. DDAD contains scenes from urban settings in the United States (San Francisco, Bay Area, Cambridge, Detroit, Ann Arbor) and Japan (Tokyo, Odaiba).
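Depth predictions on benchmarks like DDAD are commonly scored with the standard monocular depth metrics against the LiDAR ground truth. The snippet below is a generic sketch of those metrics using the 250 m range cap from the description; it is an illustration, not TRI's official evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=250.0):
    """Standard monocular depth metrics (abs rel, RMSE, delta < 1.25),
    computed only where the LiDAR ground truth is valid."""
    mask = (gt > 0) & (gt <= max_depth)   # keep pixels with a valid LiDAR return
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta1}
```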
63 PAPERS • 1 BENCHMARK
The Shifts Dataset is a dataset for evaluating uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, "in-the-wild" distributional shifts and pose interesting challenges with respect to uncertainty estimation.
52 PAPERS • 1 BENCHMARK
Lost and Found is a lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of obstacles and free space. It is accompanied by a thorough comparison to several stereo-based baseline methods and was released to the community to foster further research on this topic.
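Since the annotations are pixelwise obstacle/free-space masks, a natural way to score a method is per-class intersection-over-union. The sketch below assumes a hypothetical label encoding (0 = free space, 1 = obstacle); it is not the dataset's official evaluation protocol.

```python
import numpy as np

def class_iou(pred, gt, cls):
    """Intersection-over-union of a single class over pixelwise label maps."""
    p, g = (pred == cls), (gt == cls)
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else float("nan")

# Hypothetical label encoding: 0 = free space, 1 = obstacle (lost cargo).
# obstacle_iou = class_iou(pred_map, gt_map, cls=1)
```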
48 PAPERS • 1 BENCHMARK
The Talk2Car dataset sits at the intersection of several research domains, promoting the development of cross-disciplinary solutions for improving the state of the art in grounding natural language into visual space. The annotations were gathered with the following aspects in mind: free-form, high-quality natural language commands that stimulate the development of solutions able to operate in the wild, and a realistic task setting. Specifically, the authors consider an autonomous driving setting in which a passenger can control the actions of an Autonomous Vehicle by giving commands in natural language. The Talk2Car dataset was built on top of the nuScenes dataset to include an extensive suite of sensor modalities, i.e. semantic maps, GPS, LIDAR, RADAR and 360-degree RGB images annotated with 3D bounding boxes. This variety of input modalities sets the object referral task on the Talk2Car dataset apart from related challenges, where additional sensor modalities are generally missing.
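At its core, each Talk2Car sample pairs a free-form command with the 3D box of the nuScenes object it refers to. The dataclass below is a minimal, hypothetical sketch of such a record; the field names are assumptions, not the official Talk2Car or nuScenes devkit schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ReferralSample:
    """One object-referral example: a free-form command paired with the
    3D box of the nuScenes object it refers to (illustrative fields only)."""
    command: str                              # e.g. "pull up behind that parked truck"
    scene_token: str                          # identifier of the nuScenes scene
    box_center: Tuple[float, float, float]    # x, y, z of the referred object
    box_size: Tuple[float, float, float]      # width, length, height
    box_yaw: float                            # heading angle of the referred object
```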
39 PAPERS • 1 BENCHMARK
Berkeley Deep Drive-X (eXplanation) is a dataset composed of over 77 hours of driving across 6,970 videos. The videos are taken in diverse driving conditions, e.g. day/night, highway/city/countryside, summer/winter, etc. Each video is around 40 seconds long and contains 3-4 actions, e.g. speeding up, slowing down, turning right, etc., all of which are annotated with a description and an explanation. The dataset contains over 26K activities in over 8.4M frames.
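Each annotated action therefore carries both a description of what the vehicle does and a justification of why. A minimal, hypothetical record for one such segment might look as follows; the field names are assumptions, not the released annotation format.

```python
from dataclasses import dataclass

@dataclass
class ActionAnnotation:
    """One annotated driving action: a video segment with a textual
    description and an explanation (illustrative fields only)."""
    video_id: str
    start_s: float      # start of the action segment, in seconds
    end_s: float        # end of the action segment, in seconds
    description: str    # e.g. "the car slows down"
    explanation: str    # e.g. "because the light ahead has turned red"
```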
35 PAPERS • NO BENCHMARKS YET
The Argoverse 2 Motion Forecasting Dataset is a curated collection of 250,000 scenarios for training and validation. Each scenario is 11 seconds long and contains the 2D bird's-eye-view centroid and heading of each tracked object, sampled at 10 Hz.
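As a rough mental model of the data layout (assumed for illustration, not the official Argoverse 2 API), each tracked object contributes an 11 s x 10 Hz = 110-step trajectory of BEV position and heading:

```python
import numpy as np

# At 10 Hz, an 11-second scenario yields 110 samples per tracked object.
NUM_STEPS = 11 * 10

# Illustrative per-actor track layout: one row per timestep holding the
# bird's-eye-view centroid and heading.
actor_track = np.zeros((NUM_STEPS, 3))   # columns: x [m], y [m], heading [rad]
```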
18 PAPERS • NO BENCHMARKS YET
A dataset aimed at detecting small road obstacles, in the spirit of Lost and Found.
7 PAPERS • 2 BENCHMARKS
ELAS is a dataset for lane detection. It contains more than 20 different scenes (in more than 15,000 frames) and considers a variety of scenarios (urban road, highways, traffic, shadows, etc.). The dataset was manually annotated for several events that are of interest for the research community (i.e., lane estimation, change, and centering; road markings; intersections; LMTs; crosswalks and adjacent lanes).
4 PAPERS • NO BENCHMARKS YET
These images were generated using Blender and the IEE-Simulator with different head poses, and are labelled according to nine classes (straight, turned bottom-left, turned left, turned top-left, turned bottom-right, turned right, turned top-right, reclined, looking up). The dataset contains 16,013 training images and 2,825 testing images, in addition to 4,700 images for improvements.
These images were generated using the UnityEyes simulator, after including essential eyeball physiology elements and modeling binocular vision dynamics. The images are annotated with head pose and gaze direction information, as well as 2D and 3D landmarks of the eye's most important features. Additionally, the images are distributed into two classes denoting the status of the eye (Open for open eyes, Closed for closed eyes). This dataset was used to train a DNN model for detecting the drowsiness status of a driver. The dataset contains 1,704 training images, 4,232 testing images and an additional 4,103 images for improvements.
These images were generated using the UnityEyes simulator, after including essential eyeball physiology elements and modeling binocular vision dynamics. The images are annotated with head pose and gaze direction information, as well as 2D and 3D landmarks of the eye's most important features. Additionally, the images are distributed into eight classes denoting the gaze direction of a driver's eyes (TopLeft, TopRight, TopCenter, MiddleLeft, MiddleRight, BottomLeft, BottomRight, BottomCenter). This dataset was used to train a DNN model for estimating the gaze direction. The dataset contains 61,063 training images, 132,630 testing images and an additional 72,000 images for improvement.
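For reference, the class sets of the three synthetic driver-monitoring datasets above can be written out as plain label lists. This is a sketch copied from the descriptions, not an official label map shipped with the data.

```python
# Label sets as listed in the three dataset descriptions above.
HEAD_POSE_CLASSES = [
    "straight", "turned bottom-left", "turned left", "turned top-left",
    "turned bottom-right", "turned right", "turned top-right",
    "reclined", "looking up",
]
EYE_STATE_CLASSES = ["Open", "Closed"]
GAZE_CLASSES = [
    "TopLeft", "TopRight", "TopCenter",
    "MiddleLeft", "MiddleRight",
    "BottomLeft", "BottomRight", "BottomCenter",
]
```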
3 PAPERS • NO BENCHMARKS YET
A large-scale and accurate dataset for vision-based railway traffic light detection and recognition. The recordings were made on selected running trains in France and benefit from carefully hand-labeled annotations.
2 PAPERS • NO BENCHMARKS YET
Studying how human drivers react differently when following autonomous vehicles (AVs) vs. human-driven vehicles (HVs) is critical for mixed traffic flow. This dataset contains two categories of car-following data, HV-following-AV (H-A) and HV-following-HV (H-H), extracted and enhanced from the open Lyft Level 5 dataset.
1 PAPER • NO BENCHMARKS YET
The SEmantic Salient Instance Video (SESIV) dataset was obtained by augmenting the DAVIS-2017 benchmark dataset with semantic ground truth for salient instance labels. The SESIV dataset consists of 84 high-quality video sequences with pixel-wise, per-frame ground-truth labels.
The TUT Sound Events 2018 dataset consists of real-life first-order Ambisonic (FOA) format recordings with stationary point sources, each associated with a spatial coordinate. The dataset was generated by collecting impulse responses (IRs) from a real environment using the Eigenmike spherical microphone array. The measurement was done by slowly moving a Genelec G Two loudspeaker, continuously playing a maximum-length sequence, around the array in a circular trajectory at one elevation at a time. The playback volume was set to be 30 dB greater than the ambient sound level. The recording was done in a corridor inside the university, with classrooms around it, during work hours. The IRs were collected at elevations from −40° to 40° in 10-degree increments at 1 m from the Eigenmike, and at elevations from −20° to 20° in 10-degree increments at 2 m.
0 PAPERS • NO BENCHMARKS YET