BDD-A (Berkeley DeepDrive Attention)

Introduced by Xia et al. in Predicting Driver Attention in Critical Situations

Dataset Statistics: The statistics of our dataset are summarized and compared with the largest existing dataset, DR(eye)VE [1], in Table 1. Our dataset was collected using videos selected from BDD100K [30, 31], a publicly available, large-scale, crowd-sourced driving video dataset. BDD100K contains human-demonstrated dashboard videos and time-stamped sensor measurements collected during urban driving in various weather and lighting conditions. To efficiently collect attention data for critical driving situations, we specifically selected video clips that both included braking events and took place in busy areas (see supplementary materials for technical details). We then trimmed the videos to include 6.5 seconds before and 3.5 seconds after each braking event. Other driving actions, e.g., turning, lane switching, and accelerating, were also captured as a result. In total, 1,232 videos (3.5 hours) were collected following this procedure. Some example images from our dataset are shown in Fig. 6. The selected videos contain a large number of different road users. We detected the objects in our videos using YOLO [22]. On average, each video frame contained 4.4 cars and 0.3 pedestrians, several times more than in the DR(eye)VE dataset (Table 1).

Data Collection Procedure: For our eye-tracking experiment, we recruited 45 participants, each with more than one year of driving experience. The participants watched the selected driving videos in the lab while performing a driving-instructor task: they were asked to imagine that they were driving instructors sitting in the copilot seat and to press the space key whenever they felt it necessary to correct or warn the student driver of potential dangers. Their eye movements during the task were recorded at 1000 Hz with an EyeLink 1000 desktop-mounted infrared eye tracker, used in conjunction with the EyeLink Toolbox scripts [7] for MATLAB. Each participant completed the task for 200 driving videos, and each driving video was viewed by at least 4 participants. The gaze patterns of these independent participants were aggregated and smoothed to produce an attention map for each frame of the stimulus video (see Fig. 6 and supplementary materials for technical details). Psychological studies [19, 11] have shown that when multiple visual cues simultaneously demand attention, the order in which humans look at those cues is highly subjective. By aggregating the gazes of independent observers, we could therefore record multiple important visual cues in a single frame. In addition, it has been shown that human drivers look at buildings, trees, flowerbeds, and other unimportant objects with non-negligible frequency [1]. Presumably, these eye movements should be regarded as noise for driving-related machine learning purposes. By averaging the eye movements of independent observers, we were able to effectively wash out those sources of noise (see Fig. 2B).
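The per-frame attention maps are produced by pooling gaze from all observers of a frame and spatially smoothing the result. A minimal sketch of that aggregation step is given below; the frame size, Gaussian bandwidth, and input format are illustrative assumptions, not the exact parameters used for BDD-A.

```python
# Sketch: aggregate gaze points from several independent observers into a
# smoothed attention map for one video frame. Frame size and sigma are
# illustrative assumptions, not the dataset's exact parameters.
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(gaze_points, frame_hw=(720, 1280), sigma_px=30.0):
    """gaze_points: iterable of (x, y) pixel fixation locations pooled over
    all observers who viewed this frame."""
    h, w = frame_hw
    fixations = np.zeros((h, w), dtype=np.float32)
    for x, y in gaze_points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h:
            fixations[yi, xi] += 1.0                        # pool gaze across observers
    smoothed = gaussian_filter(fixations, sigma=sigma_px)   # spatial smoothing
    total = smoothed.sum()
    return smoothed / total if total > 0 else smoothed      # normalize to a distribution
```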
Comparison with In-Car Attention Data: We also collected in-lab driver attention data using videos from the DR(eye)VE dataset, which allowed us to compare in-lab and in-car attention maps for the same videos. The DR(eye)VE videos we used were 200 randomly selected 10-second clips, half containing braking events and half without. We tested how well in-car and in-lab attention maps highlighted driving-relevant objects, again using YOLO [22] to detect the objects in the videos. We identified three object categories that are important for driving and had sufficient instances in the videos (car, pedestrian, and cyclist). For each category, we calculated the proportion of attended objects out of the total detected instances, for both in-lab and in-car attention maps (see supplementary materials for technical details; a schematic version of this computation is sketched below). The results showed that in-car attention maps highlighted significantly fewer driving-relevant objects than in-lab attention maps (see Fig. 2A). This difference may be due to the fact that eye movements collected from a single driver do not completely indicate all the objects that demand attention in a particular driving situation: one individual's eye movements are only an approximation of their attention [23], and humans can also track objects with covert attention without looking at them [6]. The difference may also reflect the distinction between first-person and third-person driver attention. It is possible that the human observers in our in-lab eye-tracking experiment also looked at objects that were not relevant for driving; we ran a human evaluation experiment to address this concern.

Human Evaluation: To verify that our in-lab driver attention maps highlight regions that should indeed demand drivers' attention, we conducted an online study in which humans compared in-lab and in-car driver attention maps. In each trial, participants watched one driving video clip three times: first without any overlay, and then two more times, in random order, with the in-lab and in-car attention maps overlaid, respectively. The participant was then asked to choose which heatmap-coded video was more similar to where a good driver would look. In total, we collected 736 trials from 32 online participants. Our in-lab attention maps were preferred more often than the in-car attention maps (71% versus 29% of all trials, statistically significant with p = 1×10^-29, see Table 2). Although this result does not show that in-lab driver attention maps are superior to in-car attention maps in general, it does show that the driver attention maps collected with our protocol represent where a good driver should look from a third-person perspective. In addition, we will show in the Experiments section that in-lab attention data collected with our protocol can be used to train a model that effectively predicts actual, in-car driver attention. This result shows that our dataset can also serve as a substitute for in-car driver attention data, especially in critical situations where in-car data collection is not practical. To summarize, compared with driver attention data collected in-car, our dataset has three clear advantages: it is multi-focus, contains little driving-irrelevant noise, and is efficiently tailored to critical driving situations.
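The attended-object comparison above reduces to a per-category ratio: an object counts as attended if the attention map places enough weight inside its detected bounding box. The sketch below illustrates one way to compute this; the decision rule (peak attention inside the box relative to the frame-wide peak) and the threshold value are assumptions for illustration, not the paper's exact criterion.

```python
# Sketch of the attended-object proportion metric described above.
# The "attended" rule and threshold are illustrative assumptions.
def attended_proportion(attention_map, detections, threshold=0.5):
    """attention_map: 2-D non-negative array (one frame's attention map).
    detections: list of (category, (x1, y1, x2, y2)) boxes, e.g. from YOLO.
    Returns {category: fraction of detected instances counted as attended}."""
    peak = float(attention_map.max())
    counts, attended = {}, {}
    h, w = attention_map.shape
    for category, (x1, y1, x2, y2) in detections:
        counts[category] = counts.get(category, 0) + 1
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(x2), w), min(int(y2), h)
        region = attention_map[y1:y2, x1:x2]
        # Count the object as attended if the peak attention inside its box
        # reaches a fraction of the frame-wide peak (illustrative rule).
        if peak > 0 and region.size and region.max() >= threshold * peak:
            attended[category] = attended.get(category, 0) + 1
    return {c: attended.get(c, 0) / counts[c] for c in counts}
```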
