BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

31 Mar 2022  ·  Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention in which each BEV query extracts spatial features from the regions of interest across camera views. For temporal information, we propose a temporal self-attention that recurrently fuses historical BEV information. Our approach achieves a new state-of-the-art 56.9% NDS on the nuScenes test set, which is 9.0 points higher than the previous best result and on par with LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
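
The data flow described in the abstract (grid-shaped BEV queries, a temporal step that reuses the previous BEV, and a spatial step that reads camera features) can be illustrated with a small NumPy sketch. This is not the authors' implementation: it swaps in plain dense scaled dot-product attention where the abstract describes attention restricted to regions of interest, it has no learned projections or ego-motion alignment, and the temporal-then-spatial ordering is an assumption of the sketch. The names `attention`, `bev_step`, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Plain dense scaled dot-product attention (a simplification for illustration).
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale)
    return weights @ values

def bev_step(bev_queries, prev_bev, cam_feats):
    # Temporal step: each query attends to the previous BEV (if any) and the current queries.
    if prev_bev is None:
        history = bev_queries
    else:
        history = np.concatenate([prev_bev, bev_queries], axis=0)
    q = bev_queries + attention(bev_queries, history, history)
    # Spatial step: each query attends to features from all cameras.
    # Assumption: every query sees every camera token, with no region-of-interest restriction.
    flat = cam_feats.reshape(-1, cam_feats.shape[-1])
    return q + attention(q, flat, flat)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8            # hypothetical BEV grid size and channel width
n_cams, n_tokens = 6, 10     # hypothetical camera count and tokens per camera

bev_queries = rng.standard_normal((H * W, C))  # one query per BEV grid cell
prev_bev = None
for t in range(3):           # three consecutive frames, fused recurrently
    cam_feats = rng.standard_normal((n_cams, n_tokens, C))
    prev_bev = bev_step(bev_queries, prev_bev, cam_feats)

print(prev_bev.shape)  # (16, 8)
```

The point of the sketch is the recurrence: the BEV output of one frame becomes the history input of the next, so information from earlier frames reaches later ones without storing every past frame.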

Results, grouped by Task / Dataset / Model. Each entry gives Metric Name: Metric Value (Global Rank).

3D Object Detection / DAIR-V2X-I / BEVFormer
    AP|R40(moderate): 50.7 (# 8)
    AP|R40(easy): 61.4 (# 8)
    AP|R40(hard): 50.7 (# 8)

Bird's-Eye View Semantic Segmentation / Lyft Level 5 / BEVFormer (EfficientNet-b4)
    IoU vehicle - 224x480 - Long: 44.5 (# 2)
    IoU vehicle - 224x480 - Short: 69.9 (# 5)

Bird's-Eye View Semantic Segmentation / Lyft Level 5 / BEVFormer (ResNet-50)
    IoU vehicle - 224x480 - Long: 43.2 (# 6)
    IoU vehicle - 224x480 - Short: 68.8 (# 6)

3D Object Detection / nuScenes / BEVFormer
    NDS: 0.57 (# 213)
    mAP: 0.48 (# 205)
    mATE: 0.58 (# 86)
    mASE: 0.26 (# 47)
    mAOE: 0.38 (# 167)
    mAVE: 0.38 (# 138)
    mAAE: 0.13 (# 103)

Bird's-Eye View Semantic Segmentation / nuScenes / BEVFormer
    IoU veh - 224x480 - No vis filter - 100x100 at 0.5: 35.8 (# 6)
    IoU veh - 448x800 - No vis filter - 100x100 at 0.5: 39.0 (# 4)
    IoU veh - 224x480 - Vis filter. - 100x100 at 0.5: 42.0 (# 4)
    IoU veh - 448x800 - Vis filter. - 100x100 at 0.5: 45.5 (# 4)
    IoU lane - 224x480 - 100x100 at 0.5: 25.7 (# 5)

Robust Camera Only 3D Object Detection / nuScenes-C / BEVFormer (small)
    mean Corruption Error (mCE): 102.4 (# 14)
    mean Resilience Rate (mRR): 59.07 (# 14)

Robust Camera Only 3D Object Detection / nuScenes-C / BEVFormer (base)
    mean Corruption Error (mCE): 97.97 (# 3)
    mean Resilience Rate (mRR): 60.4 (# 13)

3D Object Detection / nuScenes Camera Only / BEVFormer
    NDS: 56.9 (# 17)
    Future Frame: false (# 1)
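
For readers unfamiliar with NDS, the nuScenes benchmark defines it as a weighted combination of mAP and five true-positive error metrics: NDS = (5 * mAP + sum of (1 - min(1, error)) over mATE, mASE, mAOE, mAVE, mAAE) / 10. That formula comes from the nuScenes benchmark definition rather than from this page. The short sketch below applies it to the rounded nuScenes values listed above; the function name `nds` is hypothetical.

```python
def nds(m_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five true-positive error metrics."""
    # Each error is clipped at 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    # mAP carries weight 5; each of the five TP scores carries weight 1.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0

# Rounded values from the nuScenes 3D Object Detection entries listed above.
tp_errors = {"mATE": 0.58, "mASE": 0.26, "mAOE": 0.38, "mAVE": 0.38, "mAAE": 0.13}
score = nds(0.48, tp_errors)
print(round(score, 3))  # 0.567
```

The result, 0.567, agrees with the listed NDS of 0.57 up to rounding of the inputs.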
