We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity.
Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection.
In-context segmentation aims to segment novel images using a few labeled example images, termed "in-context examples", by exploring content similarities between the examples and the target.
We introduce a technique for novel view synthesis and use it to transform collected data to the viewpoint of target rigs, allowing us to train BEV segmentation models for diverse target rigs without any additional data collection or labeling cost.
It involves a human-like screener whose objective is to find the first k suitable candidates, rather than the best k, in a candidate pool given an initial screening order.
In contrast, we propose to use parametric depth distribution modeling for feature transformation.
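A minimal sketch of the idea, assuming a lift-splat-style transform in PyTorch: image features are weighted by a predicted per-pixel categorical depth distribution to form a depth-aware frustum. Module and parameter names (`DepthDistributionLift`, `num_depth_bins`) are illustrative, not the paper's.

```python
# Hedged sketch: lift 2D image features into a depth-aware frustum by weighting
# them with a per-pixel categorical depth distribution. Shapes and layer names
# are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

class DepthDistributionLift(nn.Module):
    def __init__(self, in_channels: int, num_depth_bins: int):
        super().__init__()
        # Predict a categorical distribution over depth bins for every pixel.
        self.depth_head = nn.Conv2d(in_channels, num_depth_bins, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features from the backbone.
        depth_logits = self.depth_head(feats)                    # (B, D, H, W)
        depth_probs = depth_logits.softmax(dim=1)                # per-pixel depth distribution
        # Outer product: each depth bin receives the feature scaled by its probability.
        frustum = depth_probs.unsqueeze(1) * feats.unsqueeze(2)  # (B, C, D, H, W)
        return frustum  # later collapsed/splatted into the BEV grid

feats = torch.randn(2, 64, 32, 88)
print(DepthDistributionLift(64, num_depth_bins=48)(feats).shape)  # torch.Size([2, 64, 48, 32, 88])
```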
This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 2023 Vision-Centric Autonomous Driving Workshop.
Ranked #1 on Prediction Of Occupancy Grid Maps on Occ3D-nuScenes
At a high level, global self-attention enables efficient cross-window communication at lower cost.
In uses of pre-trained machine learning models, it is a known issue that the target population in which the model is being deployed may not have been reflected in the source population with which the model was trained.
For any complainant, we find and compare similar protected and non-protected instances in the dataset used by the classifier to construct a control and test group, where a difference between the decision outcomes of the two groups implies potential individual discrimination.
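A minimal sketch of the control/test-group comparison, assuming a k-nearest-neighbour notion of similarity; the feature matrix, decision labels, and protected-group mask are hypothetical inputs.

```python
# Hedged sketch of the control/test-group comparison described above: for a
# complainant, take the k most similar protected and non-protected instances
# and compare their positive-decision rates. All inputs are hypothetical.
import numpy as np

def discrimination_score(x, X, y, protected, k=10):
    """x: complainant features, X: dataset features, y: decisions (0/1),
    protected: boolean mask of the protected group."""
    dists = np.linalg.norm(X - x, axis=1)
    prot_idx = np.argsort(dists[protected])[:k]
    nonprot_idx = np.argsort(dists[~protected])[:k]
    test_rate = y[protected][prot_idx].mean()         # outcomes of similar protected instances
    control_rate = y[~protected][nonprot_idx].mean()  # outcomes of similar non-protected instances
    # A large gap suggests potential individual discrimination.
    return control_rate - test_rate

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); y = rng.integers(0, 2, 200); prot = rng.random(200) < 0.3
print(discrimination_score(X[0], X, y, prot, k=10))
```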
To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images.
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.
In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework.
Ranked #16 on Domain Generalization on ImageNet-C
For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects and improving prediction recall.
Structured channel pruning has been shown to significantly accelerate inference time for convolution neural networks (CNNs) on modern hardware, with a relatively minor loss of network accuracy.
We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on the target device.
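An illustrative greedy simplification of the resource-allocation view (HALP itself describes a knapsack-style solver): keep channel groups with the best importance-per-latency ratio until the latency budget is spent. All names and numbers below are made up.

```python
# Illustrative greedy simplification of latency-budgeted structural pruning:
# keep channel groups with the best importance-per-latency ratio until the
# latency budget is spent. This only conveys the resource-allocation view,
# not HALP's actual knapsack solver.
from dataclasses import dataclass

@dataclass
class Group:
    name: str
    importance: float    # e.g., accumulated Taylor/weight-magnitude score
    latency_cost: float  # measured contribution to on-device latency (ms)

def select_groups(groups, latency_budget_ms):
    kept, spent = [], 0.0
    for g in sorted(groups, key=lambda g: g.importance / g.latency_cost, reverse=True):
        if spent + g.latency_cost <= latency_budget_ms:
            kept.append(g.name)
            spent += g.latency_cost
    return kept, spent

groups = [Group("layer1.conv", 0.9, 1.2), Group("layer2.conv", 0.4, 0.8), Group("layer3.conv", 0.7, 2.0)]
print(select_groups(groups, latency_budget_ms=2.5))
```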
Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect.
Existing semantic image retrieval methods often focus on mining larger-sized geographical landmarks, and/or require extra labeled data, such as images/image-pairs with similar objects, for mining images with generic objects.
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
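A minimal sketch of one common way to answer this, assuming validation error roughly follows a power law err(n) ≈ a·n^(−b) + c: fit the curve on a few measured subset sizes, then extrapolate the n needed to reach a target error. The measurements below are made-up examples, not results from the paper.

```python
# Minimal sketch: fit a power-law learning curve to a few (subset size, error)
# measurements and extrapolate how much data a target error would require.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

sizes = np.array([1_000, 2_000, 4_000, 8_000])
errors = np.array([0.42, 0.35, 0.30, 0.27])            # made-up validation errors
(a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1), maxfev=10_000)

target = 0.22
if target > c:
    needed = (a / (target - c)) ** (1.0 / b)
    print(f"estimated samples for {target:.2f} error: {needed:,.0f}")
else:
    print("target below the fitted asymptote; more data alone may not suffice")
```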
Knowledge distillation facilitates the training of a compact student network by using a deep teacher network.
Boundary pixels usually follow a multi-modal distribution as they represent different depths; therefore, the assumption results in an erroneous depth prediction at the coarser level of the cost volume pyramid that cannot be corrected at the refinement levels, leading to wrong depth predictions.
Our study is motivated by the intriguing properties of the emerging visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations.
Ranked #4 on Domain Generalization on ImageNet-R (using extra training data)
In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird's Eye View (BEV) space with multi-camera image inputs.
FreeSOLO further demonstrates superiority as a strong pre-training method, outperforming state-of-the-art self-supervised pre-training methods by +9.8% AP when fine-tuning instance segmentation with only 5% COCO masks.
We investigate the interaction between categorical encodings and target encoding regularization methods that reduce unfairness.
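A minimal sketch of one such regularizer, additive smoothing toward the global mean, which pulls rare categories away from their noisy per-category estimates. Column names are hypothetical, and the paper studies several regularizers beyond this one.

```python
# Hedged sketch of a common target-encoding regularizer: additive smoothing
# that blends each category's mean target with the global mean.
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, weight=10.0):
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Categories with few rows are pulled toward the global mean.
    encoding = (stats["mean"] * stats["count"] + global_mean * weight) / (stats["count"] + weight)
    return df[cat_col].map(encoding)

df = pd.DataFrame({"job": ["a", "a", "b", "c", "c", "c"], "hired": [1, 0, 1, 0, 0, 1]})
print(smoothed_target_encode(df, "job", "hired"))
```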
In this paper, we propose a new light-weight self-supervised learning framework that could boost supervised learning performance with minimum additional computation cost.
Through extensive experiments on ImageNet, we show that EPI enables quick tracking of early training epochs suitable for pruning, offering the same efficacy as an otherwise "oracle" grid search that scans through epochs and requires orders of magnitude more compute.
We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget.
Specifically, we supervise the attention modules in the mask decoder in a layer-wise manner.
Ranked #4 on Panoptic Segmentation on COCO test-dev
Prior works usually assume that SC offers privacy benefits as only intermediate features, instead of private data, are shared from devices to the cloud.
Deep neural networks have reached high accuracy on object detection but their success hinges on large amounts of labeled data.
We study the problem of quantizing N sorted, scalar datapoints with a fixed codebook containing K entries that are allowed to be rescaled.
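An illustrative coordinate-descent heuristic for this problem (not necessarily the paper's algorithm): alternate between assigning each point to its nearest scaled codebook entry and solving the least-squares-optimal scale for that assignment.

```python
# Minimal sketch: alternate (1) nearest-scaled-entry assignment and
# (2) the closed-form optimal rescaling for that assignment.
import numpy as np

def quantize_with_scaled_codebook(x, codebook, iters=20):
    x = np.sort(np.asarray(x, dtype=float))
    c = np.asarray(codebook, dtype=float)
    scale = 1.0
    for _ in range(iters):
        assign = np.abs(x[:, None] - scale * c[None, :]).argmin(axis=1)  # nearest entry
        denom = (c[assign] ** 2).sum()
        if denom == 0:
            break
        scale = (x * c[assign]).sum() / denom  # least-squares optimal rescaling
    return scale, assign

x = [0.1, 0.4, 0.5, 0.9, 2.0]
print(quantize_with_scaled_codebook(x, codebook=[0.0, 0.5, 1.0]))
```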
Knowledge distillation constitutes a simple yet effective way to improve the performance of a compact student network by exploiting the knowledge of a more powerful teacher.
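For concreteness, the textbook Hinton-style distillation objective, combining a softened teacher-matching term with the usual supervised loss; this is shown as background, not as the paper's exact formulation.

```python
# Classic knowledge-distillation loss: KL between temperature-softened teacher
# and student distributions, plus the standard cross-entropy on hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                     # match softened teacher distribution
    hard = F.cross_entropy(student_logits, labels)  # usual supervised term
    return alpha * soft + (1.0 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)
print(distillation_loss(s, t, torch.randint(0, 10, (8,))))
```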
We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
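A rough sketch of the all-MLP decoder idea, assuming PyTorch: project each backbone stage to a common width, upsample to a common resolution, concatenate, and fuse with another linear layer. Channel widths below are illustrative, not SegFormer's exact configuration.

```python
# Rough sketch of a lightweight all-MLP decoder over multi-scale features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=256, num_classes=19):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)  # per-stage linear
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.classify = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        target_hw = feats[0].shape[-2:]  # highest-resolution stage
        ups = [F.interpolate(p(f), size=target_hw, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))

feats = [torch.randn(1, c, 64 // s, 64 // s) for c, s in zip((32, 64, 160, 256), (1, 2, 4, 8))]
print(AllMLPDecoder()(feats).shape)  # torch.Size([1, 19, 64, 64])
```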
Ranked #1 on Semantic Segmentation on COCO-Stuff full
In this work, we introduce GradInversion, with which input images from a larger batch (8-48 images) can also be recovered for large networks such as ResNets (50 layers), on complex datasets such as ImageNet (1000 classes, 224x224 px).
As a result, image resampling alone is not enough to yield a sufficiently balanced distribution at the object level.
Here, we propose a self-supervised learning framework for multi-view stereo that exploits pseudo labels from the input data.
Training on synthetic data can be beneficial for label or data-scarce scenarios.
Most of these methods are based on multiple models or are straightforward extensions of classification methods, and hence estimate an image's informativeness using only the classification head.
For active learning, we propose a scoring function that aggregates uncertainties from both the classification and the localization outputs of the network.
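A minimal sketch of such an image-level score, assuming per-detection class probabilities and predicted localization standard deviations; the specific aggregation (mean over detections, weighted sum of the two terms) is illustrative rather than the paper's exact choice.

```python
# Hedged sketch: mix classification entropy with a localization-uncertainty proxy
# to score an image for labeling.
import numpy as np

def image_score(class_probs, box_sigmas, w_cls=1.0, w_loc=1.0):
    """class_probs: (num_dets, num_classes) softmax outputs,
    box_sigmas: (num_dets, 4) predicted localization std-devs."""
    entropy = -(class_probs * np.log(class_probs + 1e-12)).sum(axis=1)  # per-detection
    loc_unc = box_sigmas.mean(axis=1)
    return float(w_cls * entropy.mean() + w_loc * loc_unc.mean())

probs = np.array([[0.6, 0.3, 0.1], [0.34, 0.33, 0.33]])
sigmas = np.array([[0.1, 0.2, 0.1, 0.3], [0.5, 0.4, 0.6, 0.5]])
print(image_score(probs, sigmas))
```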
While federated learning traditionally aims to train a single global model across decentralized local datasets, one model may not always be ideal for all participating clients.
We have built a scalable production system for active learning in the domain of autonomous driving.
In this paper we present EMOTIC, a dataset of images of people in a diverse set of natural situations, annotated with their apparent emotion.
Ranked #3 on Emotion Recognition in Context on EMOTIC (using extra training data)
We propose a cost volume-based neural network for depth inference from multi-view images.
Ranked #11 on 3D Reconstruction on DTU
We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network.
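A minimal sketch of the core mechanism, assuming PyTorch and torchvision: optimize random inputs so that the frozen network predicts chosen classes while intermediate feature statistics match the batch-norm running statistics stored in the model. Hyperparameters and step counts are illustrative; the full method adds further regularizers.

```python
# Hedged sketch: synthesize inputs by matching batch-norm running statistics
# and a target-class objective on a frozen, pre-trained network.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
bn_losses = []

def bn_hook(module, inputs, _output):
    x = inputs[0]
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    bn_losses.append(((mean - module.running_mean) ** 2).sum()
                     + ((var - module.running_var) ** 2).sum())

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.register_forward_hook(bn_hook)

images = torch.randn(4, 3, 224, 224, requires_grad=True)
targets = torch.tensor([1, 2, 3, 4])
opt = torch.optim.Adam([images], lr=0.05)

for _ in range(5):  # a handful of steps just to show the loop
    opt.zero_grad()
    bn_losses.clear()
    logits = model(images)
    loss = nn.functional.cross_entropy(logits, targets) + 0.01 * sum(bn_losses)
    loss.backward()
    opt.step()
print(float(loss))
```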
In this paper, we propose to scale up ensemble Active Learning methods to perform acquisition at a large scale (10k to 500k samples at a time).
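A minimal sketch of acquisition at that scale, using a BALD-style mutual-information disagreement score as a stand-in for the ensemble acquisition function; pool size, class count, and the choice of score are illustrative.

```python
# Minimal sketch: score a large unlabeled pool by ensemble disagreement and
# take the top-k items for labeling in one shot.
import numpy as np

def acquire(pool_probs, k):
    """pool_probs: (num_models, pool_size, num_classes) softmax predictions."""
    mean_p = pool_probs.mean(axis=0)
    entropy_mean = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)
    mean_entropy = -(pool_probs * np.log(pool_probs + 1e-12)).sum(axis=-1).mean(axis=0)
    mutual_info = entropy_mean - mean_entropy  # BALD-style disagreement
    return np.argsort(-mutual_info)[:k]        # indices to send for labeling

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=(5, 100_000))  # 5 models, 100k-item pool
print(acquire(probs, k=10_000)[:5])
```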
Semantic segmentation with Convolutional Neural Networks is a memory-intensive task due to the high spatial resolution of feature maps and output predictions.
In this paper, we propose to scale up ensemble Active Learning (AL) methods to perform acquisition at a large scale (10k to 500k samples at a time).
One of the main challenges of deep learning tools is their inability to capture model uncertainty.
As evidenced by our experiments, our approach outperforms both training the compact network from scratch and performing knowledge distillation from a teacher.
Annotating the right data for training deep neural networks is an important challenge.
In this paper, we introduce Deep Probabilistic Ensembles (DPEs), a scalable technique that uses a regularized ensemble to approximate a deep Bayesian Neural Network (BNN).
Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently.
We show that domain transfer leads to large shifts in network activations and that it is desirable to take this into account when compressing.
Our experiments demonstrate the benefits of our classifier heatmaps and of our two-stream architecture on challenging urban scene datasets and on the YouTube-Objects benchmark, where we obtain state-of-the-art results.
In this paper, we go beyond this spatial information and propose a local-aware encoding of convolutional features based on semantic information predicted in the target image.
In this paper we present the Emotions in Context (EMOTIC) database, a dataset of images containing people in context in non-controlled environments.
We then show how to obtain multi-class masks by the fusion of foreground/background ones with information extracted from a weakly-supervised localization network.
Hence, weak supervision using only image tags could have a significant impact in semantic segmentation.
These algorithms reduce the effect of lighting variations and weather conditions by exploiting the discriminant/invariant properties of different color representations.