Diffusion models can naturally denoise noisy samples toward the ideal data, which motivates us to use a diffusion model to obtain a better BEV representation.
In this work, we propose a training-Free conditional Diffusion Model (FreeDoM) used for various conditions.
A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions.
Ranked #4 on Video Generation on UCF-101
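The forward process mentioned in the DPM entry above can be sketched in a few lines. This is a minimal illustration, assuming a linear variance schedule and the standard closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the schedule endpoints and all function names are illustrative assumptions, not details taken from the listed paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return per-step noise variances (assumed linear schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) up to step t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) for a data point given as a list of floats."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
x0 = [1.0, -0.5, 0.25]
x_early = forward_diffuse(x0, 10, alpha_bars, rng)   # still close to x0
x_late = forward_diffuse(x0, 999, alpha_bars, rng)   # almost pure noise
```

Because alpha_bar_t shrinks toward zero as t grows, late samples carry almost no signal from x0, which is what the learned reverse process has to undo.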
The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.
Ranked #1 on 3D Object Detection on nuScenes Camera Only
By continually acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate richer image descriptions.
For expressivity, we propose a new SSM parameterization based on the companion matrix -- a canonical representation for discrete-time processes -- which enables SpaceTime's SSM layers to learn desirable autoregressive processes.
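To make the companion-matrix idea concrete, the sketch below builds a companion matrix for an order-p autoregressive process and rolls it forward. It assumes one common layout (coefficients in the first row, ones on the subdiagonal), which may differ from the layout used in SpaceTime; the function names are hypothetical.

```python
def companion_matrix(coeffs):
    """Build the p x p companion matrix for AR coefficients a_1..a_p.

    Assumed convention: coefficients in the first row, ones on the
    subdiagonal. Transposed layouts are equally common.
    """
    p = len(coeffs)
    matrix = [[0.0] * p for _ in range(p)]
    matrix[0] = [float(a) for a in coeffs]
    for i in range(1, p):
        matrix[i][i - 1] = 1.0
    return matrix


def step(matrix, state):
    """Advance the state one time step: x_{t+1} = A x_t."""
    return [sum(a * s for a, s in zip(row, state)) for row in matrix]


def rollout(coeffs, history, num_steps):
    """Forecast num_steps values; history holds p values, most recent first."""
    matrix = companion_matrix(coeffs)
    state = [float(v) for v in history]
    forecasts = []
    for _ in range(num_steps):
        state = step(matrix, state)
        forecasts.append(state[0])
    return forecasts


# Fibonacci written as an AR(2) process: y_t = y_{t-1} + y_{t-2}
fib = rollout([1.0, 1.0], history=[1.0, 1.0], num_steps=5)
print(fib)  # [2.0, 3.0, 5.0, 8.0, 13.0]
```

The first row computes the next value from the last p values, and the subdiagonal shifts the older values down by one slot, so learning the first row amounts to learning the autoregressive coefficients.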
We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer.
All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks.
Ranked #1 on Referring Expression Comprehension on RefCOCOg-test (using extra training data)
We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions.
Ranked #1 on Training-free 3D Part Segmentation on ShapeNet-Part
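The non-learnable building blocks named in the Point-NN entry above (farthest point sampling, k-nearest neighbors, pooling) can be illustrated with the standard library alone. This is a rough sketch of the generic operations, not Point-NN's implementation; it omits the trigonometric encoding, and the function names and toy point cloud are assumptions made for illustration.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick indices so each new point is farthest from those chosen."""
    chosen = [start_index]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(chosen) < min(num_samples, len(points)):
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return chosen


def k_nearest_neighbors(points, query_index, k):
    """Return indices of the k points closest to points[query_index], excluding itself."""
    order = sorted(
        (i for i in range(len(points)) if i != query_index),
        key=lambda i: math.dist(points[i], points[query_index]),
    )
    return order[:k]


def max_pool(features):
    """Channel-wise max over a list of equal-length feature vectors."""
    return [max(channel) for channel in zip(*features)]


cloud = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (0.0, 0.0, 3.0),
]
centers = farthest_point_sampling(cloud, 3)            # [0, 4, 3]
neighbors = k_nearest_neighbors(cloud, centers[0], 2)  # [1, 2]
pooled = max_pool([cloud[i] for i in neighbors])       # [1.0, 0.0, 0.0]
```

None of these steps has trainable parameters: sampling picks well-spread centers, k-NN gathers a local neighborhood around each center, and pooling summarizes that neighborhood into one vector.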
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance.
Ranked #2 on Language Modelling on C4
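As a rough illustration of what 8-bit matrix multiplication involves, the sketch below quantizes each row of the input and each column of the weights with its own absmax scale, accumulates in integers, and rescales the result. It is a simplified assumption-laden example rather than the procedure from the entry above, which may include further steps; all names are hypothetical.

```python
def quantize_absmax(vector):
    """Map a float vector to integers in [-127, 127] plus one float scale."""
    max_abs = max(abs(v) for v in vector)
    if max_abs == 0.0:
        return [0] * len(vector), 1.0
    scale = max_abs / 127.0
    return [round(v / scale) for v in vector], scale


def int8_matmul(x, w):
    """Approximate x @ w using integer multiply-accumulate.

    x and w are lists of rows. Each row of x and each column of w gets
    its own scale (vector-wise quantization, assumed for this sketch).
    """
    w_columns = [list(col) for col in zip(*w)]
    x_quant = [quantize_absmax(row) for row in x]
    w_quant = [quantize_absmax(col) for col in w_columns]
    out = []
    for x_ints, x_scale in x_quant:
        out_row = []
        for w_ints, w_scale in w_quant:
            acc = sum(a * b for a, b in zip(x_ints, w_ints))  # integer accumulate
            out_row.append(acc * x_scale * w_scale)           # dequantize
        out.append(out_row)
    return out


def float_matmul(x, w):
    """Reference full-precision product for comparison."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w = [[1.0, -2.0], [0.5, 0.1], [-0.3, 0.8]]
approx = int8_matmul(x, w)
exact = float_matmul(x, w)
max_error = max(
    abs(a - e)
    for row_a, row_e in zip(approx, exact)
    for a, e in zip(row_a, row_e)
)
```

Using one scale per row and per column keeps a single large value from degrading the precision of every other entry in the matrix, at the cost of storing a few extra floating-point scales.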