We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt.
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
Ranked #1 on Question Answering on PIQA
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions (a minimal sketch of this forward process appears below).
Ranked #4 on Video Generation on UCF-101
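To make the forward/reverse idea in the snippet above concrete, here is a minimal sketch of DDPM-style forward noising; the schedule constants, the `forward_diffuse` name, and the toy shapes are assumptions for illustration, not this paper's implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def forward_diffuse(x0, t):
    """Sample q(x_t | x_0): the clean data x0 with noise accumulated up to step t."""
    noise = torch.randn_like(x0)
    xt = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise
    return xt, noise

x0 = torch.zeros(4, 4)                           # toy "data point"
xt, eps = forward_diffuse(x0, t=500)
# A reverse network eps_theta(x_t, t) would be trained to predict `eps` and then
# used to step from pure noise back to a new sample; that part is omitted here.
```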
Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies (a toy version of such a direct prediction head is sketched below).
Ranked #1 on 3D Object Detection on Argoverse2
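As a rough illustration of predicting objects directly from sparse voxel features, the hypothetical head below maps each non-empty voxel feature straight to class logits and a box, with no anchors or other hand-crafted proxies; the `DirectVoxelHead` class, feature dimension, and box parameterization are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DirectVoxelHead(nn.Module):
    """Toy head that turns each non-empty voxel feature directly into a prediction."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)  # per-voxel class logits
        self.box = nn.Linear(feat_dim, 7)            # (x, y, z, w, l, h, yaw)

    def forward(self, voxel_feats):                  # (N_voxels, feat_dim)
        return self.cls(voxel_feats), self.box(voxel_feats)

head = DirectVoxelHead()
feats = torch.randn(2048, 128)                       # features of non-empty voxels only
logits, boxes = head(feats)
```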
To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding.
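One plausible way to read "a bank of 3D-aware hierarchical features" is that a global latent, per-point features, and pixel-aligned features sampled from an image feature map are gathered per query point; the sketch below shows that gathering step only. The `build_feature_bank` function, tensor shapes, and encoders are illustrative assumptions, not the paper's networks.

```python
import torch
import torch.nn.functional as F

def build_feature_bank(point_feats, img_feats, proj_uv, global_code):
    # point_feats: (N, P)    per-point features from some point encoder
    # img_feats:   (C, H, W) 2D feature map from an image backbone
    # proj_uv:     (N, 2)    image projections of the points, normalized to [-1, 1]
    # global_code: (G,)      a single global latent for the whole input
    pixel_feats = F.grid_sample(                      # pixel-aligned bilinear lookup
        img_feats[None], proj_uv[None, None], align_corners=True
    )[0, :, 0].T                                      # -> (N, C)
    global_feats = global_code[None].expand(point_feats.shape[0], -1)  # -> (N, G)
    return torch.cat([global_feats, point_feats, pixel_feats], dim=-1)

bank = build_feature_bank(
    torch.randn(1024, 32),        # point-level features
    torch.randn(64, 40, 40),      # image feature map
    torch.rand(1024, 2) * 2 - 1,  # normalized projections
    torch.randn(128),             # global code
)
```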
The performance of video prediction has been greatly boosted by advanced deep neural networks.
Ranked #1 on Video Prediction on DAVIS 2017
To this end, we build a system called Visual ChatGPT that incorporates different Visual Foundation Models, enabling the user to interact with ChatGPT by 1) sending and receiving not only language but also images, and 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models across multiple steps.
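The core routing idea behind such a system can be sketched as a language model choosing which visual foundation model to invoke at each step, with tool outputs fed back into the chat context. Everything below, including the tool names and the `dispatch` helper, is a hypothetical stand-in, not the released Visual ChatGPT code.

```python
from typing import Callable, Dict, List

def caption_image(path: str) -> str:
    """Stand-in for a captioning foundation model."""
    return f"a caption describing {path}"

def edit_image(path: str, instruction: str) -> str:
    """Stand-in for an instruction-driven image editing model."""
    return f"{path} edited so that: {instruction}"

TOOLS: Dict[str, Callable[..., str]] = {
    "caption": caption_image,
    "edit": edit_image,
}

def dispatch(step: dict) -> str:
    """Execute one tool call of the form {'tool': name, 'args': [...]}."""
    return TOOLS[step["tool"]](*step["args"])

# In the real system the language model would produce this plan; here it is
# hard-coded to show how a multi-step visual instruction becomes tool calls.
plan: List[dict] = [
    {"tool": "caption", "args": ["cat.png"]},
    {"tool": "edit", "args": ["cat.png", "add a party hat"]},
]
chat_context = [dispatch(step) for step in plan]   # outputs fed back to the chat
```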
To bridge this gap, in this paper we propose a novel NeRF-based LiDAR odometry and mapping approach, NeRF-LOAM, consisting of three modules: neural odometry, neural mapping, and mesh reconstruction.
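A possible way these three modules compose per LiDAR scan is sketched below: odometry estimates the pose against the current neural map, mapping fuses the posed scan into that map, and meshing runs once at the end. The function names, signatures, and placeholder bodies are assumptions for illustration, not NeRF-LOAM's actual interfaces.

```python
def neural_odometry(scan, implicit_map, prev_pose):
    """Estimate the current pose by aligning the scan against the neural implicit map."""
    return prev_pose  # placeholder: pose optimization omitted

def neural_mapping(scan, pose, implicit_map):
    """Fuse the posed scan into the neural implicit map."""
    return implicit_map  # placeholder: map update omitted

def mesh_reconstruction(implicit_map):
    """Extract a triangle mesh from the implicit map (e.g. via marching cubes)."""
    return []  # placeholder mesh

implicit_map, pose = {}, None
for scan in []:  # stream of LiDAR scans would go here
    pose = neural_odometry(scan, implicit_map, pose)
    implicit_map = neural_mapping(scan, pose, implicit_map)
mesh = mesh_reconstruction(implicit_map)
```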