We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.
Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies.
Ranked #1 on 3D Object Detection on Argoverse2
The performance of video prediction has been greatly boosted by advanced deep neural networks.
Ranked #1 on Video Prediction on DAVIS 2017
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input.
To bridge this gap, we propose a novel NeRF-based LiDAR odometry and mapping approach, NeRF-LOAM, consisting of three modules: neural odometry, neural mapping, and mesh reconstruction.
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
Ranked #1 on Question Answering on PIQA
A diffusion probabilistic model (DPM) constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples; DPMs have been shown to handle complex data distributions.
Ranked #4 on Video Generation on UCF-101
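The forward (noising) process mentioned above has a well-known closed form in standard DDPM notation: q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I). A minimal sketch of that step, assuming the common linear beta schedule (the names `linear_betas`, `alpha_bar`, and `forward_sample` are illustrative, not this paper's code):

```python
import math
import random

def linear_betas(T=1000, lo=1e-4, hi=0.02):
    # Common linear noise schedule: beta_t ramps from lo to hi over T steps.
    return [lo + (hi - lo) * t / (T - 1) for t in range(T)]

def alpha_bar(betas, t):
    # Cumulative product alpha_bar_t = prod_{s<=t} (1 - beta_s);
    # it decays toward 0, so late-step samples are nearly pure noise.
    prod = 1.0
    for b in betas[: t + 1]:
        prod *= 1.0 - b
    return prod

def forward_sample(x0, t, betas, eps=None):
    # Sample x_t from q(x_t | x_0) in one shot using the closed form:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    ab = alpha_bar(betas, t)
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
```

The reverse (denoising) network is then trained to predict `eps` from `x_t` and `t`, which is what lets the model generate new samples by iterating the learned reverse process from noise.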
Attempting to train the visual and text encoder to account for this shift results in catastrophic forgetting and a notable decrease in performance.
To this end, we build a system called \textbf{Visual ChatGPT} that incorporates different Visual Foundation Models, enabling users to interact with ChatGPT by 1) sending and receiving not only language but also images, and 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps.
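The multi-model collaboration described above amounts to a controller that routes each request either to the language model alone or through visual tools first. A minimal sketch of that dispatch idea, assuming stub tools (`chat`, `detect_objects`, `edit_image` are hypothetical placeholders, not the system's actual API):

```python
def chat(text):
    # Stub language model: stands in for the ChatGPT call.
    return f"LLM answer to: {text}"

def detect_objects(image):
    # Stub visual foundation model for recognition.
    return ["cat", "sofa"]

def edit_image(image, instruction):
    # Stub visual foundation model for editing; returns a new image handle.
    return f"{image}+{instruction}"

def handle(request):
    """Route a request (dict with 'text' and optional 'image') through
    visual tools when an image is present, then hand results to the LLM."""
    if "image" in request:
        if "edit" in request["text"].lower():
            result = edit_image(request["image"], request["text"])
            return chat(f"Edited image produced: {result}")
        objs = detect_objects(request["image"])
        return chat(f"Objects found {objs}; question: {request['text']}")
    return chat(request["text"])
```

The design point is that the LLM never sees pixels directly; tools translate between images and text, and a chain of tool calls handles multi-step instructions.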
Our approach also exhibits stronger zero-shot shape-aware editing ability based on the text-to-video model.