Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt.
We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.
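To make the W8A8 setting concrete, here is a minimal sketch of symmetric 8-bit quantization applied to both the weights and activations of a single linear layer. The quantize_int8 and w8a8_linear helpers are hypothetical, and the per-tensor activation / per-channel weight scaling shown here is a common post-training-quantization choice rather than the paper's exact recipe.

```python
import numpy as np

def quantize_int8(x, axis=None):
    """Symmetric int8 quantization: returns an int8 tensor and its float scale."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True) + 1e-8
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x, w):
    """Linear layer y = x @ w.T with both activations and weights in int8."""
    qx, sx = quantize_int8(x)           # per-tensor activation scale
    qw, sw = quantize_int8(w, axis=1)   # per-output-channel weight scales
    # int8 matmul accumulated in int32, then dequantized back to float
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T
    return acc.astype(np.float32) * sx * sw.squeeze(-1)

x = np.random.randn(4, 64).astype(np.float32)   # activations
w = np.random.randn(32, 64).astype(np.float32)  # weights
y_q = w8a8_linear(x, w)
y_fp = x @ w.T
print(np.max(np.abs(y_q - y_fp)))  # quantization error stays small
```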
Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior.
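As a rough illustration of how a preposition prior could be enforced contrastively, the sketch below pulls a learnable relation embedding toward preposition embeddings and pushes it away from other word embeddings with an InfoNCE-style loss. The preposition_prior_loss function and all tensor names are hypothetical and do not reproduce the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def preposition_prior_loss(relation_emb, preposition_embs, negative_embs, temperature=0.07):
    """Contrastive (InfoNCE-style) loss steering a learnable relation token
    toward preposition embeddings (positives) and away from other words."""
    rel = F.normalize(relation_emb, dim=-1)        # (d,)
    pos = F.normalize(preposition_embs, dim=-1)    # (P, d)
    neg = F.normalize(negative_embs, dim=-1)       # (N, d)
    pos_sim = pos @ rel / temperature              # (P,)
    neg_sim = neg @ rel / temperature              # (N,)
    # log-softmax of positives against the union of positives and negatives
    all_sim = torch.cat([pos_sim, neg_sim])
    return -(pos_sim - torch.logsumexp(all_sim, dim=0)).mean()

# toy usage: 768-dim text-embedding space, a few prepositions vs. random words
d = 768
relation_emb = torch.randn(d, requires_grad=True)   # learnable relation token
preposition_embs = torch.randn(5, d)                # e.g. "on", "under", ...
negative_embs = torch.randn(100, d)                 # non-preposition words
loss = preposition_prior_loss(relation_emb, preposition_embs, negative_embs)
loss.backward()
```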
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
Ranked #1 on Question Answering on PIQA
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.
A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions.
Ranked #4 on Video Generation on UCF-101
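The forward-noising and reverse-denoising recursion described above can be written in a few lines using the standard DDPM parameterization. The sketch below assumes a linear beta schedule and a deterministic DDIM-style reverse update; model(xt, t) is a placeholder noise predictor, not any specific architecture from the paper.

```python
import torch

# Standard DDPM-style schedule: beta_t increases linearly, alpha_bar_t = prod(1 - beta)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0): a gradually noised version of the data."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps

def reverse_step(model, xt, t):
    """One denoising step (t >= 1): the model predicts the noise, which is used
    to estimate x_0 and take a deterministic DDIM-style step toward x_{t-1}."""
    eps_hat = model(xt, t)
    ab, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    x0_hat = (xt - (1.0 - ab).sqrt() * eps_hat) / ab.sqrt()
    return ab_prev.sqrt() * x0_hat + (1.0 - ab_prev).sqrt() * eps_hat
```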
Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies.
Ranked #1 on 3D Object Detection on Argoverse2
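A toy version of predicting objects directly from sparse voxel features, with no anchors or other hand-crafted proxies, might look like the following: every occupied voxel emits class scores and a box, and confident voxels are kept as detections. The SparseVoxelHead module and its score thresholding are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SparseVoxelHead(nn.Module):
    """Toy head that predicts objects directly from sparse voxel features:
    each non-empty voxel emits class scores and a box, without anchors or proxies."""
    def __init__(self, feat_dim=128, num_classes=3, box_dim=7):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.box_head = nn.Linear(feat_dim, box_dim)  # (x, y, z, l, w, h, yaw)

    def forward(self, voxel_feats, voxel_coords, score_thresh=0.3):
        scores = self.cls_head(voxel_feats).sigmoid()   # (V, C)
        boxes = self.box_head(voxel_feats)              # (V, 7)
        best, labels = scores.max(dim=-1)
        keep = best > score_thresh                      # confident voxels become detections
        return boxes[keep], labels[keep], best[keep], voxel_coords[keep]

# toy usage: 500 occupied voxels with 128-dim features and integer coordinates
head = SparseVoxelHead()
feats = torch.randn(500, 128)
coords = torch.randint(0, 200, (500, 3))
boxes, labels, scores, kept_coords = head(feats, coords)
```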
To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images, and 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps.
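To give a flavor of how multiple visual foundation models can collaborate over multiple steps, here is a minimal, hypothetical tool-dispatch loop: a controller model chooses a registered visual tool, the tool's output is appended to the conversation, and the loop repeats until a final reply is produced. The chat_model controller and the stub tools are placeholders, not Visual ChatGPT's actual components or APIs.

```python
from typing import Callable, Dict

# Registered visual tools; each stub stands in for a visual foundation model.
TOOLS: Dict[str, Callable[[str], str]] = {
    "caption_image": lambda path: f"a caption describing {path}",
    "edit_image":    lambda path: f"{path} edited per instruction",
}

def chat_model(prompt: str) -> str:
    """Placeholder controller: returns either 'TOOL:<name>:<arg>' or a final answer."""
    return "FINAL: done"

def run(user_request: str, image_path: str, max_steps: int = 5) -> str:
    context = f"User: {user_request}\nImage: {image_path}"
    for _ in range(max_steps):                   # allow multi-step tool use
        decision = chat_model(context)
        if decision.startswith("TOOL:"):
            _, name, arg = decision.split(":", 2)
            result = TOOLS[name](arg)            # invoke the chosen visual model
            context += f"\n{name} -> {result}"   # feed the result back to the controller
        else:
            return decision                      # controller produced the final reply
    return context
```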
The performance of video prediction has been greatly boosted by advanced deep neural networks.
Ranked #1 on Video Prediction on DAVIS 2017