We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.
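As a rough illustration of the smoothing idea behind SmoothQuant, the sketch below (the function name `smooth_linear` and the calibration tensor `act_absmax` are our own, not the released API) migrates quantization difficulty from activations to weights with a per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha):

```python
import torch

def smooth_linear(weight: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Compute per-input-channel smoothing scales for one Linear layer.

    weight:     [out_features, in_features] frozen pre-trained weight
    act_absmax: [in_features] per-channel max |activation|, collected
                offline on a small calibration set
    alpha:      migration strength (0.5 is the paper's default)
    """
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)            # [in_features]
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha): activation outlier
    # channels get divided down, the weights absorb the difficulty.
    scales = act_absmax.clamp(min=1e-5).pow(alpha) / w_absmax.pow(1.0 - alpha)
    smoothed_weight = weight * scales    # fold s into the weight offline
    return smoothed_weight, scales       # divide activations by s at runtime
```

At runtime the activations are divided by the same scales (in practice this division can be folded into the preceding LayerNorm), after which both operands of the matmul are far friendlier to INT8 quantization.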
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt.
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
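A minimal PyTorch sketch of the rank-decomposition idea (the class name and the r/alpha defaults are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen Linear layer with a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scaling = alpha / r
        # A starts Gaussian, B starts at zero, so the update is zero at init.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Only `lora_A` and `lora_B` receive gradients, so wrapping, e.g., the attention projections of a Transformer trains a tiny fraction of the original parameter count.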
Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior.
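The snippet only names the scheme, but an InfoNCE-style loss is one plausible reading of the preposition prior: pull the learnable relation embedding toward preposition token embeddings and away from words of other parts of speech. Everything below (function name, tensor shapes, temperature) is an assumption for illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

def preposition_prior_loss(rel_emb, prep_embs, other_embs, temperature=0.07):
    """Contrastive steering loss: prepositions are positives, other words negatives.

    rel_emb:    [d]    learnable relation-prompt embedding
    prep_embs:  [P, d] embeddings of preposition words (positives)
    other_embs: [N, d] embeddings of non-preposition words (negatives)
    """
    rel = F.normalize(rel_emb, dim=-1)
    pos = F.normalize(prep_embs, dim=-1)
    neg = F.normalize(other_embs, dim=-1)
    pos_logits = pos @ rel / temperature          # [P]
    neg_logits = neg @ rel / temperature          # [N]
    # Each positive competes against the shared pool of negatives.
    all_logits = torch.cat(
        [pos_logits.unsqueeze(1),
         neg_logits.unsqueeze(0).expand(pos_logits.size(0), -1)],
        dim=1,
    )                                             # [P, 1 + N]
    return (torch.logsumexp(all_logits, dim=1) - pos_logits).mean()
```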
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
Ranked #1 on Question Answering on PIQA.
Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies.
Ranked #1 on 3D Object Detection on Argoverse2.
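A minimal sketch of what predicting directly from sparse voxel features could look like: shared linear heads score and regress a box from every non-empty voxel, with no dense BEV map, anchors, or other hand-crafted proxies in between. All names and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

class SparseVoxelHead(nn.Module):
    """Predict class scores and box parameters directly per sparse voxel."""

    def __init__(self, channels: int = 128, num_classes: int = 10, box_dim: int = 7):
        super().__init__()
        self.cls_head = nn.Linear(channels, num_classes)  # per-voxel class logits
        self.box_head = nn.Linear(channels, box_dim)      # e.g. (x, y, z, l, w, h, yaw)

    def forward(self, voxel_feats: torch.Tensor):
        # voxel_feats: [N, C] features of the N non-empty voxels from a sparse backbone
        return self.cls_head(voxel_feats), self.box_head(voxel_feats)
```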
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face renderer for talking-head generation.
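As a heavily simplified sketch of the audio-to-coefficient stage only (the actual SadTalker pipeline is more elaborate, and every name and dimension below is an assumption), an encoder could map per-frame audio features to expression and pose coefficients that a 3DMM-driven renderer consumes:

```python
import torch
import torch.nn as nn

class AudioToCoeffs(nn.Module):
    """Map per-frame audio features to 3DMM motion coefficients (expression + pose)."""

    def __init__(self, audio_dim: int = 80, exp_dim: int = 64, pose_dim: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.exp_head = nn.Linear(256, exp_dim)    # expression coefficients
        self.pose_head = nn.Linear(256, pose_dim)  # head rotation + translation

    def forward(self, audio_feats: torch.Tensor):
        # audio_feats: [T, audio_dim], e.g. per-frame mel-spectrogram features
        h = self.encoder(audio_feats)
        return self.exp_head(h), self.pose_head(h)
```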
The performance of video prediction has been greatly boosted by advanced deep neural networks.
Ranked #1 on Video Prediction on DAVIS 2017.