SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses.

Talking Face Generation, Talking Head Generation

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D Occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes.

Autonomous Driving

Initializing Models with Larger Ones

Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era.

Knowledge Distillation

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristic that distinguishes large kernels from small ones: they can see wide without going deep.

Image Classification, Object Detection, +3

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens.

Image Captioning, Video-based Generative Performance Benchmarking, +2

Adversarial Diffusion Distillation

We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality.

Image Generation

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.

Zero-Shot Learning

RETVec: Resilient and Efficient Text Vectorizer

The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks.

Adversarial Text, Metric Learning, +1

Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling

Overall, our method can create lifelike avatars with dynamic, realistic and generalized appearances.

Instruction Tuning with Human Curriculum

The dominant paradigm for instruction tuning is the random-shuffled training of maximally diverse instruction-response pairs.

