Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.
Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens.
Image Captioning
Video-based Generative Performance Benchmarking
We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality.
1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristic that distinguishes large kernels from small ones: they can see wide without going deep.
Ranked #1 on Object Detection on COCO 2017 (mAP metric)
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans.
A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses.
Panoptic Scene Graph (PSG) is a challenging task in Scene Graph Generation (SGG) that aims to create a more comprehensive scene graph representation using panoptic segmentation instead of boxes.
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.
Image-based Generative Performance Benchmarking
Language Modelling
The recent advancements in text-to-3D generation mark a significant milestone in generative models, unlocking new possibilities for creating imaginative 3D assets across various real-world scenarios.
Overall, our method can create lifelike avatars with dynamic, realistic and generalized appearances.