The image composition task can be decomposed into multiple sub-tasks, each targeting one or more issues.
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations).
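A minimal sketch of such a screenshot-in, action-out loop, assuming a hypothetical `query_agent` stand-in for the model and an illustrative action schema (neither is UI-TARS's actual interface):

```python
# Hypothetical perceive-act loop in the spirit of a native GUI agent.
# `query_agent` and the action dict format are illustrative assumptions.
import pyautogui

def query_agent(screenshot, instruction):
    """Stand-in for the agent model: maps a raw screenshot plus the task
    instruction to a structured action."""
    return {"type": "done"}  # placeholder so the sketch runs end to end

def run_episode(instruction, max_steps=20):
    for _ in range(max_steps):
        shot = pyautogui.screenshot()            # perception: pixels only
        action = query_agent(shot, instruction)  # decide the next interaction
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.typewrite(action["text"])
        elif action["type"] == "done":
            break

run_episode("open the settings menu")
```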
We present MILS (Multimodal Iterative LLM Solver), a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM.
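A hedged sketch of the training-free generate-score-feedback loop the name suggests: an off-the-shelf LLM proposes candidates, a frozen multimodal scorer (e.g., CLIP image-text similarity) ranks them, and the best candidates are fed back as in-context feedback. All function names here are illustrative assumptions, not the paper's API:

```python
# Illustrative iterative solver loop; `propose` and `score` are assumptions.
def mils_solve(propose, score, steps=10, beam=8):
    """propose(feedback, n) -> list of n candidate strings from the LLM;
    score(candidate) -> float from a frozen multimodal scorer."""
    candidates = propose(feedback=None, n=beam)      # cold-start guesses
    for _ in range(steps):
        ranked = sorted(candidates, key=score, reverse=True)
        # feed the top-scoring candidates back to the LLM as feedback
        candidates = propose(feedback=ranked[:beam // 2], n=beam)
    return max(candidates, key=score)                # best candidate found
```

No gradients flow anywhere in this loop; the only "optimization" is the LLM conditioning on scored feedback, which is what makes the approach training-free.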
Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds.
Rapid progress in text-to-motion generation has been largely driven by diffusion models.
Ranked #1 on Motion Synthesis on KIT Motion-Language.
We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation.
Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise.
We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI.
Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50.
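The excerpt does not define recall@k; assuming the standard retrieval definition, it is

$$
\mathrm{recall@}k \;=\; \frac{\lvert\, \text{relevant items} \,\cap\, \text{top-}k\ \text{retrieved} \,\rvert}{\lvert\, \text{relevant items} \,\rvert},
$$

i.e., the fraction of all relevant papers that appear among the first $k$ results.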
Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.
Ranked #1 on Lipreading on LRS2 (using extra training data).