The rapid development of large language models has revolutionized code intelligence in software development.
Ranked #4 on Code Generation on APPS.
The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding.
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations).
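As a rough illustration of this screenshot-in, action-out loop, here is a minimal sketch; `query_agent` is a hypothetical stand-in for the model itself, and only the `pyautogui` calls are real APIs.

```python
# Minimal sketch of a screenshot-only GUI agent loop. `query_agent` is a
# hypothetical placeholder for the agent model; pyautogui supplies the
# human-like keyboard/mouse primitives.
import pyautogui

def query_agent(screenshot, instruction):
    """Hypothetical model call: maps (screenshot, instruction) to an action
    dict such as {"type": "click", "x": 120, "y": 340} or
    {"type": "type", "text": "hello"}."""
    raise NotImplementedError  # replace with a real model endpoint

def run_episode(instruction, max_steps=20):
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()        # the agent's only percept
        action = query_agent(screenshot, instruction)
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"])
        elif action["type"] == "done":
            break
```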
Drawing inspiration from the inherent context consistency of language models, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story).
We conclude that TOTEM matches or outperforms existing state-of-the-art models in both the canonical specialist setting (i.e., training one model on one domain) and the generalist setting (i.e., training a single model on many domains), demonstrating the efficacy of tokenization for general time series analysis.
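To make "tokenization" concrete, here is a hedged sketch of turning a time series into discrete tokens by patching it and snapping each patch to its nearest codeword; the patch length and random codebook are illustrative assumptions, not TOTEM's exact recipe.

```python
# Illustrative time-series tokenization: slice the series into fixed-length
# patches and emit the index of the nearest codeword for each patch.
import numpy as np

def tokenize(series, codebook, patch_len=4):
    """series: (T,) array; codebook: (K, patch_len) array of codewords."""
    usable = len(series) // patch_len * patch_len
    patches = series[:usable].reshape(-1, patch_len)           # (N, patch_len)
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=-1)                               # (N,) token ids

# Example: 16 random codewords over length-4 patches.
rng = np.random.default_rng(0)
tokens = tokenize(rng.standard_normal(100), rng.standard_normal((16, 4)))
```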
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM.
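The "iterative" part can be pictured as a generate-score-refeed loop; the sketch below assumes an LLM proposer (`llm_generate`) and an off-the-shelf multimodal scorer (`clip_score`), both hypothetical stand-ins for real model endpoints.

```python
# Hedged sketch of a training-free generator-scorer loop: an LLM proposes
# candidate captions, a multimodal scorer ranks them, and the best candidates
# are fed back as context for the next round.
def iterative_caption(image, llm_generate, clip_score, rounds=5, n=8):
    candidates = llm_generate("Describe an image.", n=n)
    best = None
    for _ in range(rounds):
        ranked = sorted(candidates, key=lambda c: clip_score(image, c),
                        reverse=True)
        best = ranked[0]
        feedback = "\n".join(ranked[:3])  # top candidates steer the next round
        candidates = llm_generate(
            f"Improve on these image descriptions:\n{feedback}", n=n)
    return best
```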
We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI.
We study how to apply large language models to write grounded and organized long-form articles from scratch, with breadth and depth comparable to Wikipedia pages.
Our experiments demonstrate that, with its fully differentiable design and semantically rich latent space, SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models.
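A plausible reading of the "soft" and "fully differentiable" claims is softmax-weighted codebook assignment in place of the hard nearest-neighbor lookup of standard VQ; the PyTorch sketch below shows that mechanism, with the temperature and tensor shapes as assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_quantize(z, codebook, tau=1.0):
    """z: (B, D) encoder outputs; codebook: (K, D) learnable codewords.
    Soft assignment keeps the encoder-to-decoder path fully differentiable."""
    dists = torch.cdist(z, codebook)           # (B, K) pairwise L2 distances
    weights = F.softmax(-dists / tau, dim=-1)  # soft codeword assignment
    return weights @ codebook                  # (B, D) soft-quantized latents
```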
The image composition task can be decomposed into multiple sub-tasks, each targeting one or more specific issues.
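As an illustration of that decomposition, the stub pipeline below chains one module per sub-task; the sub-task names are common examples from the composition literature, not necessarily the paper's own split.

```python
# Hypothetical sub-task pipeline; each stub stands in for a model that fixes
# one issue of the pasted composite.
def color_harmonization(img):
    return img  # stub: match foreground appearance to the background

def shadow_generation(img):
    return img  # stub: synthesize a plausible shadow for the inserted object

def blending(img):
    return img  # stub: smooth the foreground/background boundary

def compose(pasted_composite,
            subtasks=(color_harmonization, shadow_generation, blending)):
    for subtask in subtasks:
        pasted_composite = subtask(pasted_composite)
    return pasted_composite
```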