To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information.
We introduce UFO, an innovative UI-Focused agent that fulfills user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision.
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on.
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Image diffusion models have been utilized in various tasks, such as text-to-image generation and controllable image synthesis.
In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation.
We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA.
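LLoCO's exact architecture is not shown here, but one of its named ingredients, parameter-efficient finetuning with LoRA, can be sketched minimally: a frozen weight matrix is augmented with a trainable low-rank update, so finetuning touches only the two small factor matrices. All names and sizes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal LoRA sketch (assumed names/sizes, not LLoCO's actual code):
# output = x @ W.T + x @ (B @ A).T, where only A and B are trained.
rng = np.random.default_rng(0)
d, r = 16, 2                        # hidden size and LoRA rank (r << d)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus the low-rank adaptation.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
# With B zero-initialized, the adapted model reproduces the frozen model
# exactly at the start of finetuning.
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero initialization of `B` is the standard LoRA choice: it guarantees the adapted model starts out identical to the base model, with only `2*d*r` trainable parameters instead of `d*d`.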
In this work, we highlight the following pitfall of prefilling: for batches containing high-varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length.
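The padding pitfall above is easy to quantify: when every prompt in a batch is padded to the batch's maximum length, the fraction of prefill positions spent on padding grows with the spread of prompt lengths. A minimal sketch (the function name and example lengths are illustrative, not from the paper):

```python
def padding_waste(prompt_lengths):
    """Fraction of prefill token positions that are padding when a batch
    is padded to its maximum prompt length."""
    max_len = max(prompt_lengths)
    total = max_len * len(prompt_lengths)  # positions actually computed
    useful = sum(prompt_lengths)           # positions carrying real tokens
    return (total - useful) / total

# A batch with one long prompt and several short ones: most of the
# prefill computation is spent on padding.
print(padding_waste([1024, 64, 128, 32]))  # ~0.695
# A batch of equal-length prompts wastes nothing.
print(padding_waste([256, 256, 256, 256]))  # 0.0
```

This is why high-varying prompt lengths are the problematic case: with uniform lengths the waste is zero, while a single long outlier forces every short prompt to be padded up to it.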
We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).