We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision.
LoReFT is a drop-in replacement for existing PEFTs and learns interventions that are 10x-50x more parameter-efficient than prior state-of-the-art PEFTs.
Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model.
Ranked #49 on Arithmetic Reasoning on GSM8K (using extra training data)
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.
We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef).
Ranked #1 on RGB Salient Object Detection on HRSOD (using extra training data)
Camouflaged Object Segmentation, Dichotomous Image Segmentation (+3 more tasks)
Prompt compression is an innovative method for efficiently condensing input prompts while preserving essential information.
In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation.
Ranked #2 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)
However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.
Ranked #1 on Video Classification on COIN
These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration.