This underscores the potential of DocRes across a broader spectrum of document image restoration tasks.
Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference because they break the sequential nature of the LLM decoding process and transform it into parallelizable computation.
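As a rough illustration of the idea, the sketch below treats greedy decoding as a fixed-point problem: all n future tokens start from a guess and are refined together, each update reading only the previous iterate. It is a minimal toy, not the method of the paper summarized above. `toy_next_token` is a hypothetical deterministic stand-in for an LLM's greedy next-token choice, and `greedy_decode` and `jacobi_decode` are illustrative names.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    # Hypothetical stand-in for "argmax over the LLM's next-token distribution".
    return (31 * sum(prefix) + 7 * len(prefix)) % 50


def greedy_decode(next_token: NextToken, prompt: Sequence[int], n: int) -> List[int]:
    # Standard sequential decoding: token i needs tokens 0..i-1 first.
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(
    next_token: NextToken, prompt: Sequence[int], n: int, init_token: int = 0
) -> Tuple[List[int], int]:
    # Start from an arbitrary guess for all n tokens.
    guess = [init_token] * n
    for iteration in range(1, n + 1):
        # Every position is recomputed from the previous iterate only, so these
        # n calls are independent; with a real model they could be batched
        # into a single forward pass.
        new_guess = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new_guess == guess:
            # Fixed point reached: the guess now equals the greedy output.
            return guess, iteration
        guess = new_guess
    # After n updates the first n positions are guaranteed to be correct.
    return guess, n


if __name__ == "__main__":
    tokens, iterations = jacobi_decode(toy_next_token, [1, 2, 3], 8)
    print(tokens == greedy_decode(toy_next_token, [1, 2, 3], 8), iterations)
```

After k updates the first k positions already match the greedy output, so the loop needs at most n updates. Any speedup depends on several positions settling in the same update, which this toy function does not attempt to demonstrate.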
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation.
The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait.
Visual language models (VLMs) have progressed rapidly with the recent success of large language models.
Ranked #24 on Visual Question Answering on MM-Vet
With the rapid advancement of Large Language Models (LLMs), significant progress has been made in multi-agent applications.
Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.
Ranked #6 on Visual Question Answering on MM-Vet
Large language models (LLMs) have fueled many intelligent agent tasks, such as web navigation, but most existing agents perform far from satisfactorily on real-world webpages due to three factors: (1) the versatility of actions on webpages, (2) HTML text exceeding model processing capacity, and (3) the complexity of decision-making due to the open-domain nature of the web.
We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).
A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks.