We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation.
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding.
We first construct the Feedback Collection, a new dataset consisting of 1K fine-grained score rubrics, 20K instructions, and 100K responses and accompanying language feedback, both generated by GPT-4.
A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks.
Visual language models (VLMs) have progressed rapidly with the recent success of large language models.
The continuous advancement of large language models (LLMs) has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance.
We derive a novel, provably robust, and closed-form Bayesian update rule for online filtering in state-space models in the presence of outliers and misspecified measurement models.
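The abstract does not spell out the update rule here, so for readers unfamiliar with the setting, the sketch below shows a conventional Kalman measurement update augmented with a Huber-style downweighting of outlying observations. This is an illustrative assumption about the general technique, not the paper's derived rule, and the function name `robust_kalman_update` and its parameters are hypothetical.

```python
# Minimal sketch of outlier-robust online filtering in a linear-Gaussian
# state-space model. NOT the paper's closed-form rule; it only illustrates
# the setting: downweight observations with large innovations.
import numpy as np

def robust_kalman_update(m, P, y, H, R, c=2.0):
    """One robustified measurement update.

    m, P : prior state mean (n,) and covariance (n, n)
    y    : observation (d,)
    H, R : observation matrix (d, n) and noise covariance (d, d)
    c    : threshold controlling how aggressively outliers are downweighted
    """
    S = H @ P @ H.T + R                        # predictive covariance of y
    r = y - H @ m                              # innovation (residual)
    d2 = float(r @ np.linalg.solve(S, r))      # squared Mahalanobis distance
    w = min(1.0, c / np.sqrt(d2)) if d2 > 0 else 1.0  # Huber-style weight

    # Downweighting the observation is equivalent to inflating its noise.
    S_w = H @ P @ H.T + R / w
    K = P @ H.T @ np.linalg.inv(S_w)           # gain under inflated noise
    return m + K @ r, P - K @ H @ P            # posterior mean, covariance
```

With `w = 1` this reduces to the standard Kalman update; as the innovation grows, `w` shrinks and the observation contributes less to the posterior.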
In response to these challenges, we propose MMBench, a novel multi-modality benchmark.
Current fundus image analysis models are predominantly built for specific tasks and rely on individual datasets.