MMR total

9 papers with code • 1 benchmark • 1 dataset

The sum of scores across all 11 distinct tasks in the Multi-Modal Reading (MMR) Benchmark, covering text, fonts, visual elements, bounding boxes, spatial relations, and grounding.
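As a minimal illustration of how such a total might be aggregated, the sketch below simply sums one numeric score per task. The task names, score values, and function name are placeholders for illustration only, not the official MMR task list, scoring code, or any real results.

# Minimal sketch: aggregate an "MMR total" by summing per-task scores.
# Task names and values below are hypothetical, not official MMR tasks or results.

from typing import Dict

def mmr_total(task_scores: Dict[str, float]) -> float:
    """Return the sum of the per-task scores (the total metric)."""
    return sum(task_scores.values())

if __name__ == "__main__":
    # Hypothetical per-task scores; a real evaluation would cover all 11 tasks.
    scores = {
        "font_recognition": 72.0,
        "text_grounding": 65.5,
        "bounding_box_qa": 58.0,
    }
    print(f"MMR total: {mmr_total(scores):.1f}")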

Most implemented papers

GPT-4 Technical Report

openai/evals Preprint 2023

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.

Visual Instruction Tuning

haotian-liu/LLaVA NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

opengvlab/internvl CVPR 2024

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

huggingface/obelics NeurIPS 2023

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks.

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

qwenlm/qwen-vl 24 Aug 2023

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

qi-zhangyang/gemini-vs-gpt4v 29 Sep 2023

We hope that this preliminary exploration will inspire future research on next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and a better understanding of multimodal foundation models.

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

yuliang-liu/monkey CVPR 2024

Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats.

MMR: Evaluating Reading Ability of Large Multimodal Models

llavar/MMR_Bench 26 Aug 2024

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

no code yet • 22 Apr 2024

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.