MMR total
9 papers with code • 1 benchmark • 1 dataset
The MMR total is the sum of the scores on the 11 distinct tasks of the Multi-Modal Reading (MMR) benchmark, which cover text, fonts, visual elements, bounding boxes, spatial relations, and grounding.
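As a concrete illustration, the metric reduces to a plain sum over per-task scores. The sketch below assumes each task yields a single numeric score; the task names and values are placeholders, not the benchmark's official labels.

```python
# Minimal sketch, assuming each of the 11 MMR tasks yields one numeric score.
# Task names and values are illustrative placeholders.
task_scores = {
    "text_recognition": 8.0,
    "font_identification": 7.5,
    "bounding_box_localization": 6.0,
    # ... one entry per remaining MMR task (11 in total) ...
}

# The MMR total is simply the sum of the per-task scores.
mmr_total = sum(task_scores.values())
print(f"MMR total: {mmr_total:.1f}")
```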
Most implemented papers
GPT-4 Technical Report
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks.
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
We hope that this preliminary exploration will inspire future research on next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and a better understanding of multimodal foundation models.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats.
MMR: Evaluating Reading Ability of Large Multimodal Models
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.