Image Comprehension

7 papers with code • 0 benchmarks • 1 dataset

Latest papers with no code

Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models

no code yet • 13 Feb 2024

We utilize user history as in-context user preferences to address the first challenge.
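Using interaction history as in-context preferences can be sketched as a prompt-construction step. The sketch below is illustrative only; the function name, item fields, and prompt wording are assumptions, not the Rec-GPT4V paper's actual template.

```python
# Hypothetical sketch: format a user's interaction history as in-context
# preferences for a recommendation prompt. Field names and wording are
# illustrative assumptions, not taken from the paper.
def build_preference_prompt(user_history, candidate_item):
    """Embed past interactions as in-context user preferences."""
    history_lines = "\n".join(
        f"- {item['title']} (rating: {item['rating']}/5)" for item in user_history
    )
    return (
        "The user previously interacted with these items:\n"
        f"{history_lines}\n\n"
        f"Given these preferences, would the user like '{candidate_item}'? "
        "Answer yes or no with a brief reason."
    )

history = [
    {"title": "Wireless earbuds", "rating": 5},
    {"title": "Phone case", "rating": 2},
]
prompt = build_preference_prompt(history, "Bluetooth speaker")
print(prompt)
```

In practice such a prompt would be paired with item images and sent to a vision-language model; the key idea is that preferences are supplied in context rather than learned parameters.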

Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

no code yet • 29 Jan 2024

Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99% accuracy on these questions.

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

no code yet • 18 Jan 2024

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio.

Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

no code yet • 16 Jan 2024

GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy.

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

no code yet • 5 Jan 2024

In the pursuit of Artificial General Intelligence (AGI), a critical task for large multimodal models involves interpreting and processing information from multiple image inputs.
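A contrastive chain-of-thought prompt for multiple images can be sketched as a template that asks the model to enumerate similarities and differences before answering. The structure below is an assumption for illustration, not the paper's exact prompt.

```python
# Illustrative sketch of a contrastive chain-of-thought style prompt for
# multiple image inputs. The step structure is an assumption, not the
# CoCoT paper's verbatim template.
def cocot_prompt(question, num_images):
    return (
        f"You are given {num_images} images.\n"
        "Step 1: List the key similarities among the images.\n"
        "Step 2: List the key differences among the images.\n"
        f"Step 3: Using these observations, answer: {question}"
    )

p = cocot_prompt("Which image shows a muffin?", 4)
print(p)
```

The contrastive steps force the model to ground its answer in cross-image comparisons rather than attending to a single image.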

GeoLocator: a location-integrated large multimodal model for inferring geo-privacy

no code yet • 21 Nov 2023

Geographic privacy, or geo-privacy, refers to keeping one's geographic location private, especially by restricting the geographical data maintained by personal electronic devices.

What Large Language Models Bring to Text-rich VQA?

no code yet • 13 Nov 2023

This pipeline achieved superior performance compared to the majority of existing Multimodal Large Language Models (MLLMs) on four text-rich VQA datasets.

On the Performance of Multimodal Language Models

no code yet • 4 Oct 2023

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks.

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code yet • 15 Sep 2023

To this end, we start by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Looking Through Glass: Knowledge Discovery from Materials Science Literature using Natural Language Processing

no code yet • 5 Jan 2021

Most of the knowledge in materials science literature is in the form of unstructured data such as text and images.