Image Comprehension

7 papers with code • 0 benchmarks • 1 dataset

Latest papers with no code

Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models

no code yet • 13 Feb 2024

We utilize user history as in-context user preferences to address the first challenge.
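Using interaction history as in-context preferences can be sketched as a prompt-construction step. The sketch below is illustrative only; the function name, item fields, and prompt wording are assumptions, not the Rec-GPT4V paper's actual template.

```python
# Hypothetical sketch: format a user's interaction history as in-context
# preferences for a recommendation prompt. Field names and wording are
# illustrative assumptions, not taken from the paper.
def build_preference_prompt(user_history, candidate_item):
    """Embed past interactions as in-context user preferences."""
    history_lines = "\n".join(
        f"- {item['title']} (rating: {item['rating']}/5)" for item in user_history
    )
    return (
        "The user previously interacted with these items:\n"
        f"{history_lines}\n\n"
        f"Given these preferences, would the user like '{candidate_item}'? "
        "Answer yes or no with a brief reason."
    )

history = [
    {"title": "Wireless earbuds", "rating": 5},
    {"title": "Phone case", "rating": 2},
]
prompt = build_preference_prompt(history, "Bluetooth speaker")
print(prompt)
```

In practice such a prompt would be paired with item images and sent to a vision-language model; the key idea is that preferences are supplied in context rather than learned parameters.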

Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

no code yet • 29 Jan 2024

Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Large Vision Language Models (LVLMs) tested, even though humans can attain approximately 99% accuracy on these questions.

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition

no code yet • 18 Jan 2024

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio.

Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

no code yet • 16 Jan 2024

GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy.

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

no code yet • 5 Jan 2024

In the pursuit of Artificial General Intelligence (AGI), a critical task for large multimodal models involves interpreting and processing information from multiple image inputs.
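A contrastive chain-of-thought prompt for multiple images can be sketched as a template that asks the model to enumerate similarities and differences before answering. The structure below is an assumption for illustration, not the paper's exact prompt.

```python
# Illustrative sketch of a contrastive chain-of-thought style prompt for
# multiple image inputs. The step structure is an assumption, not the
# CoCoT paper's verbatim template.
def cocot_prompt(question, num_images):
    return (
        f"You are given {num_images} images.\n"
        "Step 1: List the key similarities among the images.\n"
        "Step 2: List the key differences among the images.\n"
        f"Step 3: Using these observations, answer: {question}"
    )

p = cocot_prompt("Which image shows a muffin?", 4)
print(p)
```

The contrastive steps force the model to ground its answer in cross-image comparisons rather than attending to a single image.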

GeoLocator: a location-integrated large multimodal model for inferring geo-privacy

no code yet • 21 Nov 2023

Geographic privacy, or geo-privacy, refers to keeping one's geographic location private, especially by restricting the geographical data maintained by personal electronic devices.

What Large Language Models Bring to Text-rich VQA?

no code yet • 13 Nov 2023

This pipeline achieved superior performance compared to the majority of existing Multimodal Large Language Models (MLLMs) on four text-rich VQA datasets.

On the Performance of Multimodal Language Models

no code yet • 4 Oct 2023

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks.

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

no code yet • 15 Sep 2023

To this end, we start by importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp.

Looking Through Glass: Knowledge Discovery from Materials Science Literature using Natural Language Processing

no code yet • 5 Jan 2021

Most of the knowledge in materials science literature is in the form of unstructured data such as text and images.