Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
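As a minimal sketch of the setup (the encoders, fusion layer, and three-way label scheme follow the common SNLI-VE convention and are assumptions, not any specific published model): a VE model scores an (image, sentence) pair as entailment, neutral, or contradiction.

```python
import torch
import torch.nn as nn

class VEClassifier(nn.Module):
    """Illustrative visual-entailment head: fuse an image embedding
    (the premise) with a text embedding (the hypothesis) and predict
    entailment / neutral / contradiction."""

    def __init__(self, img_dim=768, txt_dim=768, hidden=512, n_labels=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_labels),
        )

    def forward(self, img_emb, txt_emb):
        # img_emb: (batch, img_dim) from any image encoder (assumption)
        # txt_emb: (batch, txt_dim) from any text encoder (assumption)
        return self.fuse(torch.cat([img_emb, txt_emb], dim=-1))

model = VEClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 768))
labels = logits.argmax(-1)  # label order here is a convention of this sketch
```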
Latest papers
MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness.
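For intuition, here is a minimal mixture-of-prompt-experts routing step (all dimensions, the routing input, and the soft-weighting scheme are illustrative assumptions, not the paper's exact design): a router scores K learned prompt experts per instance, and their weighted combination becomes the prompt prepended to a frozen backbone's input.

```python
import torch
import torch.nn as nn

class MoPEPrompt(nn.Module):
    """Sketch of a mixture of prompt experts: K learned prompts are
    combined per instance via a softmax router conditioned on a
    routing feature (e.g. an embedding of the paired modality)."""

    def __init__(self, n_experts=4, prompt_len=8, dim=768, route_dim=768):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(route_dim, n_experts)

    def forward(self, route_feat):
        # route_feat: (batch, route_dim) -> expert weights (batch, K)
        weights = self.router(route_feat).softmax(dim=-1)
        # Weighted sum over experts -> per-instance prompt (batch, prompt_len, dim)
        return torch.einsum('bk,kld->bld', weights, self.experts)

prompts = MoPEPrompt()(torch.randn(2, 768))
print(prompts.shape)  # torch.Size([2, 8, 768])
```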
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features serve as the node features and the attention matrix serves as the graph adjacency matrix.
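In code, that correspondence reduces to a single step (a plain-Laplacian sketch; the paper's p-Laplacian edge reweighting is omitted here): treating the attention matrix as a row-stochastic adjacency, each token aggregates the value features of its neighbours.

```python
import torch

def attention_message_passing(attn, values):
    """One message-passing step on the attention graph: the attention
    matrix acts as the adjacency (rows sum to 1 after softmax), so
    each token aggregates its neighbours' value features."""
    # attn:   (batch, tokens, tokens)
    # values: (batch, tokens, dim)
    return torch.bmm(attn, values)

attn = torch.softmax(torch.randn(1, 5, 5), dim=-1)
out = attention_message_passing(attn, torch.randn(1, 5, 16))
print(out.shape)  # torch.Size([1, 5, 16])
```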
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning
This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.
Good Questions Help Zero-Shot Image Reasoning
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
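The question-first idea can be sketched as a simple prompting loop (the helper callables and prompt wording below are hypothetical stand-ins, not QVix's actual prompts): an LLM drafts exploratory questions, the LVLM answers them against the image, and the collected clues condition the final answer.

```python
def question_guided_answer(image, hypothesis, llm, lvlm, n_questions=3):
    """Illustrative question-guided reasoning loop. `llm` and `lvlm`
    are hypothetical callables: llm(prompt) -> str, lvlm(image, prompt) -> str."""
    questions = llm(f"Write {n_questions} short questions that probe the "
                    f"visual details relevant to: '{hypothesis}'").splitlines()
    clues = [f"Q: {q} A: {lvlm(image, q)}" for q in questions if q.strip()]
    prompt = ("Given these observations:\n" + "\n".join(clues) +
              f"\nDoes the image entail: '{hypothesis}'? "
              "Answer entailment, neutral, or contradiction.")
    return lvlm(image, prompt)
```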
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora.
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
We propose to solve the task through collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates a visual elaboration of the linguistic metaphor, capturing its implicit meaning and relevant objects, which then serves as input to diffusion-based text-to-image models. Using a human-AI collaboration framework, in which humans interact with both the LLM and the top-performing diffusion model, we create a high-quality dataset of 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations.
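The two-stage pipeline can be sketched as follows (the callables and the chain-of-thought prompt wording are hypothetical stand-ins, not the paper's exact prompts):

```python
def visual_metaphor(linguistic_metaphor, llm, text_to_image):
    """Sketch of the LLM-to-diffusion collaboration. `llm` and
    `text_to_image` are hypothetical callables: llm(prompt) -> str,
    text_to_image(prompt) -> image."""
    # Stage 1: the LLM unpacks the metaphor's implicit meaning and
    # concrete objects into a visual elaboration.
    elaboration = llm(
        "Let's think step by step. What does the metaphor "
        f"'{linguistic_metaphor}' implicitly mean, and which concrete "
        "objects would depict it? Write one visual scene description."
    )
    # Stage 2: the elaboration becomes the diffusion model's prompt.
    return text_to_image(elaboration)
```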
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models.
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images.
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Multimodal semantic understanding often has to deal with uncertainty: a given message may refer to multiple targets.
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks.