Visual Entailment

27 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is given by an image rather than a natural language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text, typically as a three-way decision between entailment, neutral, and contradiction.
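As a minimal illustration of the task setup (not drawn from any specific paper below), one can encode the image premise and the text hypothesis separately and train a small three-way head on the fused features. The CLIP checkpoint and head design in this sketch are assumptions made purely for illustration.

    # Hedged sketch of a visual-entailment classifier on frozen CLIP features.
    # Checkpoint and head are illustrative assumptions, not a reference setup.
    import torch
    import torch.nn as nn
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    class VEHead(nn.Module):
        def __init__(self, dim=512, num_labels=3):  # entailment / neutral / contradiction
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_labels)
            )

        def forward(self, image_embeds, text_embeds):
            # Fuse the image premise and the text hypothesis by concatenation.
            return self.mlp(torch.cat([image_embeds, text_embeds], dim=-1))

    head = VEHead()

    def predict(image, hypothesis):
        inputs = processor(text=[hypothesis], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = clip(**inputs)
        return head(out.image_embeds, out.text_embeds).argmax(dim=-1)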

Libraries

Use these libraries to find Visual Entailment models and implementations

MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts

songrise/mope 14 Mar 2024

Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness.
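The excerpt leaves the routing mechanism implicit; below is a hedged sketch of one plausible mixture-of-prompt-experts layer, with instance-wise gating over a small pool of learned prompts. All names and dimensions are illustrative assumptions, not the songrise/mope implementation.

    # Hedged sketch of a mixture-of-prompt-experts (MoPE) layer: a gating network
    # scores a pool of learned prompt experts per instance, and the soft-weighted
    # prompt is prepended to the token sequence.
    import torch
    import torch.nn as nn

    class MoPELayer(nn.Module):
        def __init__(self, dim=768, num_experts=4, prompt_len=8):
            super().__init__()
            # Each expert is a learned prompt of shape (prompt_len, dim).
            self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, dim) * 0.02)
            self.gate = nn.Linear(dim, num_experts)  # routes on a pooled feature

        def forward(self, tokens, routing_feature):
            # tokens: (batch, seq, dim); routing_feature: (batch, dim),
            # e.g. a pooled embedding from the other modality.
            weights = self.gate(routing_feature).softmax(dim=-1)         # (batch, experts)
            prompt = torch.einsum("be,eld->bld", weights, self.experts)  # (batch, prompt_len, dim)
            return torch.cat([prompt, tokens], dim=1)                    # prepend the mixed prompt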

p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models

wuhy68/p-adapter 17 Dec 2023

In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features and attention matrix constitute the node features and the graph adjacency matrix, respectively.
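Read literally, the sentence above describes one message-passing step in which the attention matrix acts as the graph adjacency matrix and the projected value features as node features. The sketch below follows that reading with hypothetical names; it is not the wuhy68/p-adapter code.

    # Hedged sketch: treat the attention matrix A as an adjacency matrix and the
    # projected value features V as node features, then apply one message-passing
    # step followed by a lightweight bottleneck adapter with a residual connection.
    import torch
    import torch.nn as nn

    class GraphAdapter(nn.Module):
        def __init__(self, dim=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.ReLU()

        def forward(self, attn, values):
            # attn: (batch, seq, seq) row-normalized attention = adjacency matrix
            # values: (batch, seq, dim) projected value features = node features
            messages = torch.bmm(attn, values)  # aggregate features from attended neighbors
            return values + self.up(self.act(self.down(messages)))  # residual adapter update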

Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

huggingface/transformers 15 Dec 2023

This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.

Good Questions Help Zero-Shot Image Reasoning

kai-wen-yang/qvix 4 Dec 2023

QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

yasminekaroui/clicotea 29 Jun 2023

Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora.

I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

tuhinjubcse/visualmetaphors 24 May 2023

We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations.
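The pipeline described above can be sketched as two stages: an LLM elaboration step (represented here by a hypothetical elaborate_metaphor placeholder, since the exact prompt and model call are not reproduced from the paper) followed by a text-to-image diffusion call; the Stable Diffusion checkpoint is likewise an assumption.

    # Hedged sketch of the two-stage pipeline: an LLM turns a linguistic metaphor
    # into an explicit visual elaboration, which becomes the prompt for a
    # text-to-image diffusion model.
    from diffusers import StableDiffusionPipeline

    def elaborate_metaphor(metaphor: str) -> str:
        # Hypothetical placeholder for the chain-of-thought LLM step: it should
        # spell out the implicit meaning and the concrete objects to depict.
        raise NotImplementedError("plug in your LLM call here")

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    visual_prompt = elaborate_metaphor("My bedroom is a pigsty")  # example metaphor
    image = pipe(visual_prompt).images[0]
    image.save("visual_metaphor.png")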

Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations

ofa-x/ofa-x 8 Dec 2022

Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models.

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

allenai/close ICCV 2023

We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images.
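One way text-only training can transfer to image inputs, and, to my understanding, the intuition behind this line of work, is to exploit an aligned image-text embedding space: train the task head on text embeddings of captions and swap in image embeddings at evaluation time. The checkpoint and helper below are assumptions, not the allenai/close implementation.

    # Hedged sketch of training with text only and evaluating with images by
    # relying on CLIP's shared embedding space.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def premise_embedding(caption=None, image=None):
        # Training: pass a caption describing the scene. Evaluation: pass an image.
        with torch.no_grad():
            if image is not None:
                inputs = processor(images=image, return_tensors="pt")
                return clip.get_image_features(**inputs)
            inputs = processor(text=[caption], return_tensors="pt", padding=True)
            return clip.get_text_features(**inputs)

    # Train the downstream task head on text embeddings only...
    train_feat = premise_embedding(caption="a dog runs across a grassy field")
    # ...then feed image embeddings from the same space to that head at test time:
    # eval_feat = premise_embedding(image=pil_image)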

17 Nov 2022

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

iigroup/map CVPR 2023

Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets.

11 Oct 2022

Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

mshukor/vicha 29 Aug 2022

Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks.
