Visual Entailment

27 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
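In practice the task reduces to classifying an (image, sentence) pair. The sketch below is a minimal, hypothetical illustration of such a classifier: the CLIP backbone, the linear head, and the SNLI-VE-style three-way label set (entailment / neutral / contradiction) are assumptions made for this example, not the method of any paper listed on this page, and the head would need to be fine-tuned on a VE dataset before its predictions are meaningful.

```python
# Minimal sketch of a Visual Entailment classifier (illustrative only).
# Assumptions: CLIP image/text encoders, a linear head over the concatenated
# embeddings, and the SNLI-VE label set. The head is untrained here.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ["entailment", "neutral", "contradiction"]

class VEClassifier(nn.Module):
    def __init__(self, backbone="openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(backbone)
        dim = self.clip.config.projection_dim
        # Classification head over concatenated image/text features;
        # fine-tune on a VE dataset (e.g. SNLI-VE) before use.
        self.head = nn.Linear(2 * dim, len(LABELS))

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        return self.head(torch.cat([img, txt], dim=-1))

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = VEClassifier()

image = Image.open("example.jpg")                  # premise: an image
hypothesis = "Two dogs are playing in the snow."   # hypothesis: a sentence
inputs = processor(text=[hypothesis], images=image,
                   return_tensors="pt", padding=True)
logits = model(inputs["pixel_values"], inputs["input_ids"],
               inputs["attention_mask"])
print(LABELS[logits.argmax(-1).item()])
```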

Libraries

Use these libraries to find Visual Entailment models and implementations

Most implemented papers

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

necla-ml/SNLI-VE 20 Jan 2019

We evaluate various existing VQA baselines and build a model, the Explainable Visual Entailment (EVE) system, to address the VE task.

Check It Again: Progressive Visual Question Answering via Visual Entailment

PhoebusSi/SAR 8 Jun 2021

Moreover, existing methods only explore the interaction between image and question, ignoring the semantics of candidate answers.

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

fawazsammani/nlxgpt CVPR 2022

Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. the task model), e.g., a VQA model, via a language model (a.k.a. the explanation model), e.g., GPT.

Fine-Grained Visual Entailment

skrighyz/fgve 29 Mar 2022

In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.

MixGen: A New Multi-Modal Data Augmentation

amazon-research/mix-generation 16 Jun 2022

Data augmentation is a necessity to enhance data efficiency in deep learning.

Prompt Tuning for Generative Multimodal Pretrained Models

ofa-sys/ofa 4 Aug 2022

Prompt tuning has become a new paradigm for model tuning and it has demonstrated success in natural language pretraining and even vision pretraining.

Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment

mshukor/vicha 29 Aug 2022

Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks.

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

iigroup/map CVPR 2023

Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets.