Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is defined by an image, rather than by a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
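As a minimal sketch of the task setup (not the architecture of any listed paper), a VE model encodes the image premise and the text hypothesis separately, fuses the two representations, and predicts a three-way label (entailment / neutral / contradiction). The backbone choices, fusion scheme, and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import AutoModel, AutoTokenizer

class VEClassifier(nn.Module):
    """Toy visual entailment classifier: image premise + text hypothesis -> 3-way label."""
    def __init__(self, hidden_dim=512, num_labels=3):
        super().__init__()
        # Image premise encoder: ResNet-18 backbone with the final fc layer removed.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.image_encoder = backbone  # outputs 512-d features
        # Text hypothesis encoder: any small BERT-style model works for illustration.
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        # Fuse both modalities and predict entailment / neutral / contradiction.
        self.classifier = nn.Sequential(
            nn.Linear(512 + self.text_encoder.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.image_encoder(images)                        # (B, 512)
        txt_out = self.text_encoder(input_ids=input_ids,
                                    attention_mask=attention_mask)
        txt_feat = txt_out.last_hidden_state[:, 0]                   # [CLS] token
        return self.classifier(torch.cat([img_feat, txt_feat], dim=-1))

# Usage: score one image-sentence pair.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = VEClassifier()
enc = tokenizer(["two dogs are playing in the snow"], return_tensors="pt")
images = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
logits = model(images, enc["input_ids"], enc["attention_mask"])
label = ["entailment", "neutral", "contradiction"][logits.argmax(-1).item()]
```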
Libraries
Use these libraries to find Visual Entailment models and implementations.

Most implemented papers
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
We evaluate various existing VQA baselines and build an Explainable Visual Entailment (EVE) system to address the VE task.
Check It Again: Progressive Visual Question Answering via Visual Entailment
Moreover, existing methods only explore the interaction between the image and the question, ignoring the semantics of candidate answers.
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a., task model), e.g., a VQA model, via a language model (a.k.a., explanation model), e.g., GPT.
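A minimal illustration of that two-model setup: a language model (GPT-2 here) is prompted with a task model's question and predicted answer and asked to continue with a free-text rationale. The prompt format, the hypothetical VQA output, and the model choice are assumptions for illustration, not the NLX-GPT method itself.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Output of a hypothetical VQA task model for one image-question pair.
question = "What is the man holding?"
predicted_answer = "a surfboard"

# The explanation model continues the prompt with a natural-language rationale.
prompt = f"question: {question} answer: {predicted_answer} because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
explanation = tokenizer.decode(output_ids[0][input_ids.shape[1]:],
                               skip_special_tokens=True)
print(explanation)
```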
Fine-Grained Visual Entailment
In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
MixGen: A New Multi-Modal Data Augmentation
Data augmentation is a necessity to enhance data efficiency in deep learning.
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
The proposed method consists of a Chunk-aware Semantic Interactor (CSI), a relation inferrer, and a Lexical Constraint-aware Generator (LeCG).
Prompt Tuning for Generative Multimodal Pretrained Models
Prompt tuning has become a new paradigm for model tuning and it has demonstrated success in natural language pretraining and even vision pretraining.
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks.
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets.