Visual Entailment
13 papers with code • 2 benchmarks • 2 datasets
Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
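As a rough illustration of the setup (not tied to any particular paper below), a VE instance pairs an image premise with a textual hypothesis and a label from the SNLI-VE set {entailment, neutral, contradiction}. A minimal classifier sketch, assuming generic image and text encoders that produce fixed-size features:

```python
# Minimal sketch of a VE classifier head (hypothetical class and dimensions;
# any image/text encoder producing fixed-size features could stand in).
import torch
import torch.nn as nn

LABELS = ["entailment", "neutral", "contradiction"]  # SNLI-VE label set

class VEClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(LABELS)),
        )

    def forward(self, img_feat, txt_feat):
        # img_feat: premise image features, txt_feat: hypothesis sentence features
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

# Dummy features standing in for real encoder outputs.
logits = VEClassifier()(torch.randn(1, 2048), torch.randn(1, 768))
print(LABELS[logits.argmax(-1).item()])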
Most implemented papers
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
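A minimal sketch of the idea of conditional masking, assuming toy tensor shapes and a 15% masking rate (illustrative only, not UNITER's actual code):

```python
# Mask only one modality per training step; the other stays fully observed.
import torch

def conditional_mask(text_ids, region_feats, mask_text: bool, p=0.15, mask_id=103):
    text_ids, region_feats = text_ids.clone(), region_feats.clone()
    if mask_text:
        m = torch.rand(text_ids.shape) < p
        text_ids[m] = mask_id                      # masked language modeling
    else:
        m = torch.rand(region_feats.shape[:2]) < p
        region_feats[m] = 0.0                      # masked region modeling
    return text_ids, region_feats

# Joint random masking would instead draw both masks in the same step, so the
# model may have to predict a word whose grounding region is also hidden.
ids, regs = conditional_mask(torch.randint(0, 30000, (2, 12)),
                             torch.randn(2, 36, 2048), mask_text=True)
```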
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
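A minimal sketch of adversarial training in the embedding space, with a single PGD-style perturbation step; the model, loss, and step sizes here are placeholders, not VILLA's actual recipe:

```python
# One adversarial perturbation step on input embeddings (illustrative only).
import torch
import torch.nn.functional as F

def adversarial_step(model, embeds, labels, eps=1e-2, alpha=1e-3):
    delta = torch.zeros_like(embeds, requires_grad=True)
    loss = F.cross_entropy(model(embeds + delta), labels)
    loss.backward()
    with torch.no_grad():
        delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
    # Train on the perturbed input in addition to the clean one.
    return F.cross_entropy(model(embeds + delta), labels)

model = torch.nn.Linear(768, 3)   # stand-in for a V+L model with a task head
adv_loss = adversarial_step(model, torch.randn(4, 768), torch.randint(0, 3, (4,)))
```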
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
How Much Can CLIP Benefit Vision-and-Language Tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
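A sketch of reusing CLIP's encoders as the backbone for a downstream V&L head, via the Hugging Face transformers wrappers; the checkpoint name and toy inputs are illustrative:

```python
# Extract CLIP image/text features and hand them to a task-specific head.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                 # placeholder premise image
inputs = processor(text=["two dogs are playing"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# A task-specific head (e.g., a 3-way VE classifier) would consume the
# concatenated features rather than CLIP's contrastive similarity score.
fused = torch.cat([img_feat, txt_feat], dim=-1)      # shape: (1, 1024)
```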
Visual Entailment Task for Visually-Grounded Language Learning
We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence.
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task.
Check It Again: Progressive Visual Question Answering via Visual Entailment
Moreover, they only explore the interaction between the image and the question, ignoring the semantics of candidate answers.
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
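Under this paradigm, VE can be cast as plain sequence-to-sequence generation. A sketch of the framing below uses an illustrative prompt and label wording, not OFA's exact template:

```python
# Cast a VE instance as text-to-text: instruction (paired with the image) in,
# label word out. The prompt string and label mapping are assumptions.
def ve_to_seq2seq(hypothesis: str, label: str):
    source = f'can the image imply the text "{hypothesis}"?'
    target = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}[label]
    return source, target

src, tgt = ve_to_seq2seq("an animal is outdoors", "entailment")
```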
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. task model), e.g., a VQA model, via a language model (a.k.a. explanation model), e.g., GPT.