Visual Entailment
23 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
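As a rough illustration of the task format (a sketch only, not any particular library's API; the three-way label set follows the SNLI-VE formulation, and all field names and paths are illustrative placeholders):

```python
# Minimal sketch of the Visual Entailment setup: the premise is an image,
# the hypothesis is a sentence, and the model predicts a three-way label.
from dataclasses import dataclass

LABELS = ["entailment", "neutral", "contradiction"]

@dataclass
class VEExample:
    image_path: str   # the premise is an image ...
    hypothesis: str   # ... and the hypothesis is a natural-language sentence
    label: str        # gold label, one of LABELS

def predict(model, example: VEExample) -> str:
    """Score the hypothesis against the image premise and return a label."""
    logits = model(example.image_path, example.hypothesis)  # assumed shape: (3,)
    return LABELS[max(range(len(LABELS)), key=lambda i: logits[i])]

# Illustrative example record (placeholder image path).
example = VEExample(
    image_path="images/0001.jpg",
    hypothesis="Two women are sitting on a bench.",
    label="entailment",
)
```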
Most implemented papers
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
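The conditional-masking idea can be sketched roughly as follows (an assumed reading of the one-line description above, not UNITER's actual implementation; mask token id and masking rate are placeholders):

```python
# Conditional masking sketch: for each example only one modality is corrupted,
# so the masked-modality prediction is conditioned on the fully observed other
# modality. Joint random masking would instead corrupt both at once.
import random

MASK_TOKEN_ID = 103   # placeholder [MASK] id
MASK_RATE = 0.15      # placeholder masking rate

def conditional_mask(text_tokens, region_features):
    """Mask either the text tokens or the image regions, never both."""
    text = list(text_tokens)
    regions = [list(r) for r in region_features]

    if random.random() < 0.5:
        # Masked language modeling: corrupt text, keep every image region intact.
        for i in range(len(text)):
            if random.random() < MASK_RATE:
                text[i] = MASK_TOKEN_ID
    else:
        # Masked region modeling: zero out regions, keep the full sentence.
        for i in range(len(regions)):
            if random.random() < MASK_RATE:
                regions[i] = [0.0] * len(regions[i])

    return text, regions
```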
How Much Can CLIP Benefit Vision-and-Language Tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
Distilled Dual-Encoder Model for Vision-Language Understanding
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.
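A generic sketch of attention-map distillation, assuming the teacher's cross-modal attention distributions supervise the student's via a KL-divergence term; this illustrates the general idea only and is not the paper's exact objective:

```python
# Attention distillation sketch: penalize the divergence between student and
# teacher attention probability maps (shapes and reduction are assumptions).
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) over attention maps.

    Both inputs are attention probabilities of shape (batch, heads, query, key).
    """
    student_log = torch.log(student_attn + eps)   # kl_div expects log-probs as input
    return F.kl_div(student_log, teacher_attn, reduction="batchmean")
```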
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
Visual Spatial Reasoning
Spatial relations are a basic part of human cognition.
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
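A rough sketch of how the two losses described above could be combined, assuming a CLIP-style InfoNCE contrastive term and token-level cross-entropy for captioning (tensor shapes, temperature, and loss weighting are illustrative, not CoCa's exact recipe):

```python
# Combined contrastive + captioning loss sketch in PyTorch.
import torch
import torch.nn.functional as F

def contrastive_captioning_loss(image_emb, text_emb, caption_logits, caption_targets,
                                temperature=0.07, caption_weight=1.0):
    # Contrastive loss between unimodal image and text embeddings.
    image_emb = F.normalize(image_emb, dim=-1)            # (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)              # (batch, dim)
    logits = image_emb @ text_emb.t() / temperature       # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Captioning loss on the multimodal decoder's autoregressive token predictions.
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),  # (batch*seq, vocab)
        caption_targets.reshape(-1),                           # (batch*seq,)
    )
    return contrastive + caption_weight * captioning
```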
Visual Entailment Task for Visually-Grounded Language Learning
We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence.
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
We evaluate various existing VQA baselines and build an Explainable Visual Entailment (EVE) system to address the VE task.