Visual Entailment

23 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
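Concretely, each example pairs an image premise with a text hypothesis and a three-way label (entailment, neutral, contradiction), as in SNLI-VE. The sketch below only illustrates the data format; `VEExample` and `predict_label` are hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass

# Minimal sketch of the Visual Entailment data format (SNLI-VE style).
# `VEExample` and `predict_label` are hypothetical names used for illustration.

LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    image_path: str   # the premise is an image, not a sentence
    hypothesis: str   # natural-language hypothesis to check against the image
    label: str        # one of LABELS (gold annotation)

def predict_label(model, example: VEExample) -> str:
    """Score the hypothesis against the image premise and return the argmax label."""
    logits = model(example.image_path, example.hypothesis)  # assumed shape: (3,)
    return LABELS[int(logits.argmax())]
```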


Most implemented papers

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
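A rough sketch of the difference between joint random masking and UNITER-style conditional masking, assuming token-level masks over text tokens and image regions; this illustrates the masking schedule only, not UNITER's actual code.

```python
import torch

def joint_random_masking(num_text, num_regions, p=0.15):
    # Both modalities may be masked in the same pre-training step.
    text_mask = torch.rand(num_text) < p
    region_mask = torch.rand(num_regions) < p
    return text_mask, region_mask

def conditional_masking(num_text, num_regions, p=0.15, mask_text=True):
    # UNITER-style: mask only one modality per step, so the masked modality
    # is predicted from a *full* observation of the other modality.
    if mask_text:
        text_mask = torch.rand(num_text) < p
        region_mask = torch.zeros(num_regions, dtype=torch.bool)
    else:
        text_mask = torch.zeros(num_text, dtype=torch.bool)
        region_mask = torch.rand(num_regions) < p
    return text_mask, region_mask
```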

How Much Can CLIP Benefit Vision-and-Language Tasks?

clip-vil/CLIP-ViL 13 Jul 2021

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
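CLIP-ViL's core idea is to replace detector-based region features with features from CLIP's pre-trained visual encoder. Below is a minimal, hedged sketch of extracting such features with the Hugging Face `transformers` CLIP classes; how the features are then fed into a downstream V&L model is model-specific and not shown.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Sketch: use CLIP's visual encoder as the image backbone.
# The checkpoint name and downstream usage are assumptions for illustration.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_encoder(**inputs)

# Patch-level grid features (one vector per patch plus the CLS token);
# a V&L model can consume these in place of detector region features.
grid_features = outputs.last_hidden_state  # (1, num_patches + 1, hidden_size)
```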

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
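A rough sketch of embedding-space adversarial training in the spirit of VILLA: perturb the input embeddings to increase the task loss, then train on the perturbed view. The paper's KL smoothness term and "free" multi-step accumulation are omitted; names and hyperparameters here are illustrative.

```python
import torch

def adversarial_step(model, embeds, labels, loss_fn, eps=1e-2, step_size=1e-3):
    # Small random perturbation on the input embeddings.
    delta = torch.zeros_like(embeds).uniform_(-eps, eps).requires_grad_(True)

    # Inner step: move the perturbation in the direction that increases the loss.
    loss = loss_fn(model(embeds + delta), labels)
    grad, = torch.autograd.grad(loss, delta)
    delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach()

    # Outer step: train on clean and adversarial views of the same example.
    adv_loss = loss_fn(model(embeds + delta), labels)
    clean_loss = loss_fn(model(embeds), labels)
    return clean_loss + adv_loss
```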

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

researchmm/soho CVPR 2021

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.

Distilled Dual-Encoder Model for Vision-Language Understanding

kugwzk/distilled-dualencoder 16 Dec 2021

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.
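A minimal sketch of the idea: align the student dual-encoder's attention distributions with those of a fusion-encoder teacher via a KL term, alongside the usual soft-label distillation. Tensor shapes and how attention maps are extracted are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn):
    """KL divergence between student and teacher attention distributions.

    Both tensors are assumed to be attention probabilities of shape
    (batch, heads, query_len, key_len), already normalized over the last dim.
    """
    return F.kl_div(
        torch.log(student_attn.clamp_min(1e-8)),
        teacher_attn,
        reduction="batchmean",
    )
```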

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
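In a unified sequence-to-sequence setup, Visual Entailment can be cast as generating a label word conditioned on the image and an instruction containing the hypothesis. The prompt template and `generate` interface below are illustrative assumptions, not OFA's exact API.

```python
# Sketch: Visual Entailment as sequence-to-sequence generation.
# The prompt wording and the model interface are illustrative assumptions.

LABEL_WORDS = {"yes": "entailment", "maybe": "neutral", "no": "contradiction"}

def ve_as_seq2seq(model, image, hypothesis: str) -> str:
    prompt = f'does the image describe "{hypothesis}"?'
    answer = model.generate(image=image, text=prompt)  # e.g. "yes" / "no" / "maybe"
    return LABEL_WORDS.get(answer.strip(), "neutral")
```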

Visual Spatial Reasoning

cambridgeltl/visual-spatial-reasoning 30 Apr 2022

Spatial relations are a basic part of human cognition.

CoCa: Contrastive Captioners are Image-Text Foundation Models

lucidrains/CoCa-pytorch 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
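The training objective sketched below combines those two terms: an InfoNCE-style contrastive loss on pooled unimodal image/text embeddings plus a token-level cross-entropy captioning loss on the decoder outputs. The loss weights and tensor shapes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_tokens,
              contrastive_weight=1.0, caption_weight=2.0, temperature=0.07):
    # Contrastive term: match each image with its paired text (both directions).
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Captioning term: autoregressive token prediction from the multimodal decoder.
    caption = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())

    return contrastive_weight * contrastive + caption_weight * caption
```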

Visual Entailment Task for Visually-Grounded Language Learning

necla-ml/SNLI-VE 26 Nov 2018

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is defined by an image rather than a natural language sentence.

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

necla-ml/SNLI-VE 20 Jan 2019

We evaluate various existing VQA baselines and build an Explainable Visual Entailment (EVE) system to address the VE task.