Visual Entailment

13 papers with code • 2 benchmarks • 2 datasets

Visual Entailment (VE) is a task consisting of image-sentence pairs in which the premise is defined by an image, rather than by a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
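
To make the setup concrete, the sketch below shows the shape of a single VE example and a stub prediction interface. The dataclass, function name, and file path are illustrative, not taken from any of the codebases listed here; in the SNLI-VE dataset the premise images come from Flickr30k and each hypothesis carries one of three labels: entailment, neutral, or contradiction.

```python
from dataclasses import dataclass

# The three labels used in SNLI-VE: the image premise entails, is neutral
# toward, or contradicts the text hypothesis.
LABELS = ("entailment", "neutral", "contradiction")

@dataclass
class VEExample:
    image_path: str   # premise: an image instead of a sentence
    hypothesis: str   # natural-language statement to check against the image
    label: str        # one of LABELS (gold annotation)

def predict_entailment(image_path: str, hypothesis: str) -> str:
    """Hypothetical interface: a VE model maps (image, text) to a label."""
    raise NotImplementedError("plug in a trained model such as UNITER, VILLA, or OFA")

example = VEExample("flickr30k/402744.jpg", "Two dogs play in the snow.", "entailment")
```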

Most implemented papers

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
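
A rough sketch of the conditional-masking idea, assuming token-level masking on the text side while the image is kept fully observed; the names and masking rate are illustrative, not taken from the UNITER code:

```python
import random

MASK_TOKEN = "[MASK]"

def conditional_text_masking(text_tokens, image_regions, mask_prob=0.15):
    """Conditional masking: corrupt only the text tokens and keep the image
    fully observed, instead of randomly masking both modalities at once."""
    masked = [MASK_TOKEN if random.random() < mask_prob else tok
              for tok in text_tokens]
    # The image regions are returned untouched; the model must reconstruct
    # the masked words conditioned on the full visual observation.
    return masked, image_regions

tokens, regions = conditional_text_masking(
    ["a", "dog", "runs", "in", "the", "snow"], ["region_0", "region_1"])
```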

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
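
The sketch below shows the general shape of adversarial training in embedding space: perturb the multimodal embeddings along the loss gradient and train on both clean and perturbed inputs. It is a one-step FGSM-style schematic of that idea, not the VILLA recipe (which uses "free" adversarial training and a KL-divergence adversarial regularizer):

```python
import torch

def adversarial_step(model, embeddings, labels, loss_fn, epsilon=1e-3):
    """One schematic adversarial step on multimodal embeddings:
    perturb along the loss gradient, then train on clean + perturbed inputs."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), labels)
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    delta = epsilon * grad.sign()      # perturbation in embedding space, not pixels or words
    adv_loss = loss_fn(model(embeddings + delta), labels)
    return clean_loss + adv_loss       # combined objective to backpropagate
```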

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

researchmm/soho CVPR 2021

As region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language descriptions.

How Much Can CLIP Benefit Vision-and-Language Tasks?

clip-vil/CLIP-ViL 13 Jul 2021

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
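
For instance, the CLIP image encoder can be reused as the visual backbone of a V&L model. Below is a minimal sketch with the Hugging Face transformers CLIP classes; the checkpoint name and the idea of feeding the features to a downstream VE head are assumptions, not the CLIP-ViL training code:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Use CLIP's image tower as the visual encoder of a V&L model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                        # any local image
inputs = processor(images=image, return_tensors="pt")
visual_features = model.get_image_features(**inputs)    # pooled image embedding

# A downstream V&L head (e.g., a VE classifier) would fuse these features
# with text representations.
print(visual_features.shape)                             # torch.Size([1, 512])
```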

Visual Entailment Task for Visually-Grounded Language Learning

necla-ml/SNLI-VE 26 Nov 2018

We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) in that the premise is defined by an image, rather than a natural language sentence as in TE tasks.

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

necla-ml/SNLI-VE 20 Jan 2019

We evaluate various existing VQA baselines and build a new model, the Explainable Visual Entailment (EVE) system, to address the VE task.

Check It Again: Progressive Visual Question Answering via Visual Entailment

PhoebusSi/SAR 8 Jun 2021

Moreover, existing methods only explore the interaction between image and question, ignoring the semantics of candidate answers.
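
Conceptually, the approach turns each candidate answer into part of a hypothesis and verifies it against the image with visual entailment. The sketch below is only a schematic of that reranking idea, with a hypothetical entailment_score standing in for a trained VE model; it is not the SAR implementation:

```python
def entailment_score(image_path: str, hypothesis: str) -> float:
    """Hypothetical VE scorer: how strongly the image entails the text."""
    raise NotImplementedError("plug in a trained visual entailment model")

def rerank_answers(image_path: str, question: str, candidate_answers: list) -> str:
    """Re-rank VQA candidates by verifying (image, question + answer) pairs
    with visual entailment, so answer semantics are taken into account."""
    scored = [(ans, entailment_score(image_path, f"{question} {ans}"))
              for ans in candidate_answers]
    return max(scored, key=lambda pair: pair[1])[0]
```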

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
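
Under such a unified paradigm, visual entailment becomes one more instruction-driven sequence-to-sequence problem. The sketch below only illustrates that framing, with a hypothetical generate_text interface and prompt wording (not the OFA API):

```python
def generate_text(image_path: str, prompt: str) -> str:
    """Hypothetical unified image-to-text model: one interface for all tasks."""
    raise NotImplementedError("plug in a unified sequence-to-sequence model")

# Visual entailment expressed as text generation rather than classification.
hypothesis = "Two dogs play in the snow."
prompt = f'Can the image describe "{hypothesis}"?'
# A unified model would be expected to generate "yes" (entailment),
# "no" (contradiction), or "maybe" (neutral):
# answer = generate_text("example.jpg", prompt)
```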

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

fawazsammani/nlxgpt CVPR 2022

Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. task model), e.g., a VQA model, via a language model (a.k.a. explanation model), e.g., GPT.
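
NLX-GPT instead collapses the two into a single model that generates the answer and its explanation as one text sequence. The sketch below shows only that output format, with made-up names and an example string, not the released implementation:

```python
def split_answer_and_explanation(generated: str) -> tuple:
    """Split a single generated sequence of the form
    '<answer> because <explanation>' into its two parts."""
    answer, _, explanation = generated.partition(" because ")
    return answer, explanation

# Example of the kind of sequence one unified model might generate:
print(split_answer_and_explanation(
    "snowboarding because the man is riding a board down a snowy slope"))
# -> ('snowboarding', 'the man is riding a board down a snowy slope')
```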