Visual Entailment
33 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
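As a rough illustration, VE is commonly framed as 3-way classification (entailment / neutral / contradiction) over a fused image-text representation. The sketch below is a minimal, generic formulation; the encoders and the fusion head are placeholders, not any specific published model.

```python
# Minimal sketch of Visual Entailment as 3-way classification.
# The image/text encoders are assumed, placeholder modules.
import torch
import torch.nn as nn

class VEClassifier(nn.Module):
    def __init__(self, image_encoder, text_encoder, dim=768, num_labels=3):
        super().__init__()
        self.image_encoder = image_encoder  # maps images      -> (batch, dim)
        self.text_encoder = text_encoder    # maps hypotheses  -> (batch, dim)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_labels),
        )

    def forward(self, images, hypotheses):
        v = self.image_encoder(images)      # premise: the image
        t = self.text_encoder(hypotheses)   # hypothesis: the sentence
        logits = self.head(torch.cat([v, t], dim=-1))
        return logits  # argmax over {entailment, neutral, contradiction}
```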
Most implemented papers
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
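The distinction being drawn is between jointly masking both modalities at random and masking only one modality at a time while the other stays fully observed. A minimal sketch of the conditional-masking idea, with hypothetical helper names:

```python
import torch

def conditional_text_masking(text_ids, image_feats, mask_token_id, p=0.15):
    """Mask text tokens for masked language modeling while leaving every
    image region observed.

    Joint random masking would also drop some image regions here; conditional
    masking keeps the other modality fully visible, so the model cannot try to
    recover a masked word from a region that is itself masked. Illustrative
    sketch only, not UNITER's implementation.
    """
    mask = torch.rand(text_ids.shape) < p
    masked_ids = text_ids.clone()
    masked_ids[mask] = mask_token_id
    # image_feats are returned untouched: full observation of the image.
    return masked_ids, mask, image_feats
```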
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
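In other words, training combines a CLIP-style contrastive objective over pooled unimodal embeddings with a standard autoregressive captioning cross-entropy on the decoder. A minimal sketch of such a combined loss; the weights, temperature, and variable names are illustrative, not the official implementation:

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    temperature=0.07, contrastive_weight=1.0, caption_weight=2.0):
    # Contrastive term over L2-normalised unimodal embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # Captioning term: autoregressive cross-entropy on the multimodal
    # decoder outputs (teacher forcing).
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=-100,
    )
    return contrastive_weight * contrastive + caption_weight * captioning
```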
How Much Can CLIP Benefit Vision-and-Language Tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
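The paper's proposal is to plug CLIP, pre-trained on web-scale image-text pairs, in as the visual encoder for V&L tasks. A minimal sketch of that setup using the Hugging Face `transformers` CLIP vision tower (assumed to be installed; the downstream fusion module and task head are omitted):

```python
# Sketch: a pre-trained CLIP vision tower as the visual encoder for a
# downstream V&L head, in the spirit of CLIP-ViL.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_image(pil_image):
    inputs = processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        out = vision_tower(**inputs)
    # Patch-level features that a fusion module or task head can consume.
    return out.last_hidden_state  # (1, num_patches + 1, hidden_size)
```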
Distilled Dual-Encoder Model for Vision-Language Understanding
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.
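A sketch of the core idea: align the student dual-encoder's cross-modal attention distributions with those produced by a fusion-encoder teacher, e.g. via KL divergence. Tensor shapes and names here are assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(teacher_attn, student_attn, eps=1e-8):
    """KL divergence between cross-modal attention distributions.

    teacher_attn, student_attn: (batch, heads, query_len, key_len) attention
    probabilities from the fusion-encoder teacher and the dual-encoder
    student. Illustrative of cross-modal attention distillation only.
    """
    teacher = teacher_attn.clamp_min(eps)
    student = student_attn.clamp_min(eps)
    return F.kl_div(student.log(), teacher, reduction="batchmean")
```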
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.
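Under such a unified sequence-to-sequence paradigm, Visual Entailment becomes text generation: the image and an instruction form the input, and the label is emitted as a word. The prompt template and label words below are illustrative, not the exact instructions used by OFA:

```python
# Sketch: casting Visual Entailment as sequence-to-sequence generation.
LABEL_WORDS = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}

def build_ve_example(hypothesis, label=None):
    source = f'can the image and the text "{hypothesis}" imply each other?'
    target = LABEL_WORDS[label] if label is not None else None
    # Image patches are fed to the encoder alongside `source`;
    # the decoder is trained to generate `target`.
    return source, target
```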
Visual Spatial Reasoning
Spatial relations are a basic part of human cognition.
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning
This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.
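Conceptually, adversarial training of this kind perturbs the image and/or text embeddings rather than raw pixels or tokens. A one-step sketch of the general recipe, not VILLA's exact "free" multi-step algorithm; the model is assumed to accept embeddings directly:

```python
import torch

def adversarial_step(model, embeds, labels, loss_fn, epsilon=1e-2, alpha=1e-3):
    """One-step adversarial perturbation in the embedding space.

    Illustrative of adversarial training for V+L representation learning;
    VILLA builds on this idea with multi-step, "free" variants.
    """
    delta = torch.zeros_like(embeds, requires_grad=True)
    loss = loss_fn(model(embeds + delta), labels)
    grad, = torch.autograd.grad(loss, delta)
    # Gradient ascent on the perturbation, projected onto an L_inf ball.
    delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()
    adv_loss = loss_fn(model(embeds + delta), labels)
    return adv_loss
```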
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.
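A GMNER prediction therefore couples a text span with an entity type and, when the entity is visually groundable, a region in the paired image. A hypothetical container for such predictions; field names and the example values are illustrative, not a dataset schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundedEntity:
    """One GMNER prediction: entity span, entity type, and an optional
    bounding box in the paired image (None if the entity is not
    visually groundable)."""
    text: str                                                  # e.g. "Lionel Messi"
    entity_type: str                                           # e.g. "PER"
    box: Optional[Tuple[float, float, float, float]] = None    # (x1, y1, x2, y2)
```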