Visual Entailment

29 papers with code • 3 benchmarks • 3 datasets

Visual Entailment (VE) - is a task consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.

Libraries

Use these libraries to find Visual Entailment models and implementations
2 papers
2,359

Most implemented papers

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i. e., masked language/region modeling is conditioned on full observation of image/text).

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.

How Much Can CLIP Benefit Vision-and-Language Tasks?

clip-vil/CLIP-ViL 13 Jul 2021

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

ofa-sys/ofa 7 Feb 2022

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization.

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

researchmm/soho CVPR 2021

As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

zhegan27/VILLA NeurIPS 2020

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning.

Distilled Dual-Encoder Model for Vision-Language Understanding

kugwzk/distilled-dualencoder 16 Dec 2021

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.

Visual Spatial Reasoning

cambridgeltl/visual-spatial-reasoning 30 Apr 2022

Spatial relations are a basic part of human cognition.

Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning

khuangaf/chocolate 15 Dec 2023

This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

JinYuanLi0012/RiVEG 15 Feb 2024

Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.