Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task consisting of image-sentence pairs, whereby the premise is defined by an image rather than a natural language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
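The definition above maps onto a three-way classification problem over (image premise, text hypothesis) pairs. Below is a minimal sketch, not any particular paper's method: CLIP serves only as a convenient off-the-shelf encoder, and the linear head is an untrained, hypothetical placeholder; a real system would fine-tune end to end on a VE dataset such as SNLI-VE.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical, untrained 3-way head over the concatenated embeddings.
head = torch.nn.Linear(encoder.config.projection_dim * 2, 3)
LABELS = ["entailment", "neutral", "contradiction"]

def predict(image: Image.Image, hypothesis: str) -> str:
    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # Encode the image premise and the text hypothesis separately.
        image_emb = encoder.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = encoder.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
        # Score the pair jointly.
        logits = head(torch.cat([image_emb, text_emb], dim=-1))
    return LABELS[logits.argmax(dim=-1).item()]
```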
Latest papers with no code
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Visual entailment (VE) is a multimodal reasoning task consisting of image-sentence pairs, whereby a premise is defined by an image and a hypothesis is described by a sentence.
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
We train models for these tasks in a zero-shot cross-modal transfer setting, a domain where the previous state-of-the-art method relied on fixed-scale noise injection, often compromising the semantic content of the original modality embedding.
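The fixed-scale noise injection mentioned here (the baseline this paper improves on) can be illustrated as adding zero-mean Gaussian noise with a constant standard deviation to a modality embedding. A minimal sketch follows; the scale value is a hypothetical placeholder, and this shows only the baseline, not ArcSin's adaptive method.

```python
import torch

def inject_fixed_scale_noise(embedding: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise with a constant std to an embedding.

    A fixed scale ignores the magnitude of the embedding itself, which is
    how noise can end up "compromising the semantic content" of the
    original modality embedding.
    """
    return embedding + scale * torch.randn_like(embedding)
```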
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.
Lightweight In-Context Tuning for Multimodal Unified Models
In-context learning (ICL) involves reasoning from given contextual examples.
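As a concrete (text-only, hypothetical) illustration of ICL: the model is conditioned on a few input-output demonstrations followed by a query, and must infer the task from the examples alone, with no parameter updates.

```python
def build_icl_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: demonstrations followed by the query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Example: a sentiment task specified purely through two demonstrations.
prompt = build_icl_prompt(
    demos=[("great movie!", "positive"), ("boring plot", "negative")],
    query="loved the soundtrack",
)
```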
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning
We exploit context by pretraining our model with datasets of three tasks: news image captioning where the news article is the context, contextual visual entailment, and keyword extraction from the context.
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Hence we advocate that the key to better performance lies in meaningful latent modality structures rather than perfect modality alignment.
Few-shot Multimodal Multitask Multilingual Learning
While few-shot learning as a transfer learning paradigm has gained significant traction for scenarios with limited data, it has primarily been explored in the context of building unimodal and unilingual models.
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift
Multimodal image-text models have shown remarkable performance in the past few years.
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
We concatenate all the compound tokens for further processing with a multimodal encoder.
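The "channel fusion" in the title suggests fusing aligned vision and language tokens along the channel (feature) dimension rather than stacking them along the sequence dimension. A minimal sketch under that reading; the shapes and the generic Transformer encoder are assumptions, not the paper's architecture.

```python
import torch

# Hypothetical shapes: B = batch, N = aligned token pairs, D = channels.
B, N, D = 2, 16, 256
vision_tokens = torch.randn(B, N, D)
text_tokens = torch.randn(B, N, D)

# Channel fusion: concatenate along the feature axis, giving N compound
# tokens of width 2D (contrast with sequence concatenation, which would
# give 2N tokens of width D).
compound_tokens = torch.cat([vision_tokens, text_tokens], dim=-1)  # (B, N, 2D)

# The compound tokens are then processed by a multimodal encoder;
# a generic TransformerEncoder stands in here as a placeholder.
layer = torch.nn.TransformerEncoderLayer(d_model=2 * D, nhead=8, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
fused = encoder(compound_tokens)  # (B, N, 2D)
```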
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.