Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task consisting of image-sentence pairs, whereby the premise is defined by an image rather than a natural language sentence, as in traditional Textual Entailment tasks. The goal is to predict whether the image semantically entails the text.
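The definition above maps onto a three-way classification problem over (image premise, text hypothesis) pairs. Below is a minimal sketch, not any particular paper's method: CLIP serves only as a convenient off-the-shelf encoder, and the linear head is an untrained, hypothetical placeholder; a real system would fine-tune end to end on a VE dataset such as SNLI-VE.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical, untrained 3-way head over the concatenated embeddings.
head = torch.nn.Linear(encoder.config.projection_dim * 2, 3)
LABELS = ["entailment", "neutral", "contradiction"]

def predict(image: Image.Image, hypothesis: str) -> str:
    inputs = processor(text=[hypothesis], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # Encode the image premise and the text hypothesis separately.
        image_emb = encoder.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = encoder.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
        # Score the pair jointly.
        logits = head(torch.cat([image_emb, text_emb], dim=-1))
    return LABELS[logits.argmax(dim=-1).item()]
```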
Latest papers with no code
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Visual entailment (VE) is a multimodal reasoning task consisting of image-sentence pairs, whereby a premise is defined by an image and a hypothesis is described by a sentence.
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
We train models for these tasks in a zero-shot cross-modal transfer setting, a domain where the previous state-of-the-art method relied on fixed-scale noise injection, often compromising the semantic content of the original modality embedding.
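The fixed-scale noise injection mentioned here (the baseline this paper improves on) can be illustrated as adding zero-mean Gaussian noise with a constant standard deviation to a modality embedding. A minimal sketch follows; the scale value is a hypothetical placeholder, and this shows only the baseline, not ArcSin's adaptive method.

```python
import torch

def inject_fixed_scale_noise(embedding: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise with a constant std to an embedding.

    A fixed scale ignores the magnitude of the embedding itself, which is
    how noise can end up "compromising the semantic content" of the
    original modality embedding.
    """
    return embedding + scale * torch.randn_like(embedding)
```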
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.
Lightweight In-Context Tuning for Multimodal Unified Models
In-context learning (ICL) involves reasoning from given contextual examples.
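As a concrete (text-only, hypothetical) illustration of ICL: the model is conditioned on a few input-output demonstrations followed by a query, and must infer the task from the examples alone, with no parameter updates.

```python
def build_icl_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: demonstrations followed by the query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Example: a sentiment task specified purely through two demonstrations.
prompt = build_icl_prompt(
    demos=[("great movie!", "positive"), ("boring plot", "negative")],
    query="loved the soundtrack",
)
```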
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning
We exploit context by pretraining our model with datasets of three tasks: news image captioning where the news article is the context, contextual visual entailment, and keyword extraction from the context.
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Hence we advocate that the key to better performance lies in meaningful latent modality structures rather than perfect modality alignment.
Few-shot Multimodal Multitask Multilingual Learning
While few-shot learning as a transfer learning paradigm has gained significant traction for scenarios with limited data, it has primarily been explored in the context of building unimodal and unilingual models.
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift
Multimodal image-text models have shown remarkable performance in the past few years.
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
We concatenate all the compound tokens for further processing with a multimodal encoder.
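The "channel fusion" in the title suggests fusing aligned vision and language tokens along the channel (feature) dimension rather than stacking them along the sequence dimension. A minimal sketch under that reading; the shapes and the generic Transformer encoder are assumptions, not the paper's architecture.

```python
import torch

# Hypothetical shapes: B = batch, N = aligned token pairs, D = channels.
B, N, D = 2, 16, 256
vision_tokens = torch.randn(B, N, D)
text_tokens = torch.randn(B, N, D)

# Channel fusion: concatenate along the feature axis, giving N compound
# tokens of width 2D (contrast with sequence concatenation, which would
# give 2N tokens of width D).
compound_tokens = torch.cat([vision_tokens, text_tokens], dim=-1)  # (B, N, 2D)

# The compound tokens are then processed by a multimodal encoder;
# a generic TransformerEncoder stands in here as a placeholder.
layer = torch.nn.TransformerEncoderLayer(d_model=2 * D, nhead=8, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
fused = encoder(compound_tokens)  # (B, N, 2D)
```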
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.