Visual Entailment
27 papers with code • 3 benchmarks • 3 datasets
Visual Entailment (VE) is a task over image-sentence pairs in which the premise is an image rather than a natural-language sentence, as in traditional Textual Entailment. The goal is to predict whether the image semantically entails the text.
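As a minimal sketch of the setup (the encoders, fusion layer, and three-way label scheme follow the common SNLI-VE convention and are assumptions, not any specific published model): a VE model scores an (image, sentence) pair as entailment, neutral, or contradiction.

```python
import torch
import torch.nn as nn

class VEClassifier(nn.Module):
    """Illustrative visual-entailment head: fuse an image embedding
    (the premise) with a text embedding (the hypothesis) and predict
    entailment / neutral / contradiction."""

    def __init__(self, img_dim=768, txt_dim=768, hidden=512, n_labels=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_labels),
        )

    def forward(self, img_emb, txt_emb):
        # img_emb: (batch, img_dim) from any image encoder (assumption)
        # txt_emb: (batch, txt_dim) from any text encoder (assumption)
        return self.fuse(torch.cat([img_emb, txt_emb], dim=-1))

model = VEClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 768))
labels = logits.argmax(-1)  # label order here is a convention of this sketch
```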
Latest papers
MoPE: Parameter-Efficient and Scalable Multimodal Fusion via Mixture of Prompt Experts
Building upon this disentanglement, we introduce the mixture of prompt experts (MoPE) technique to enhance expressiveness.
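For intuition, here is a minimal mixture-of-prompt-experts routing step (all dimensions, the routing input, and the soft-weighting scheme are illustrative assumptions, not the paper's exact design): a router scores K learned prompt experts per instance, and their weighted combination becomes the prompt prepended to a frozen backbone's input.

```python
import torch
import torch.nn as nn

class MoPEPrompt(nn.Module):
    """Sketch of a mixture of prompt experts: K learned prompts are
    combined per instance via a softmax router conditioned on a
    routing feature (e.g. an embedding of the paired modality)."""

    def __init__(self, n_experts=4, prompt_len=8, dim=768, route_dim=768):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(route_dim, n_experts)

    def forward(self, route_feat):
        # route_feat: (batch, route_dim) -> expert weights (batch, K)
        weights = self.router(route_feat).softmax(dim=-1)
        # Weighted sum over experts -> per-instance prompt (batch, prompt_len, dim)
        return torch.einsum('bk,kld->bld', weights, self.experts)

prompts = MoPEPrompt()(torch.randn(2, 768))
print(prompts.shape)  # torch.Size([2, 8, 768])
```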
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features serve as the node features and the attention matrix serves as the graph adjacency matrix.
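In code, that correspondence reduces to a single step (a plain-Laplacian sketch; the paper's p-Laplacian edge reweighting is omitted here): treating the attention matrix as a row-stochastic adjacency, each token aggregates the value features of its neighbours.

```python
import torch

def attention_message_passing(attn, values):
    """One message-passing step on the attention graph: the attention
    matrix acts as the adjacency (rows sum to 1 after softmax), so
    each token aggregates its neighbours' value features."""
    # attn:   (batch, tokens, tokens)
    # values: (batch, tokens, dim)
    return torch.bmm(attn, values)

attn = torch.softmax(torch.randn(1, 5, 5), dim=-1)
out = attention_message_passing(attn, torch.randn(1, 5, 16))
print(out.shape)  # torch.Size([1, 5, 16])
```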
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning
This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions.
Good Questions Help Zero-Shot Image Reasoning
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
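The question-first idea can be sketched as a simple prompting loop (the helper callables and prompt wording below are hypothetical stand-ins, not QVix's actual prompts): an LLM drafts exploratory questions, the LVLM answers them against the image, and the collected clues condition the final answer.

```python
def question_guided_answer(image, hypothesis, llm, lvlm, n_questions=3):
    """Illustrative question-guided reasoning loop. `llm` and `lvlm`
    are hypothetical callables: llm(prompt) -> str, lvlm(image, prompt) -> str."""
    questions = llm(f"Write {n_questions} short questions that probe the "
                    f"visual details relevant to: '{hypothesis}'").splitlines()
    clues = [f"Q: {q} A: {lvlm(image, q)}" for q in questions if q.strip()]
    prompt = ("Given these observations:\n" + "\n".join(clues) +
              f"\nDoes the image entail: '{hypothesis}'? "
              "Answer entailment, neutral, or contradiction.")
    return lvlm(image, prompt)
```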
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora.
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
We propose to solve the task through collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates a visual elaboration of the linguistic metaphor, capturing its implicit meaning and relevant objects, which then serves as input to diffusion-based text-to-image models. Using a human-AI collaboration framework, in which humans interact with both the LLM and the top-performing diffusion model, we create a high-quality dataset of 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations.
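The two-stage pipeline can be sketched as follows (the callables and the chain-of-thought prompt wording are hypothetical stand-ins, not the paper's exact prompts):

```python
def visual_metaphor(linguistic_metaphor, llm, text_to_image):
    """Sketch of the LLM-to-diffusion collaboration. `llm` and
    `text_to_image` are hypothetical callables: llm(prompt) -> str,
    text_to_image(prompt) -> image."""
    # Stage 1: the LLM unpacks the metaphor's implicit meaning and
    # concrete objects into a visual elaboration.
    elaboration = llm(
        "Let's think step by step. What does the metaphor "
        f"'{linguistic_metaphor}' implicitly mean, and which concrete "
        "objects would depict it? Write one visual scene description."
    )
    # Stage 2: the elaboration becomes the diffusion model's prompt.
    return text_to_image(elaboration)
```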
Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations
Natural language explanations promise to offer intuitively understandable explanations of a neural network's decision process in complex vision-language tasks, as pursued in recent VL-NLE models.
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images.
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Multimodal semantic understanding often has to deal with uncertainty: a given message may refer to multiple targets.
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks.