Visual Storytelling
25 papers with code • 1 benchmark • 4 datasets
Latest papers without code
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings
Recent advances in large language models elicit chain-of-thought reasoning, allowing models to decompose problems in a human-like fashion.
Visual Transformation Telling
In this paper, we propose a new visual reasoning task, called Visual Transformation Telling (VTT).
A-CAP: Anticipation Captioning with Commonsense Knowledge
Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time.
Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences
The image sequences are aligned with a total of 12K stories, collected via crowdsourcing by showing annotators the image sequences together with a set of grounded characters from each sequence.
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
To enhance the correlation between vision and language in disentangled spaces, we introduce visual concepts into DiMBERT, which represent visual information in textual form.
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition.
Vision Transformer Based Model for Describing a Set of Images as a Story
Visual storytelling is the process of forming a multi-sentence story from a set of images.
Coherent Visual Storytelling via Parallel Top-Down Visual and Topic Attention
In this work, a coherent visual storytelling (CoVS) framework is designed to address the above-mentioned problems.
SentiStory: A Multi-Layered Sentiment-Aware Generative Model for Visual Storytelling
The visual storytelling (VIST) task aims at generating reasonable, human-like, and coherent stories from image streams given as input.