Video Description
25 papers with code • 0 benchmarks • 7 datasets
The goal of automatic Video Description is to tell a story about the events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, dense video captioning has more recently been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired or detecting and describing important events in surveillance footage.
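The dense-captioning formulation above can be made concrete with a small sketch: each detected event carries a time interval and a sentence, and a coherent multi-sentence description is the time-ordered concatenation of the per-event captions. The `Event` type and `describe` helper below are illustrative only, not taken from any system cited on this page.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float    # event start time in seconds
    end: float      # event end time in seconds
    sentence: str   # natural-language description of the event

def describe(events):
    """Join per-event captions into one time-ordered description."""
    ordered = sorted(events, key=lambda e: e.start)
    return " ".join(e.sentence for e in ordered)

# Events may be detected out of order; describe() sorts them by start time.
events = [
    Event(12.0, 20.5, "A man lights the grill."),
    Event(0.0, 8.3, "A man carries charcoal into the backyard."),
]
print(describe(events))
# -> A man carries charcoal into the backyard. A man lights the grill.
```

In a real dense video captioning system, the event boundaries and sentences would both be predicted by a model; this sketch only captures the output structure the task definition implies.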
Source: Joint Event Detection and Description in Continuous Video Streams
Benchmarks
These leaderboards are used to track progress in Video Description.
Datasets
Latest papers
JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models
However, the complexities of these diverse modalities pose challenges for developing an efficient multimodal emotion cause analysis (ECA) system.
FunQA: Towards Surprising Video Comprehension
Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention.
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Since pretraining resources with Indonesian sentences are relatively limited, the applicability of those approaches to our dataset remains questionable.
Fine-grained Audible Video Description
We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD).
Thinking Hallucination for Video Captioning
In video captioning, there are two kinds of hallucination: object and action hallucination.
What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world.
Learn to Understand Negation in Video Retrieval
We propose a learning based method for training a negation-aware video retrieval model.
Identity-Aware Multi-Sentence Video Description
This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description.
Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
With rising concerns about AI systems being given direct access to abundant sensitive information, researchers seek to develop more reliable AI that relies on implicit information sources.
Delving Deeper into the Decoder for Video Captioning
Video captioning is an advanced multi-modal task that aims to describe a video clip using a natural language sentence.