Automatic evaluation of text generation tasks (e.g., machine translation, text summarization, image captioning, and video description) usually relies heavily on task-specific metrics such as BLEU and ROUGE.
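For concreteness, the sketch below shows how such metrics are typically computed on a single hypothesis-reference pair. It assumes the NLTK and `rouge-score` packages are available, and the example sentences are invented for illustration.

```python
# Minimal sketch of metric-based evaluation; the sentences are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a man is playing a guitar on stage"
candidate = "a man plays a guitar"

# BLEU compares candidate n-grams against one or more tokenized references;
# smoothing avoids zero scores when higher-order n-grams never match.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")

# ROUGE-L measures longest-common-subsequence overlap between the strings.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```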
In this context, we propose an approach that takes into account both the local and global temporal structure of videos to produce descriptions.
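One common way to model global temporal structure is a soft attention mechanism that re-weights per-frame features at each decoding step. The sketch below illustrates this idea in NumPy with randomly generated features and projection weights; it is a generic temporal-attention sketch, not the exact formulation of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 20, 512, 256                 # frames, feature dim, decoder state dim

frames = rng.standard_normal((T, D))   # per-frame (local) features
h = rng.standard_normal(H)             # current decoder hidden state
W_f = rng.standard_normal((D, H))      # projection of frame features
W_h = rng.standard_normal((H, H))      # projection of decoder state
w = rng.standard_normal(H)             # scoring vector

# Relevance score of each frame given the current decoder state.
scores = np.tanh(frames @ W_f + h @ W_h) @ w   # shape (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                           # softmax over time

# Context vector: attention-weighted sum over all frames, letting the
# decoder focus on different temporal segments for each generated word.
context = alpha @ frames                       # shape (D,)
print(context.shape, alpha.round(3))
```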
The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.
This paper strives to find, among a set of sentences, the one that best describes the content of a given image or video.
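Cast as retrieval, this amounts to embedding the video and each candidate sentence in a joint space and ranking candidates by similarity. The sketch below uses random placeholder embeddings and invented candidate names, since the actual encoders depend on the chosen model.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 128                                      # joint embedding dimension

video_emb = rng.standard_normal(D)           # placeholder video embedding
sentence_embs = rng.standard_normal((5, D))  # 5 candidate sentence embeddings
sentences = [f"candidate sentence {i}" for i in range(5)]

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank candidates by cosine similarity to the video embedding
# and select the best-matching description.
scores = [cosine(video_emb, s) for s in sentence_embs]
best = int(np.argmax(scores))
print(sentences[best], round(scores[best], 3))
```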
Scene-aware dialog systems will be able to converse with users about the objects and events around them.
Among the main issues are the fluency and coherence of the generated descriptions and their relevance to the video.
We propose a novel methodology that exploits information from temporally neighboring events, which precisely matches the nature of egocentric sequences.
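Purely as an illustration of exploiting temporally neighboring events (not the exact method proposed here), one simple instantiation conditions each event's caption decoder on features pooled from the preceding and following clips:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 8, 256                          # events in the sequence, feature dim
events = rng.standard_normal((N, D))   # one feature vector per egocentric event

def contextualize(events, i):
    """Augment event i with mean-pooled features of its temporal neighbors."""
    prev = events[i - 1] if i > 0 else np.zeros_like(events[i])
    nxt = events[i + 1] if i < len(events) - 1 else np.zeros_like(events[i])
    neighbor_ctx = (prev + nxt) / 2.0
    # Concatenated representation that a captioning decoder would consume.
    return np.concatenate([events[i], neighbor_ctx])

print(contextualize(events, 3).shape)  # (2 * D,)
```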