The goal of automatic video description is to tell a story about the events happening in a video. Early video description methods produced captions for short clips that had been manually segmented to contain a single event of interest. More recently, dense video captioning has been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem generalizes dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired or detecting and describing important events in surveillance footage.

Source: Joint Event Detection and Description in Continuous Video Streams
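
To make the dense captioning task described above concrete, here is a minimal sketch of what its output could look like: a list of time-stamped events, each paired with a sentence. The names `EventCaption`, `temporal_iou`, and `to_paragraph` are hypothetical and do not come from any of the papers listed on this page. Temporal intersection over union is shown only as one common way to compare a predicted segment with a reference segment, not as the evaluation protocol of any specific paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventCaption:
    """One detected event: a time span plus its description (hypothetical type)."""
    start: float  # event start time in seconds
    end: float    # event end time in seconds
    sentence: str


def temporal_iou(a: EventCaption, b: EventCaption) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


def to_paragraph(events: List[EventCaption]) -> str:
    """Join the per-event sentences in chronological order."""
    ordered = sorted(events, key=lambda e: e.start)
    return " ".join(e.sentence for e in ordered)


# Illustrative, made-up predictions for a single video.
predicted = [
    EventCaption(12.0, 20.0, "She chops the onions."),
    EventCaption(0.0, 10.0, "A woman walks into a kitchen."),
]
reference = EventCaption(0.0, 8.0, "A woman enters the kitchen.")

print(to_paragraph(predicted))
print(temporal_iou(predicted[1], reference))
```

Keeping the time span and the sentence in one record reflects the two halves of the task: localizing each event and describing it.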

Most implemented papers

Describing Videos by Exploiting Temporal Structure

yaoli/arctic-capgen-vid ICCV 2015

In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions.

Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge 1 Jun 2018

Scene-aware dialog systems will be able to have conversations with users about the objects and events around them.

Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

TejInaco/multimodalML EMNLP 2016

This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos.

Grounded Video Description

facebookresearch/grounded-video-description CVPR 2019

Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase.

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

eric-xw/Video-guided-Machine-Translation ICCV 2019

We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.

Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research

jssprz/video_captioning_datasets 3 Mar 2015

DVS is an audio narration describing the visual elements and actions in a movie for the visually impaired.

TGIF: A New Dataset and Benchmark on Animated GIF Description

raingo/TGIF-Release CVPR 2016

The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.

Video Description using Bidirectional Recurrent Neural Networks

lvapeab/ABiViRNet 12 Apr 2016

Although traditionally used in the machine translation field, the encoder-decoder framework has recently been applied to the generation of video and image descriptions.

A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography Detection

jackaduma/nude-detect 12 May 2016

Although these approaches provide good results, they generally have the disadvantage of a high false positive rate since not all images with large areas of skin exposure are necessarily pornographic images, such as people wearing swimsuits or images related to sports.