Video Description

25 papers with code • 0 benchmarks • 7 datasets

The goal of automatic Video Description is to tell a story about the events happening in a video. While early Video Description methods produced captions for short clips that were manually segmented to contain a single event of interest, dense video captioning has more recently been proposed to both segment distinct events in time and describe them in a series of coherent sentences. This problem is a generalization of dense image region captioning and has many practical applications, such as generating textual summaries for the visually impaired or detecting and describing important events in surveillance footage.

Source: Joint Event Detection and Description in Continuous Video Streams
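The two-stage structure described above (temporal event segmentation followed by per-event captioning) can be sketched in a few lines. This is a minimal, illustrative outline only; `propose_events` and `caption_segment` are hypothetical stand-ins for a temporal proposal model and a clip-captioning model, not components of any specific paper listed here.

```python
from typing import Callable, List, Tuple

def dense_video_captioning(
    video_frames: List,                                       # decoded frames or frame features
    propose_events: Callable[[List], List[Tuple[int, int]]],  # -> [(start_idx, end_idx), ...]
    caption_segment: Callable[[List], str],                   # frames of one event -> sentence
) -> List[Tuple[Tuple[int, int], str]]:
    """Segment distinct events in time and describe each with a sentence."""
    captions = []
    for start, end in propose_events(video_frames):
        segment = video_frames[start:end]
        captions.append(((start, end), caption_segment(segment)))
    return captions

# Toy usage with stand-in models: two fixed proposals covering the clip.
if __name__ == "__main__":
    frames = list(range(120))  # pretend these are 120 decoded frames
    result = dense_video_captioning(
        frames,
        propose_events=lambda f: [(0, 60), (60, len(f))],
        caption_segment=lambda seg: f"an event spanning {len(seg)} frames",
    )
    print(result)
```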

Latest papers with no code

X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model

no code yet • 7 Apr 2024

The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

no code yet • 29 Feb 2024

Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation.
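The selection step described in this excerpt amounts to scoring each candidate caption against the video with the finetuned retrieval model and keeping the highest-scoring one. The sketch below is not the authors' code; it assumes the retrieval model has already produced a video embedding and one embedding per candidate caption, and simply picks the candidate with the highest cosine similarity.

```python
import numpy as np

def select_best_caption(video_emb: np.ndarray,
                        caption_embs: np.ndarray,
                        captions: list[str]) -> str:
    """Return the caption whose embedding is most similar to the video embedding."""
    v = video_emb / np.linalg.norm(video_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = c @ v                        # cosine similarity per candidate caption
    return captions[int(np.argmax(scores))]

# Toy usage with random vectors standing in for the retrieval model's embeddings.
rng = np.random.default_rng(0)
best = select_best_caption(
    video_emb=rng.normal(size=512),
    caption_embs=rng.normal(size=(8, 512)),   # e.g. 8 candidate captions per video
    captions=[f"candidate caption {i}" for i in range(8)],
)
print(best)
```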

Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

no code yet • 23 Jan 2024

Towards a solution for designing this ability in algorithms, we present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, called the Reuters Video-Language News (ReutersViLNews) dataset, which focuses on high-level video-language understanding with an emphasis on long-form news.

ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition

no code yet • 22 Jan 2024

With the proposed ActionHub dataset, we further propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module.

Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)

no code yet • 12 Dec 2023

Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain.

Multi Sentence Description of Complex Manipulation Action Videos

no code yet • 13 Nov 2023

Automatic video description requires the generation of natural language statements about the actions, events, and objects in the video.

CLearViD: Curriculum Learning for Video Description

no code yet • 8 Nov 2023

We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task.
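The excerpt names curriculum learning but does not spell out the schedule, so the following is only a generic illustration of the idea: order training samples from easy to hard and grow the training pool over epochs. Sorting by caption length is an assumed stand-in difficulty measure, not CLearViD's actual criterion.

```python
import random

def curriculum_batches(samples, difficulty, num_epochs, batch_size=4):
    """Yield (epoch, batch) pairs, exposing progressively harder samples over epochs."""
    ordered = sorted(samples, key=difficulty)          # easiest samples first
    for epoch in range(1, num_epochs + 1):
        frac = epoch / num_epochs                      # fraction of data unlocked so far
        pool = ordered[: max(batch_size, int(frac * len(ordered)))]
        random.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield epoch, pool[i : i + batch_size]

# Toy usage: samples are (video_id, caption) pairs; difficulty = caption length.
data = [(f"vid{i}", "word " * n) for i, n in enumerate([3, 5, 8, 13, 21, 34])]
for epoch, batch in curriculum_batches(data, difficulty=lambda s: len(s[1].split()), num_epochs=3):
    print(epoch, [vid for vid, _ in batch])
```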

Analyzing Political Figures in Real-Time: Leveraging YouTube Metadata for Sentiment Analysis

no code yet • 28 Sep 2023

Sentiment analysis of big data from YouTube video metadata can be conducted to analyze public opinion on various political figures who represent political parties.
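As a rough illustration of this kind of analysis (not the paper's method), one could score the sentiment of video titles and descriptions that mention a given figure and aggregate per figure. Collection of metadata via the YouTube Data API is omitted here; the `videos` records and candidate names are hypothetical, and VADER is used only as an example sentiment scorer.

```python
from collections import defaultdict
from nltk.sentiment import SentimentIntensityAnalyzer
# One-time setup: import nltk; nltk.download("vader_lexicon")

videos = [  # assumed already-fetched YouTube metadata records
    {"title": "Candidate A unveils new economic plan", "description": "Supporters praise the proposal."},
    {"title": "Protests erupt over Candidate B's remarks", "description": "Critics call the statement divisive."},
]
figures = ["Candidate A", "Candidate B"]

sia = SentimentIntensityAnalyzer()
scores = defaultdict(list)
for v in videos:
    text = f'{v["title"]} {v["description"]}'
    for figure in figures:
        if figure.lower() in text.lower():
            scores[figure].append(sia.polarity_scores(text)["compound"])

for figure, vals in scores.items():
    print(figure, sum(vals) / len(vals))  # mean compound sentiment per figure
```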

Edit As You Wish: Video Description Editing with Multi-grained Commands

no code yet • 15 May 2023

In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests.

Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation

no code yet • 28 Dec 2021

Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video.