Video Understanding

294 papers with code • 0 benchmarks • 42 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Latest papers with no code

Leveraging Temporal Contextualization for Video Action Recognition

no code yet • 15 Apr 2024

We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video by summarizing it into context tokens, and ultimately leverages the context tokens during the feature encoding process.
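
As a rough sketch of the idea (not the authors' implementation), the mechanism can be pictured as a small set of learnable context tokens that first attend over all frame tokens to summarize the video, and are then attended to while encoding the frames. The module names, dimensions, and use of PyTorch's `nn.MultiheadAttention` below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalContextBlock(nn.Module):
    """Summarize per-frame tokens into a few video-level context tokens,
    then let the frame tokens attend back to them (hypothetical sketch)."""
    def __init__(self, dim=512, num_context_tokens=4, num_heads=8):
        super().__init__()
        # Learnable queries that pool information across the whole video.
        self.context_queries = nn.Parameter(torch.randn(num_context_tokens, dim))
        self.summarize = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.infuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames * tokens_per_frame, dim)
        b = frame_tokens.size(0)
        queries = self.context_queries.unsqueeze(0).expand(b, -1, -1)
        # 1) Interconnect relevant information across the video into context tokens.
        context, _ = self.summarize(queries, frame_tokens, frame_tokens)
        # 2) Leverage the context tokens during frame feature encoding.
        infused, _ = self.infuse(frame_tokens, context, context)
        return frame_tokens + infused, context
```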

In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

no code yet • 14 Apr 2024

Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions.

Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

no code yet • 12 Apr 2024

Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense captions by chapter for the video.
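
For context, PDVC follows a DETR-style design: a fixed set of learnable event queries is decoded in parallel against the frame features, and each query predicts one event's temporal span plus a caption. The toy module below is only a hedged sketch of that parallel-decoding pattern (a single-token "caption" head stands in for the real captioner); it is not the released PDVC code.

```python
import torch
import torch.nn as nn

class ParallelEventDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=10, vocab_size=10000):
        super().__init__()
        # One learnable query per candidate event, decoded all at once.
        self.event_queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.span_head = nn.Linear(dim, 2)              # (start, end), normalized
        self.caption_head = nn.Linear(dim, vocab_size)  # stand-in for a captioner

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, dim) from a video encoder
        b = frame_features.size(0)
        q = self.event_queries.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(q, frame_features)    # all events decoded in parallel
        spans = self.span_head(decoded).sigmoid()    # (batch, num_queries, 2)
        caption_logits = self.caption_head(decoded)  # one "word" per event here
        return spans, caption_logits
```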

A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos

no code yet • 10 Apr 2024

Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important.

Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention

no code yet • 10 Apr 2024

We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input.
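
One plausible reading of such a graph (an assumption for illustration, not the paper's construction) has nodes for detected object categories and edges for per-frame co-occurrence, with gaze fixations up-weighting the objects a person attends to:

```python
import itertools
import networkx as nx

def build_visual_semantic_graph(detections_per_frame, gaze_points):
    """detections_per_frame: per frame, a list of (label, (x1, y1, x2, y2)).
    gaze_points: per frame, one (x, y) gaze fixation."""
    g = nx.Graph()
    for dets, (gx, gy) in zip(detections_per_frame, gaze_points):
        # Nodes: object categories, accumulating gaze evidence over time.
        for label, (x1, y1, x2, y2) in dets:
            hit = 1.0 if (x1 <= gx <= x2 and y1 <= gy <= y2) else 0.0
            prev = g.nodes[label]["gaze"] if g.has_node(label) else 0.0
            g.add_node(label, gaze=prev + hit)
        # Edges: objects that co-occur within the same frame.
        for (a, _), (b, _) in itertools.combinations(dets, 2):
            prev = g.edges[a, b]["weight"] if g.has_edge(a, b) else 0.0
            g.add_edge(a, b, weight=prev + 1.0)
    return g
```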

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

no code yet • 6 Apr 2024

We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.

Koala: Key frame-conditioned long video-LLM

no code yet • 5 Apr 2024

Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.
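
Reading only the title, "key frame-conditioned" suggests sparsely sampled key frames acting as a global summary that conditions the encoding of each short segment. The cross-attention module below is a speculative sketch of that pattern, not Koala's published architecture.

```python
import torch
import torch.nn as nn

class KeyFrameConditioner(nn.Module):
    """Condition a short segment's tokens on global key-frame tokens
    (hypothetical sketch)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, segment_tokens, key_frame_tokens):
        # segment_tokens: (batch, seg_len, dim) for one short segment
        # key_frame_tokens: (batch, num_key_frames, dim) global video summary
        conditioned, _ = self.cross_attn(segment_tokens,
                                         key_frame_tokens, key_frame_tokens)
        return segment_tokens + conditioned  # residual keeps the local detail
```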

OW-VISCap: Open-World Video Instance Segmentation and Captioning

no code yet • 4 Apr 2024

To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video.

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

no code yet • 4 Apr 2024

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding.
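
The "interleaved visual-textual tokens" of the title can be pictured as alternating each frame's visual tokens with its aligned subtitle tokens in a single LLM input sequence; the helper below is a hypothetical illustration of that layout, not the model's actual prompt format.

```python
def interleave_tokens(frame_tokens, subtitle_tokens):
    """frame_tokens: list of per-frame visual-token lists.
    subtitle_tokens: list of per-frame subtitle-token lists (may be empty)."""
    sequence = []
    for vis, txt in zip(frame_tokens, subtitle_tokens):
        sequence.extend(vis)  # this frame's visual tokens...
        sequence.extend(txt)  # ...followed by its aligned subtitle tokens
    return sequence
```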

A Unified Framework for Human-centric Point Cloud Video Understanding

no code yet • 29 Mar 2024

Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications.