Video Understanding
294 papers with code • 0 benchmarks • 42 datasets
A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.
Latest papers with no code
Leveraging Temporal Contextualization for Video Action Recognition
We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize into context tokens, and ultimately leverages the context tokens during the feature encoding process.
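As a rough illustration of the idea described above (not the paper's code), the sketch below reduces each frame to a core vector and summarizes the whole video into one context token via relevance weighting; the function names and the similarity-based weighting rule are assumptions.

```python
# Hypothetical sketch of temporal contextualization: per-frame "core"
# extraction followed by a relevance-weighted summary over frames.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def extract_core(frame_tokens):
    """Reduce a frame's tokens to one core vector (here: the mean)."""
    d, n = len(frame_tokens[0]), len(frame_tokens)
    return [sum(t[i] for t in frame_tokens) / n for i in range(d)]

def summarize_context(frames):
    """Build a video-level context token from per-frame core vectors."""
    cores = [extract_core(f) for f in frames]
    d = len(cores[0])
    mean = [sum(c[i] for c in cores) / len(cores) for i in range(d)]
    # Softmax-style weights: frames similar to the overall content
    # contribute more to the context token.
    weights = [math.exp(cosine(c, mean)) for c in cores]
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * c[i] for w, c in zip(weights, cores)) for i in range(d)]
```

In a layer-wise variant, the resulting context token would be concatenated back into each frame's token set before the next encoder layer.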
In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition
Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions.
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis
Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense, chapter-level captions for the video.
A Transformer-Based Model for the Prediction of Human Gaze Behavior on Videos
Eye-tracking applications that utilize the human gaze in video understanding tasks have become increasingly important.
Gaze-Guided Graph Neural Network for Action Anticipation Conditioned on Intention
We introduce the Gaze-guided Action Anticipation algorithm, which establishes a visual-semantic graph from the video input.
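A minimal sketch of what "gaze-guided" graph construction could look like: object boxes become nodes, and edges touching the gazed-at object receive larger weights so downstream anticipation attends to it. The box format, the containment test, and the boosting rule are assumptions, not the paper's exact formulation.

```python
# Hypothetical gaze-guided graph: edges incident to the object under
# the gaze fixation point are up-weighted.
def inside(box, point):
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1

def build_gaze_graph(boxes, gaze, boost=2.0):
    """boxes: {name: (x0, y0, x1, y1)}; gaze: (x, y) fixation point.
    Returns a fully connected weighted graph as {(u, v): weight}."""
    names = sorted(boxes)
    graph = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            w = 1.0
            if inside(boxes[u], gaze) or inside(boxes[v], gaze):
                w *= boost  # emphasize edges touching the gazed object
            graph[(u, v)] = w
    return graph
```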
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
Koala: Key frame-conditioned long video-LLM
Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships.
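To make "key frame-conditioned" concrete, here is a sketch of the simplest possible selection step: sampling a sparse, uniformly spread set of key frames from a long video. Uniform sampling is an assumption for illustration; Koala's actual key-frame selection may be learned.

```python
# Illustrative sparse key-frame sampling for long-video inputs.
def sample_key_frames(num_frames, k):
    """Return k frame indices spread uniformly over the video."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    # Take the center of each of the k equal-length segments.
    return [int(step * i + step / 2) for i in range(k)]
```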
OW-VISCap: Open-World Video Instance Segmentation and Captioning
To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
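The "track" part of joint segment-track-caption pipelines is often an association step between detections in consecutive frames; the greedy IoU matcher below is a common baseline for that step and is offered only as an illustration, not as OW-VISCap's actual mechanism.

```python
# Illustrative IoU-based track association between consecutive frames.
def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def associate(prev_boxes, cur_boxes, thresh=0.5):
    """Greedily match current detections to previous tracks by IoU.
    Returns {current_index: previous_index}."""
    matches, used = {}, set()
    for i, cb in enumerate(cur_boxes):
        best, best_iou = None, thresh
        for j, pb in enumerate(prev_boxes):
            if j in used:
                continue
            v = iou(pb, cb)
            if v > best_iou:
                best, best_iou = j, v
        if best is not None:
            matches[i] = best
            used.add(best)
    return matches
```

Unmatched current detections would start new tracks, which is how previously unseen (open-world) objects enter the output.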
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding.
A Unified Framework for Human-centric Point Cloud Video Understanding
Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications.
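A small sketch of a typical preprocessing step for human point-cloud sequences: centering each frame's points on its centroid so a model sees pose rather than global position, while keeping the centroid trajectory for motion cues. The frame format (lists of (x, y, z) tuples) is an assumption.

```python
# Illustrative per-frame centering for a human point-cloud video.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def center_sequence(frames):
    """Return centered frames plus the per-frame centroid trajectory."""
    trajectory, centered = [], []
    for pts in frames:
        c = centroid(pts)
        trajectory.append(c)
        centered.append([tuple(p[i] - c[i] for i in range(3)) for p in pts])
    return centered, trajectory
```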