Video Understanding
300 papers with code • 0 benchmarks • 42 datasets
A central task in Video Understanding is to recognise and localise (in space and time) the different actions or events that appear in a video.
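To make the recognition half of the task concrete, here is a minimal sketch of clip-level action recognition with a pretrained 3D CNN from torchvision (r3d_18, trained on Kinetics-400). The random tensor is only a stand-in for frames decoded from a real video.

```python
# Minimal sketch: clip-level action recognition with a pretrained 3D CNN.
# The random uint8 tensor below stands in for 16 decoded video frames.
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT           # Kinetics-400 weights
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()          # resize, crop, normalise

clip = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)  # (T, C, H, W)

with torch.no_grad():
    batch = preprocess(clip).unsqueeze(0)  # -> (1, C, T, H, W)
    logits = model(batch)

print("Predicted action:", weights.meta["categories"][logits.argmax().item()])
```

Temporal localisation goes a step further than this sketch, predicting the start and end time of each action rather than a single clip-level label.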
Latest papers with no code
OW-VISCap: Open-World Video Instance Segmentation and Captioning
We propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
A Unified Framework for Human-centric Point Cloud Video Understanding
Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications.
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
In everyday communication, humans frequently use speech and gestures to refer to specific areas or objects, a process known as Referential Dialogue (RD).
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs.
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.
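Both VideoAgent entries follow a broadly similar agent pattern: sample a few frames, convert them into textual observations kept in a memory, and let an LLM decide whether it can answer or needs to inspect more frames. The sketch below illustrates that loop in generic form only; it is not either paper's implementation, and caption_frames and ask_llm are hypothetical stand-ins for a vision-language captioner and an LLM API call.

```python
# Generic sketch of an LLM-as-agent loop for long-video QA (not either
# paper's actual method). caption_frames() and ask_llm() are hypothetical
# stand-ins for a vision-language captioner and an LLM API call.
from typing import Callable

def answer_video_question(
    frames: list,                            # decoded frames of a long video
    question: str,
    caption_frames: Callable[[list], list],  # frames -> captions
    ask_llm: Callable[[str], str],           # prompt -> reply
    max_rounds: int = 5,
) -> str:
    memory = {}                              # frame index -> caption ("memory")
    # Start from a sparse, uniform sample of the video.
    indices = list(range(0, len(frames), max(1, len(frames) // 8)))
    context = ""
    for _ in range(max_rounds):
        new = [i for i in indices if i not in memory]
        for i, cap in zip(new, caption_frames([frames[i] for i in new])):
            memory[i] = cap
        context = "\n".join(f"t={i}: {c}" for i, c in sorted(memory.items()))
        reply = ask_llm(
            f"Video observations:\n{context}\n\nQuestion: {question}\n"
            "Answer, or reply NEED:<comma-separated frame indices> to see more."
        )
        if not reply.startswith("NEED:"):
            return reply                     # the LLM is confident enough
        indices = [int(t) for t in reply[5:].split(",") if t.strip().isdigit()]
    return ask_llm(f"Observations:\n{context}\n\nGive a best-effort answer: {question}")
```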
Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions
While existing video editing tasks are limited to changes in attributes, backgrounds, and styles, our method aims to predict open-ended human action changes in video.
Beyond MOT: Semantic Multi-Object Tracking
Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos.
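The gap between "where" and "what" is easy to see in a data structure: a plain MOT track carries only per-frame boxes, while a semantic track also carries class and behaviour descriptions. The field names below are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative data structure contrasting plain MOT output ("where") with a
# semantic track ("where" + "what"). Field names are assumptions, not the
# paper's schema.
from dataclasses import dataclass, field

@dataclass
class SemanticTrack:
    track_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)
    category: str = "unknown"                  # object class, e.g. "person"
    activity: str = ""                         # behaviour, e.g. "crossing the street"

track = SemanticTrack(track_id=7, category="person", activity="walking a dog")
track.boxes[0] = (12.0, 30.5, 96.0, 210.0)     # bounding box in frame 0
track.boxes[1] = (14.0, 31.0, 98.5, 211.5)     # bounding box in frame 1
```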
A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, all at once.
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
The development of multimodal models has marked a significant step forward in how machines understand videos.