Video Understanding

300 papers with code • 0 benchmarks • 42 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective
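
To make the expected output concrete, here is a minimal Python sketch of how one spatio-temporally localised action could be represented: a label, a confidence score, a temporal extent, and one bounding box per frame (an "action tube"). The class and field names below are illustrative assumptions, not the format of any particular benchmark.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# A bounding box as (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]

@dataclass
class ActionTube:
    """One spatio-temporally localised action instance (illustrative schema)."""
    label: str                     # e.g. "pedestrian crossing" (hypothetical class name)
    score: float                   # detector confidence in [0, 1]
    start_frame: int               # first frame the action is visible
    end_frame: int                 # last frame the action is visible
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box

    def duration(self, fps: float) -> float:
        """Temporal extent in seconds, given the video frame rate."""
        return (self.end_frame - self.start_frame + 1) / fps

# Example: an action spanning frames 120-150, with one box per frame.
tube = ActionTube(label="pedestrian crossing", score=0.87,
                  start_frame=120, end_frame=150)
tube.boxes[120] = (34.0, 60.0, 118.0, 240.0)
print(f"{tube.label}: {tube.duration(fps=30.0):.2f} s")
```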

Latest papers with no code

OW-VISCap: Open-World Video Instance Segmentation and Captioning

no code yet • 4 Apr 2024

To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
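
The joint segment/track/caption output described above can be pictured as one record per object track. The sketch below is a hypothetical schema, not OW-VISCap's actual interface: an open-world object keeps a free-form caption even when it has no closed-vocabulary category.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CaptionedTrack:
    """Hypothetical per-object record: a mask track plus a free-form caption."""
    track_id: int
    caption: str                    # open-ended description, even for unseen categories
    category: Optional[str] = None  # None when the object falls outside a closed vocabulary
    masks: Dict[int, np.ndarray] = field(default_factory=dict)  # frame index -> H x W boolean mask

novel_object = CaptionedTrack(
    track_id=3,
    caption="a robot vacuum bumping into a chair",
    masks={0: np.zeros((480, 640), dtype=bool)},
)
```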

A Unified Framework for Human-centric Point Cloud Video Understanding

no code yet • 29 Mar 2024

Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications.
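
As a rough picture of the data modality only (the shapes and label below are assumptions, not the paper's setup), a human point cloud video can be stored as a frames × points × 3 array of XYZ coordinates:

```python
import numpy as np

T, N = 32, 2048          # assumed clip length (frames) and points per frame
# One human-centric point cloud video: XYZ coordinates per point, per frame.
pc_video = np.random.randn(T, N, 3).astype(np.float32)

# A downstream human-centric task pairs the clip with a label;
# here we attach a single (hypothetical) action label.
sample = {"points": pc_video, "action": "waving"}
print(sample["points"].shape)  # (32, 2048, 3)
```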

AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

no code yet • 24 Mar 2024

In everyday communication, humans frequently use speech and gestures to refer to specific areas or objects, a process known as Referential Dialogue (RD).

VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

no code yet • 21 Mar 2024

In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs.
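
The summary only tells us that the framework leans on LLM reasoning with self-refinement. A generic generate-execute-refine loop, with stand-in stubs `call_llm` and `execute_plan`, might look like the sketch below; this illustrates the general idea, not VURF's actual pipeline.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call; returns a textual plan."""
    return "plan: sample 8 frames; caption each; aggregate captions; answer"

def execute_plan(plan: str, video_path: str):
    """Stand-in executor; returns (answer, error) for the generated plan."""
    return "a person is cooking", None

def answer_with_self_refinement(question: str, video_path: str, max_rounds: int = 3) -> str:
    """Generic generate-execute-refine loop driven by LLM feedback."""
    plan = call_llm(f"Write a step-by-step plan to answer: {question}")
    for _ in range(max_rounds):
        answer, error = execute_plan(plan, video_path)
        if error is None:
            return answer
        # Feed the failure back to the LLM and ask for a corrected plan.
        plan = call_llm(f"The plan failed with: {error}. Revise the plan for: {question}")
    return "unable to answer"

print(answer_with_self_refinement("What is the person doing?", "demo.mp4"))
```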

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

no code yet • 18 Mar 2024

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.
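
One coarse way to picture such a memory-augmented agent: caption short segments with a vision-language model, store the captions as memory entries, and let an LLM answer questions over retrieved entries. The sketch below uses stand-in stubs (`caption_segment`, `call_llm`) and naive keyword retrieval; it is an assumption-laden illustration, not the VideoAgent architecture.

```python
from typing import List, Tuple

def caption_segment(video_path: str, start_s: float, end_s: float) -> str:
    """Stand-in for a vision-language model captioning one segment."""
    return f"segment {start_s:.0f}-{end_s:.0f}s: a person enters the kitchen"

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return "The person enters the kitchen near the start of the video."

def build_memory(video_path: str, duration_s: float, step_s: float = 10.0) -> List[Tuple[float, str]]:
    """Caption the video in fixed windows and keep (timestamp, caption) entries."""
    memory, t = [], 0.0
    while t < duration_s:
        memory.append((t, caption_segment(video_path, t, min(t + step_s, duration_s))))
        t += step_s
    return memory

def answer(question: str, memory: List[Tuple[float, str]], k: int = 5) -> str:
    """Retrieve the k entries sharing the most words with the question, then ask the LLM."""
    q_words = set(question.lower().split())
    scored = sorted(memory, key=lambda e: len(q_words & set(e[1].lower().split())), reverse=True)
    context = "\n".join(f"[{t:.0f}s] {c}" for t, c in scored[:k])
    return call_llm(f"Video notes:\n{context}\n\nQuestion: {question}")

memory = build_memory("long_video.mp4", duration_s=3600)
print(answer("When does the person enter the kitchen?", memory))
```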

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

no code yet • 15 Mar 2024

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.

Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions

no code yet • 11 Mar 2024

While existing video editing tasks are limited to changes in attributes, backgrounds, and styles, our method aims to predict open-ended human action changes in video.

Beyond MOT: Semantic Multi-Object Tracking

no code yet • 8 Mar 2024

Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos.
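
To ground the "where" part, the sketch below links per-frame detections into trajectories with a deliberately simple greedy IoU match between consecutive frames. Real trackers add motion models and appearance cues, so treat this purely as an illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track(frames: List[List[Box]], iou_thr: float = 0.3) -> Dict[int, List[Tuple[int, Box]]]:
    """Greedily link detections frame-to-frame; returns track_id -> [(frame, box), ...]."""
    tracks: Dict[int, List[Tuple[int, Box]]] = {}
    prev: Dict[int, Box] = {}                      # track_id -> last box
    next_id = 0
    for f, dets in enumerate(frames):
        assigned: Dict[int, Box] = {}
        used = set()
        for det in dets:
            # Match this detection to the best unclaimed existing track.
            best_id, best_iou = None, iou_thr
            for tid, last in prev.items():
                if tid not in used and iou(det, last) >= best_iou:
                    best_id, best_iou = tid, iou(det, last)
            if best_id is None:                    # no match: start a new track
                best_id, next_id = next_id, next_id + 1
                tracks[best_id] = []
            tracks[best_id].append((f, det))
            assigned[best_id] = det
            used.add(best_id)
        prev = assigned
    return tracks

# Two detections moving slightly between two frames -> two trajectories.
frames = [[(0, 0, 10, 10), (50, 50, 60, 60)],
          [(1, 1, 11, 11), (52, 52, 62, 62)]]
print(track(frames))
```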

A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives

no code yet • 5 Mar 2024

Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, all at once.

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

no code yet • 3 Mar 2024

The development of multimodal models has marked a significant step forward in how machines understand videos.