Video Understanding
300 papers with code • 0 benchmarks • 42 datasets
A central task in Video Understanding is to recognise and localise (in space and time) the different actions or events that appear in a video.
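To make the recognition half of the task concrete, here is a minimal sketch of clip-level action recognition with a pretrained 3D CNN from torchvision (r3d_18, trained on Kinetics-400). The random tensor is only a stand-in for frames decoded from a real video.

```python
# Minimal sketch: clip-level action recognition with a pretrained 3D CNN.
# The random uint8 tensor below stands in for 16 decoded video frames.
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT           # Kinetics-400 weights
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()          # resize, crop, normalise

clip = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)  # (T, C, H, W)

with torch.no_grad():
    batch = preprocess(clip).unsqueeze(0)  # -> (1, C, T, H, W)
    logits = model(batch)

print("Predicted action:", weights.meta["categories"][logits.argmax().item()])
```

Temporal localisation goes a step further than this sketch, predicting the start and end time of each action rather than a single clip-level label.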
Latest papers with no code
OW-VISCap: Open-World Video Instance Segmentation and Captioning
We propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
A Unified Framework for Human-centric Point Cloud Video Understanding
Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications.
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
In everyday communication, humans frequently use speech and gestures to refer to specific areas or objects, a process known as Referential Dialogue (RD).
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs.
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.
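Both VideoAgent entries follow a broadly similar agent pattern: sample a few frames, convert them into textual observations kept in a memory, and let an LLM decide whether it can answer or needs to inspect more frames. The sketch below illustrates that loop in generic form only; it is not either paper's implementation, and caption_frames and ask_llm are hypothetical stand-ins for a vision-language captioner and an LLM API call.

```python
# Generic sketch of an LLM-as-agent loop for long-video QA (not either
# paper's actual method). caption_frames() and ask_llm() are hypothetical
# stand-ins for a vision-language captioner and an LLM API call.
from typing import Callable

def answer_video_question(
    frames: list,                            # decoded frames of a long video
    question: str,
    caption_frames: Callable[[list], list],  # frames -> captions
    ask_llm: Callable[[str], str],           # prompt -> reply
    max_rounds: int = 5,
) -> str:
    memory = {}                              # frame index -> caption ("memory")
    # Start from a sparse, uniform sample of the video.
    indices = list(range(0, len(frames), max(1, len(frames) // 8)))
    context = ""
    for _ in range(max_rounds):
        new = [i for i in indices if i not in memory]
        for i, cap in zip(new, caption_frames([frames[i] for i in new])):
            memory[i] = cap
        context = "\n".join(f"t={i}: {c}" for i, c in sorted(memory.items()))
        reply = ask_llm(
            f"Video observations:\n{context}\n\nQuestion: {question}\n"
            "Answer, or reply NEED:<comma-separated frame indices> to see more."
        )
        if not reply.startswith("NEED:"):
            return reply                     # the LLM is confident enough
        indices = [int(t) for t in reply[5:].split(",") if t.strip().isdigit()]
    return ask_llm(f"Observations:\n{context}\n\nGive a best-effort answer: {question}")
```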
Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions
While existing video editing tasks are limited to changes in attributes, backgrounds, and styles, our method aims to predict open-ended human action changes in video.
Beyond MOT: Semantic Multi-Object Tracking
Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., "where") in videos.
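The gap between "where" and "what" is easy to see in a data structure: a plain MOT track carries only per-frame boxes, while a semantic track also carries class and behaviour descriptions. The field names below are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative data structure contrasting plain MOT output ("where") with a
# semantic track ("where" + "what"). Field names are assumptions, not the
# paper's schema.
from dataclasses import dataclass, field

@dataclass
class SemanticTrack:
    track_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)
    category: str = "unknown"                  # object class, e.g. "person"
    activity: str = ""                         # behaviour, e.g. "crossing the street"

track = SemanticTrack(track_id=7, category="person", activity="walking a dog")
track.boxes[0] = (12.0, 30.5, 96.0, 210.0)     # bounding box in frame 0
track.boxes[1] = (14.0, 31.0, 98.5, 211.5)     # bounding box in frame 1
```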
A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives
Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, all at once.
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
The development of multimodal models has marked a significant step forward in how machines understand videos.