Video Understanding

294 papers with code • 0 benchmarks • 42 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Benchmarks

These leaderboards are used to track progress in Video Understanding.

You can find evaluation results in the subtasks. You can also submit evaluation metrics for this task.

Libraries

Use these libraries to find Video Understanding models and implementations

open-mmlab/mmaction2 • 7 papers • 3,866 stars

towhee-io/towhee • 4 papers • 2,972 stars

google-research/scenic • 2 papers • 2,988 stars

MIT-HAN-LAB/temporal-shift-module • 2 papers • 2,015 stars

Datasets

Subtasks

Latest papers

Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection

edengabriel/taskweave • 14 Apr 2024

Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, but only recently have they been studied jointly.

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM • 100 stars • 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

LongVLM: Efficient Long Video Understanding via Large Language Models

ziplab/longvlm • 4 Apr 2024

In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos.

SnAG: Scalable and Accurate Video Grounding

happyharrycn/actionformer_release • 381 stars • 2 Apr 2024

In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models.

R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

yeliudev/R2-Tuning • 2 Apr 2024

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries.

$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

yeliudev/R2-Tuning • 31 Mar 2024

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries.

ST-LLM: Large Language Models Are Effective Temporal Learners

TencentARC/ST-LLM • 30 Mar 2024

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

lancopku/mr-vpc • 28 Mar 2024

Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries.

OmniVid: A Generative Framework for Universal Video Understanding

wangjk666/omnivid • 26 Mar 2024

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.

Understanding Long Videos in One Multimodal Language Model Pass

kahnchana/mvu • 25 Mar 2024

In addition to faster inference, we discover the resulting models to yield surprisingly good accuracy on long-video tasks, even with no video-specific information.
