Video Understanding
293 papers with code • 0 benchmarks • 42 datasets
A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.
Benchmarks
These leaderboards are used to track progress in Video Understanding
Libraries
Use these libraries to find Video Understanding models and implementations
Datasets
Subtasks
Latest papers
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Language Repository for Long Video Understanding
In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation.
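As a loose illustration only (not the paper's implementation), the sketch below shows what an all-textual, structured store of per-segment video descriptions could look like; every class, field, and example string in it is hypothetical.

```python
# Toy illustration of an all-textual, structured repository of per-segment
# video descriptions. This is NOT LangRepo's actual code; all names are made up.
from dataclasses import dataclass, field

@dataclass
class Entry:
    start_s: float
    end_s: float
    text: str  # short natural-language description of the segment

@dataclass
class LanguageRepository:
    entries: list[Entry] = field(default_factory=list)

    def write(self, start_s: float, end_s: float, text: str) -> None:
        self.entries.append(Entry(start_s, end_s, text))

    def read(self) -> str:
        # Everything stays textual, so the context handed to the LLM is
        # directly interpretable by a human as well.
        return "\n".join(f"[{e.start_s:.0f}-{e.end_s:.0f}s] {e.text}" for e in self.entries)

repo = LanguageRepository()
repo.write(0, 10, "a person opens a fridge and takes out milk")
repo.write(10, 25, "they pour the milk into a bowl of cereal")
print(repo.read())
```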
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding.
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Don't Judge by the Look: Towards Motion Coherent Video Representation
Current training pipelines in object recognition neglect Hue Jittering during data augmentation: it introduces appearance changes that are detrimental to classification, and its implementation is inefficient in practice.
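For context, hue jittering is ordinarily applied with a standard colour transform; the snippet below is a minimal sketch using torchvision's ColorJitter, with illustrative parameter values and a dummy clip rather than anything from the paper.

```python
# Minimal sketch of standard hue jittering with torchvision; values are illustrative.
import torch
from torchvision import transforms

# ColorJitter's hue argument shifts the hue channel by a random factor in [-hue, hue].
# Internally this involves an RGB -> HSV -> RGB round trip, which is one reason the
# operation is considered costly on large video batches.
hue_jitter = transforms.ColorJitter(hue=0.25)

frames = torch.rand(8, 3, 224, 224)  # dummy clip: 8 RGB frames with values in [0, 1]
jittered = hue_jitter(frames)        # one sampled hue shift applied to the whole clip
```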
VideoMamba: State Space Model for Efficient Video Understanding
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts the Mamba architecture to the video domain.
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones.
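As a rough illustration of the underlying idea (not FastV's exact procedure), the sketch below ranks visual tokens by the attention they receive in an early layer and keeps only the top fraction for the remaining layers; all function names, shapes, and values are hypothetical.

```python
# Hypothetical sketch of attention-based visual token pruning; not FastV's algorithm.
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """hidden: (seq, dim) early-layer output; attn: (seq, seq) averaged attention map.
    Keeps the keep_ratio most-attended visual tokens in [vis_start, vis_end)."""
    vis_scores = attn[:, vis_start:vis_end].mean(dim=0)          # attention each visual token receives
    k = max(1, int(keep_ratio * (vis_end - vis_start)))
    keep = vis_scores.topk(k).indices.sort().values + vis_start  # retained visual token positions
    kept_idx = torch.cat([
        torch.arange(0, vis_start),             # prompt tokens before the image
        keep,                                   # surviving visual tokens
        torch.arange(vis_end, hidden.size(0)),  # tokens after the image
    ])
    return hidden[kept_idx], kept_idx

# Example: 576 visual tokens preceded by 32 prompt tokens, half pruned after an early layer.
seq, dim = 32 + 576 + 64, 4096
hidden = torch.randn(seq, dim)
attn = torch.softmax(torch.randn(seq, seq), dim=-1)
pruned, idx = prune_visual_tokens(hidden, attn, vis_start=32, vis_end=32 + 576)
```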
World Model on Million-Length Video And Language With Blockwise RingAttention
To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens.
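The gradual context extension can be pictured as a staged schedule; the sketch below is a hypothetical curriculum with made-up stage lengths and a placeholder training call, not the paper's actual configuration.

```python
# Hypothetical context-length curriculum from 4K to 1M tokens; stage sizes are illustrative.
context_schedule = [
    (4_096, 10_000),     # start at a 4K-token context
    (32_768, 5_000),
    (262_144, 2_000),
    (1_048_576, 1_000),  # finish at a 1M-token context
]

for context_len, num_steps in context_schedule:
    # Re-pack the data into sequences of the current length, then train for num_steps.
    # train_steps(model, data, context_len, num_steps)  # hypothetical helper
    print(f"training {num_steps} steps at context length {context_len}")
```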
Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning
High-quality and consistent annotations are fundamental to the successful development of robust machine learning models.
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos.