Video Understanding

293 papers with code • 0 benchmarks • 42 datasets

A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events that appear in a video.

Source: Action Detection from a Robot-Car Perspective

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Language Repository for Long Video Understanding

kkahatapitiya/langrepo 21 Mar 2024

In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation.

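The core idea, an all-textual memory that is written to and pruned entirely in natural language, is easy to sketch. The class below is a hypothetical illustration, not the released LangRepo API: it stores short captions per video segment and serialises them into a prompt-ready string.

```python
# Hypothetical sketch of an all-textual "language repository":
# entries are plain strings keyed by video segment, so the whole
# state stays human-readable and can be fed back to an LLM.
from collections import defaultdict

class LanguageRepository:
    def __init__(self, max_entries_per_segment=5):
        self.max_entries = max_entries_per_segment
        self.entries = defaultdict(list)  # segment id -> list of captions

    def write(self, segment: int, text: str) -> None:
        """Add a caption; drop the oldest once a segment is full
        (a stand-in for the paper's rewriting/pruning operations)."""
        self.entries[segment].append(text.strip())
        self.entries[segment] = self.entries[segment][-self.max_entries:]

    def read(self) -> str:
        """Serialise the repository into one prompt-ready string."""
        lines = []
        for segment in sorted(self.entries):
            for caption in self.entries[segment]:
                lines.append(f"[segment {segment}] {caption}")
        return "\n".join(lines)

repo = LanguageRepository()
repo.write(0, "A person opens a fridge.")
repo.write(1, "They pour milk into a glass.")
print(repo.read())
```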

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

buxiangzhiren/vd-it 18 Mar 2024

We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding.

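The recipe the abstract implies, reading off intermediate activations of a pretrained denoising network and reusing them as video features, can be sketched with a forward hook. The backbone below is a stand-in module, not the actual T2V model used in the paper.

```python
# Minimal sketch: capture intermediate activations of a pretrained
# denoising network via a forward hook and reuse them as video
# features. `TinyDenoiser` is a stand-in for a real T2V U-Net.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Conv3d(3, 16, 3, padding=1)
        self.mid = nn.Conv3d(16, 16, 3, padding=1)   # features tapped here
        self.up = nn.Conv3d(16, 3, 3, padding=1)

    def forward(self, x):
        return self.up(self.mid(self.down(x)))

model = TinyDenoiser().eval()
features = {}
model.mid.register_forward_hook(
    lambda mod, inp, out: features.update(mid=out.detach())
)

video = torch.randn(1, 3, 8, 64, 64)   # (batch, C, T, H, W)
with torch.no_grad():
    model(video)                        # one "denoising" pass

feats = features["mid"]                 # (1, 16, 8, 64, 64)
print(feats.shape)                      # reuse downstream, e.g. in a segmentation head
```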

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

opengvlab/video-mamba-suite 14 Mar 2024

We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.

Don't Judge by the Look: Towards Motion Coherent Video Representation

bespontaneous/mca-pytorch 14 Mar 2024

Current training pipelines in object recognition neglect hue jittering during data augmentation, as it not only brings appearance changes that are detrimental to classification, but is also inefficient to implement in practice.

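A minimal illustration of the motion-coherent idea, not the paper's exact MCA implementation: sample one hue shift per clip and apply it to every frame, so appearance changes while frame-to-frame motion cues stay intact.

```python
# Sketch: clip-level hue variation. One hue factor is sampled per
# clip and shared by all frames, so the appearance change never
# disturbs motion between frames. Not the official MCA code.
import torch
import torchvision.transforms.functional as TF

def clip_hue_jitter(clip: torch.Tensor, max_shift: float = 0.5) -> torch.Tensor:
    """clip: (T, C, H, W) float tensor in [0, 1]."""
    hue_factor = (torch.rand(1).item() * 2 - 1) * max_shift  # in [-max_shift, max_shift]
    return torch.stack([TF.adjust_hue(frame, hue_factor) for frame in clip])

clip = torch.rand(16, 3, 112, 112)   # 16 RGB frames
augmented = clip_hue_jitter(clip)
print(augmented.shape)               # torch.Size([16, 3, 112, 112])
```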

VideoMamba: State Space Model for Efficient Video Understanding

opengvlab/videomamba 11 Mar 2024

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain.

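The adaptation hinges on turning a video into a single token sequence that a state space model can scan in linear time. A rough sketch of that front end (3D patch embedding followed by flattening), with illustrative sizes and the Mamba blocks themselves omitted:

```python
# Sketch of the video-to-sequence front end: a 3D patch embedding
# flattens a clip into one 1D token sequence for an SSM to scan.
# Sizes are illustrative; the Mamba blocks are omitted.
import torch
import torch.nn as nn

patch_embed = nn.Conv3d(
    in_channels=3, out_channels=192,
    kernel_size=(2, 16, 16), stride=(2, 16, 16),
)

video = torch.randn(1, 3, 16, 224, 224)      # (B, C, T, H, W)
tokens = patch_embed(video)                  # (1, 192, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 8*14*14, 192) = (1, 1568, 192)
print(tokens.shape)  # this 1D sequence is what the SSM scans, in linear time
```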

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

pkunlp-icler/fastv 11 Mar 2024

To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones.

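The pruning step can be sketched independently of any particular LLM: after some early layer, rank the visual tokens by the attention they receive and keep only the top fraction for all later layers. This is illustrative only; the function name and the 50% keep ratio are assumptions, not the released FastV code.

```python
# Toy sketch of attention-based visual-token pruning (FastV-style):
# score each visual token by the average attention it receives,
# then keep only the top-k for all subsequent layers.
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """hidden: (B, N, D); attn: (B, heads, N, N) from the layer before pruning."""
    # Mean attention each visual token receives, averaged over heads and queries.
    scores = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_end]   # (B, num_visual)
    k = max(1, int(scores.shape[1] * keep_ratio))
    top = scores.topk(k, dim=1).indices.sort(dim=1).values + vis_start  # keep order
    batch = torch.arange(hidden.shape[0]).unsqueeze(1)
    kept_visual = hidden[batch, top]                               # (B, k, D)
    # Reassemble: tokens before, surviving visual tokens, tokens after.
    return torch.cat([hidden[:, :vis_start], kept_visual, hidden[:, vis_end:]], dim=1)

hidden = torch.randn(2, 600, 64)             # e.g. 576 visual + 24 text tokens
attn = torch.softmax(torch.randn(2, 8, 600, 600), dim=-1)
pruned = prune_visual_tokens(hidden, attn, vis_start=0, vis_end=576)
print(pruned.shape)                          # torch.Size([2, 312, 64])
```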

World Model on Million-Length Video And Language With Blockwise RingAttention

LargeWorldModel/LWM 13 Feb 2024

To address these challenges, we curate a large dataset of diverse videos and books, utilize the Blockwise RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens.

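The curriculum itself is straightforward: train in stages whose sequence length doubles from 4K up to 1M tokens. A hedged sketch of such a schedule follows; the stage lengths and step counts are placeholders, not the paper's actual recipe.

```python
# Sketch of a progressive context-length curriculum: each stage
# doubles the sequence length from 4K up to 1M tokens. Step counts
# are placeholders, not the paper's actual schedule.
def context_schedule(start=4096, end=1_048_576, steps_per_stage=1000):
    length = start
    while length <= end:
        yield length, steps_per_stage
        length *= 2

for seq_len, steps in context_schedule():
    print(f"train {steps} steps at context length {seq_len}")
    # train(model, seq_len=seq_len, steps=steps)  # hypothetical training call
```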

Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning

netflix/videoannotator 9 Feb 2024

High-quality and consistent annotations are fundamental to the successful development of robust machine learning models.

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

jy0205/lavit 5 Feb 2024

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos.
