Video Classification

172 papers with code • 11 benchmarks • 17 datasets

Video Classification is the task of producing a label that is relevant to a video given its frames. A good video-level classifier not only provides accurate frame labels but also best describes the entire video, given the features and annotations of its frames. For example, a video might contain a tree in some frame, but the label central to the video might be something else (e.g., “hiking”). The granularity of the labels needed to describe the frames and the video depends on the task. Typical tasks include assigning one or more global labels to the video and assigning one or more labels to each frame in the video.

Source: Efficient Large Scale Video Classification
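
The distinction between frame-level and video-level labels can be illustrated with a minimal sketch. It assumes per-frame class scores are already available from some frame classifier, and it uses mean pooling over frames as the aggregation step. Mean pooling is only one simple choice, not necessarily the method of the source cited above. The label set, the scores, the function names, and the 0.3 threshold are all hypothetical.

```python
# Hypothetical label set and per-frame scores (one row per frame,
# one column per label); in practice these come from a frame classifier.
LABELS = ["tree", "hiking", "river"]

FRAME_SCORES = [
    [0.7, 0.2, 0.1],  # a tree dominates this frame
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]


def mean_scores(frame_scores):
    """Average each label's score over all frames (mean pooling)."""
    n_frames = len(frame_scores)
    n_labels = len(frame_scores[0])
    return [sum(row[i] for row in frame_scores) / n_frames for i in range(n_labels)]


def frame_labels(frame_scores, labels):
    """Frame-level task: pick the top-scoring label for each frame."""
    return [labels[row.index(max(row))] for row in frame_scores]


def video_label(frame_scores, labels):
    """Video-level task, single label: top label after mean pooling."""
    pooled = mean_scores(frame_scores)
    return labels[pooled.index(max(pooled))]


def video_labels(frame_scores, labels, threshold=0.3):
    """Video-level task, multi-label: every label whose pooled score
    reaches the (assumed) threshold."""
    pooled = mean_scores(frame_scores)
    return [label for label, score in zip(labels, pooled) if score >= threshold]


if __name__ == "__main__":
    print(frame_labels(FRAME_SCORES, LABELS))
    print(video_label(FRAME_SCORES, LABELS))
    print(video_labels(FRAME_SCORES, LABELS))
```

In this toy input the first frame is labeled "tree", yet the pooled video-level label is "hiking", which mirrors the example in the description.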

Benchmarks

These leaderboards are used to track progress in Video Classification.

Dataset	Best Model
Breakfast	MA-LMM
COIN	MA-LMM
YouTube-8M	DCGN (self-attention graph pooling)
MoB	VTN
Hockey Fight Detection Dataset	CNN+LSTM
Kinetics	Multigrid
Charades	Multigrid
Something-Something V1	MSNet-R50En (ours)
Something-Something V2	MSNet-R50En (ours)
Multimodal PISA	MMDL
Home Action Genome	Cooperative Ours (3rd-person)

Libraries

Use these libraries to find Video Classification models and implementations.

open-mmlab/mmaction2 • 6 papers • 3,876 stars
rwightman/pytorch-image-models • 3 papers • 29,680 stars
facebookresearch/detectron • 2 papers • 26,140 stars
open-mmlab/mmclassification • 2 papers • 3,140 stars

See all 6 libraries.

Datasets

Latest papers

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM • 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

105 stars

X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

annusha/xmic • 28 Mar 2024

Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition.

Multi-modality transrectal ultrasound video classification for identification of clinically significant prostate cancer

2313595986/prostatetrus • 14 Feb 2024

With the aim of effectively identifying prostate cancer, we propose a framework for the classification of clinically significant prostate cancer (csPCa) from multi-modality TRUS videos.

Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning

netflix/videoannotator • 9 Feb 2024

High-quality and consistent annotations are fundamental to the successful development of robust machine learning models.

FakeClaim: A Multiple Platform-driven Dataset for Identification of Fake News on 2023 Israel-Hamas War

gautamshahi/fakeclaim • 29 Jan 2024

We contribute the first publicly available dataset of factual claims from different platforms and fake YouTube videos on the 2023 Israel-Hamas war for automatic fake YouTube video classification.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

opengvlab/internvl • 21 Dec 2023

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

714 stars

Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach

qinying-liu/case • ICCV 2023

It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background.

21 Dec 2023

MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

tacju/maxtron • 30 Nov 2023

To alleviate the issue, we propose to adapt the trajectory attention for both the dense pixel features and object queries, aiming to improve the short-term and long-term tracking results, respectively.

Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments

calvintanama/qd-driver-activity-reco • 10 Nov 2023

The framework enhances 3D MobileNet, a neural architecture optimized for speed in video classification, by incorporating knowledge distillation and model quantization to balance model accuracy and computational efficiency.

Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval

dun-research/drca • 15 Sep 2023

To address these issues, we propose an efficient video representation network with Differentiable Resolution Compression and Alignment mechanism, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations.
