A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.
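As a rough illustration of what adaptively learned tokens can mean in practice, here is a minimal sketch of one common realisation, assuming per-token spatial attention maps followed by weighted pooling; the module name, layer choices, and shapes below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveTokenizer(nn.Module):
    """Reduce an H x W feature map to a handful of learned tokens.

    Each output token is a spatial average of the input features,
    weighted by a learned attention map (one map per token).
    Hypothetical sketch, not the paper's implementation.
    """

    def __init__(self, channels: int, num_tokens: int = 8):
        super().__init__()
        # Predict one attention logit map per output token.
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W)
        maps = self.attn(x).flatten(2).softmax(dim=-1)  # (b, tokens, H*W)
        feats = x.flatten(2)                            # (b, channels, H*W)
        tokens = torch.einsum("btn,bcn->btc", maps, feats)
        return tokens                                   # (b, tokens, channels)

# Example: 8 tokens summarise a 14x14 feature map from one frame.
frame_feats = torch.randn(2, 256, 14, 14)
tokens = AdaptiveTokenizer(256, num_tokens=8)(frame_feats)
assert tokens.shape == (2, 8, 256)
```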
Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
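A minimal sketch of the contrastive objective this describes, assuming an InfoNCE-style loss in which a short clip's embedding must match the embedding of a longer clip from the same video against other videos in the batch; the encoder and clip sampling are placeholders, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(short_emb: torch.Tensor, long_emb: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: embeddings of the short and long clips from the
    same video are positives; other videos in the batch are negatives."""
    z_s = F.normalize(short_emb, dim=-1)    # (batch, dim)
    z_l = F.normalize(long_emb, dim=-1)     # (batch, dim)
    logits = z_s @ z_l.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(z_s.size(0))     # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Example with a placeholder encoder: the short clip spans a few frames,
# the long clip a wider temporal extent of the same video.
batch, dim = 4, 128
short_emb = torch.randn(batch, dim)  # stands in for encoder(short_clips)
long_emb = torch.randn(batch, dim)   # stands in for encoder(long_clips)
loss = info_nce(short_emb, long_emb)
```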
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context.
In this paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset.
In this paper, we propose a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame and hence may serve as a robust estimate even in challenging scenarios such as occlusion.
Modeling the visual changes that an action brings to a scene is critical for video understanding.
In this paper, we rely on a convolutional spiking neural network trained with STDP, and we test the performance of this network on action recognition tasks.
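For context, a compact sketch of the pair-based STDP rule such a network typically relies on: a synapse is strengthened when a presynaptic spike precedes the postsynaptic spike and weakened otherwise, with exponentially decaying influence. The constants and clipping below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def stdp_update(w: float, t_pre: float, t_post: float,
                a_plus: float = 0.01, a_minus: float = 0.012,
                tau: float = 20.0) -> float:
    """Pair-based STDP: potentiate if the presynaptic spike arrives
    before the postsynaptic spike (dt > 0), depress otherwise.
    Spike times are in ms; constants are illustrative only."""
    dt = t_post - t_pre
    if dt > 0:
        w += a_plus * np.exp(-dt / tau)   # pre before post: strengthen
    else:
        w -= a_minus * np.exp(dt / tau)   # post before pre: weaken
    return float(np.clip(w, 0.0, 1.0))    # keep the weight in [0, 1]

# Example: a spike pair 5 ms apart slightly potentiates the synapse.
w = stdp_update(w=0.5, t_pre=10.0, t_post=15.0)
```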