Temporal Action Localization
422 papers with code • 14 benchmarks • 42 datasets
Temporal Action Localization aims to detect activities in a video stream and output their beginning and end timestamps. It is closely related to Temporal Action Proposal Generation.
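Predicted segments are typically scored against ground-truth intervals using temporal IoU (overlap of the two time spans divided by their union). A minimal sketch, with an illustrative function name:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments (seconds or frame indices)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # ≈ 0.333: 2s overlap over a 6s union
```

A prediction usually counts as correct when its temporal IoU with a ground-truth segment of the same class exceeds a threshold (commonly swept from 0.3 to 0.7).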
Latest papers
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).
Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision
These works overlook differences in performance among modalities, which leads to erroneous knowledge propagating between them; moreover, only three fundamental modalities (i.e., joints, bones, and motions) are used, and no additional modalities are explored.
Temporal Action Localization with Enhanced Instant Discriminability
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
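Detectors of this kind typically emit many overlapping candidate segments per class, which are then deduplicated with temporal non-maximum suppression. A hedged sketch of the standard greedy procedure (not the paper's specific pipeline; the threshold is illustrative):

```python
def temporal_nms(segments, scores, iou_thresh=0.5):
    """Greedy NMS over 1-D (start, end) proposals; returns kept indices."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:                       # visit proposals by descending score
        s, e = segments[i]
        suppressed = False
        for j in keep:                    # suppress if it overlaps a kept segment
            ks, ke = segments[j]
            inter = max(0.0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep
```

For example, with proposals `[(0, 10), (1, 11), (20, 30)]` and scores `[0.9, 0.8, 0.7]`, the second proposal is suppressed by the first and the indices `[0, 2]` survive.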
CDFSL-V: Cross-Domain Few-Shot Learning for Videos
To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains.
B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition
Human Action Recognition is a driving engine of many human-computer interaction applications.
POCO: 3D Pose and Shape Estimation with Confidence
To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass.
HR-Pro: Point-supervised Temporal Action Localization via Hierarchical Reliability Propagation
For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class.
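A common way to maintain an online-updated prototype per class is an exponential moving average over incoming snippet features. The sketch below illustrates that general mechanism under assumptions of my own (the class structure, momentum value, and names are not taken from HR-Pro):

```python
import numpy as np

class PrototypeMemory:
    """Per-class prototype memory refreshed by an exponential moving average.
    Illustrative sketch; hyperparameters are assumptions, not HR-Pro's values."""

    def __init__(self, num_classes, dim, momentum=0.99):
        self.protos = np.zeros((num_classes, dim))
        self.seen = np.zeros(num_classes, dtype=bool)
        self.momentum = momentum

    def update(self, cls, feat):
        if not self.seen[cls]:
            self.protos[cls] = feat          # first reliable snippet initializes
            self.seen[cls] = True
        else:
            m = self.momentum                # blend old prototype with new feature
            self.protos[cls] = m * self.protos[cls] + (1 - m) * feat
```

In point-supervised settings, only snippets judged reliable (e.g., near labeled points or with high confidence) would feed such an update, so the prototypes drift slowly toward trustworthy class statistics.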
DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action Recognition
Graph Convolutional Networks (GCNs) have been widely used in skeleton-based human action recognition.
Video BagNet: short temporal receptive fields increase robustness in long-term action recognition
Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF).
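The temporal receptive field of stacked convolutions grows with each layer's kernel size, scaled by the accumulated stride of the layers before it. A small helper showing the standard computation (the layer configuration is hypothetical, not Video BagNet's architecture):

```python
def temporal_receptive_field(layers):
    """Temporal RF of stacked (kernel_t, stride_t) conv/pool layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the RF by (k-1) * current jump
        jump *= s              # stride compounds the step between output frames
    return rf

# e.g. two temporal-kernel-3 convs followed by a kernel-3, stride-2 layer
print(temporal_receptive_field([(3, 1), (3, 1), (3, 2)]))  # 7 frames
```

Deep 3D-CNNs stack many such layers, so their temporal RF spans hundreds of frames; the paper's point is that deliberately keeping this number small can make long-term recognition more robust.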
UnLoc: A Unified Framework for Video Localization Tasks
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos remains relatively unexplored.