Action Segmentation

99 papers with code • 9 benchmarks • 18 datasets

Action Segmentation is a challenging problem in high-level video understanding. In its simplest form, Action Segmentation aims to segment a temporally untrimmed video by time and label each segmented part with one of pre-defined action labels. The results of Action Segmentation can be further used as input to various applications, such as video-to-text and action localization.

Source: TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation

Libraries

Use these libraries to find Action Segmentation models and implementations
2 papers
31,616

Most implemented papers

Temporal Convolutional Networks for Action Segmentation and Detection

colincsl/TemporalConvolutionalNetworks CVPR 2017

The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond.

Temporal Action Segmentation: An Analysis of Modern Techniques

atlas-eccv22/awesome-temporal-action-segmentation 19 Oct 2022

Temporal action segmentation (TAS) in videos aims at densely identifying video frames in minutes-long videos with multiple action classes.

Hierarchical NeuroSymbolic Approach for Comprehensive and Explainable Action Quality Assessment

laurenok24/nsaqa 20 Mar 2024

We found that domain experts prefer our system and find it more informative than purely neural approaches to AQA in diving.

MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation

yabufarha/ms-tcn CVPR 2019

Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics.

Unsupervised learning of action classes with continuous temporal embedding

annusha/unsup_temp_embed CVPR 2019

The task of temporally detecting and segmenting actions in untrimmed videos has seen an increased attention recently.

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

microsoft/UniVL 15 Feb 2020

However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.

Global2Local: Efficient Structure Search for Video Action Segmentation

ShangHua-Gao/G2L-search CVPR 2021

Our search scheme exploits both global search to find the coarse combinations and local search to get the refined receptive field combination patterns further.

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

pytorch/fairseq EMNLP 2021

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

RF-Next: Efficient Receptive Field Search for Convolutional Neural Networks

ShangHua-Gao/RFNext 14 Jun 2022

Our search scheme exploits both global search to find the coarse combinations and local search to get the refined receptive field combinations further.