Action Localization

135 papers with code • 0 benchmarks • 3 datasets

Action Localization is the task of finding the spatial and temporal coordinates of an action in a video. An action localization model identifies the frames at which an action starts and ends, and returns the (x, y) coordinates of the action within each frame. These coordinates change over time as the object performing the action moves.
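To make the task concrete, here is a minimal sketch of how a single localized action instance might be represented, together with the temporal intersection-over-union (tIoU) metric commonly used to score temporal localization. The schema and names are illustrative assumptions, not a specific library's API.

```python
from dataclasses import dataclass, field

@dataclass
class ActionInstance:
    """One detected action in a video (illustrative schema, not a standard API)."""
    label: str                      # action class, e.g. "long_jump"
    start_frame: int                # first frame containing the action
    end_frame: int                  # last frame containing the action
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) frame intervals.

    This is the standard overlap measure used to match predicted
    temporal segments against ground truth when computing mAP.
    """
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction matching ground truth `(0, 10)` with segment `(5, 15)` overlaps on 5 frames out of a 15-frame union, giving a tIoU of 1/3; evaluation protocols typically count a detection as correct above a tIoU threshold such as 0.5.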

Libraries

Use these libraries to find Action Localization models and implementations

Most implemented papers

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

harlanhong/MM2021-CO2-Net 27 Jul 2021

In this work, we argue that the features extracted from a pretrained extractor, e.g. I3D, are not WS-TAL task-specific features; thus, feature re-calibration is needed to reduce task-irrelevant information redundancy.

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

pytorch/fairseq EMNLP 2021

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

Structured Attention Composition for Temporal Action Localization

vividle/online-action-detection 20 May 2022

To tackle this issue, we make an early effort to study temporal action localization from the perspective of multi-modality feature learning, based on the observation that different actions exhibit specific preferences to appearance or motion modality.

Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge

happyharrycn/actionformer_release 16 Nov 2022

This report describes our submission to the Ego4D Moment Queries Challenge 2022.

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

zzxslp/mm-navigator 13 Nov 2023

We first benchmark MM-Navigator on our collected iOS screen dataset.

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

adapt-python/adapt CVPR 2014

We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.

Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

zhengshou/AutoLoc 4 Apr 2015

To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output.

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

zhengshou/scnn CVPR 2016

To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes on the learned classification network to localize each action instance.
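The first stage of the pipeline described above feeds candidate segments to a proposal network. A common way to generate such candidates from a long untrimmed video is multi-scale sliding windows; the sketch below shows the idea under assumed window lengths and stride, not the paper's exact settings.

```python
def sliding_segments(num_frames, lengths=(16, 32, 64), stride_ratio=0.5):
    """Generate candidate (start, end) frame segments at multiple temporal
    scales, the typical input to a segment-proposal network.

    Window lengths and the 50% overlap stride here are illustrative
    assumptions, not the configuration used by any specific method.
    """
    segments = []
    for length in lengths:
        stride = max(1, int(length * stride_ratio))
        for start in range(0, num_frames - length + 1, stride):
            segments.append((start, start + length))
    return segments
```

Each candidate segment would then be scored by the proposal network, classified one-vs-all, and finally refined by the localization network.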

VideoLSTM Convolves, Attends and Flows for Action Recognition

zhenyangli/VideoLSTM 6 Jul 2016

We present a new architecture for end-to-end sequence learning of actions in video, which we call VideoLSTM.