Temporal Localization
55 papers with code • 0 benchmarks • 3 datasets
Benchmarks
These leaderboards are used to track progress in Temporal Localization
Libraries
Use these libraries to find Temporal Localization models and implementationsLatest papers
LITA: Language Instructed Temporal-Localization Assistant
In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task.
Skeleton-Based Human Action Recognition with Noisy Labels
In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark.
Semi-supervised Active Learning for Video Action Detection
First, we demonstrate its effectiveness on video action detection where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning along with several baseline approaches in both UCF101-24 and JHMDB-21.
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding.
UnLoc: A Unified Framework for Video Localization Tasks
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.
VideoGLUE: Video General Understanding Evaluation of Foundation Models
We evaluate existing foundation models video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task.
Dense Video Object Captioning from Disjoint Supervision
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.
Self-Chained Image-Language Model for Video Localization and Question Answering
SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2.
Unsupervised classification to improve the quality of a bird song recording dataset
We first showed that the segmentation of bird songs alone aggregated from 10% to 83% of label noise depending on the species.
Multi-Task Learning of Object State Changes from Uncurated Videos
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.