Zero-Shot Action Detection
3 papers with code • 2 benchmarks • 2 datasets
Most implemented papers
Prompting Visual-Language Models for Efficient Video Understanding
Image-based visual-language (I-VL) pre-training has shown great success in learning joint visual-textual representations from large-scale web data, demonstrating a remarkable ability for zero-shot generalisation.
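The zero-shot recipe behind such I-VL models can be illustrated with a short sketch. Below is a minimal, hypothetical example using OpenAI's CLIP package: per-frame image features are mean-pooled into a video feature and matched against prompted text embeddings. The class names and prompt template are illustrative assumptions, not taken from the paper.

```python
# Sketch: zero-shot video classification with a frozen I-VL model (OpenAI CLIP).
# Frame features are mean-pooled and compared to prompted class embeddings.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["high jump", "long jump", "pole vault"]  # assumed labels
prompts = clip.tokenize(
    [f"a video of a person doing {c}" for c in class_names]  # assumed template
).to(device)

def classify_video(frames):
    """frames: list of PIL images sampled from the video (loading omitted)."""
    with torch.no_grad():
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_feats = model.encode_image(images)            # (T, D) per-frame features
        video_feat = frame_feats.mean(dim=0, keepdim=True)  # temporal mean pooling
        text_feats = model.encode_text(prompts)             # (C, D) class embeddings
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        logits = 100.0 * video_feat @ text_feats.T          # scaled cosine similarity
    return logits.softmax(dim=-1)                           # (1, C) class probabilities
```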
Zero-Shot Temporal Action Detection via Vision-Language Prompting
This design decouples localization from classification, removing the path along which errors would otherwise propagate between the two stages.
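A minimal sketch of this decoupled design: a class-agnostic "actionness" head proposes segments without reference to any label, and a separate step classifies each pooled segment against text embeddings, so mistakes in one stage cannot corrupt the other. The module names, shapes, and heads below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: decoupled localization (class-agnostic) and classification (zero-shot).
import torch
import torch.nn as nn

class ClassAgnosticLocalizer(nn.Module):
    """Predicts per-snippet actionness, independent of any class label."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 1, kernel_size=1),
        )

    def forward(self, snippet_feats):
        # snippet_feats: (B, D, T) video snippet features
        return self.head(snippet_feats).sigmoid().squeeze(1)  # (B, T) actionness

def classify_segments(segment_feats, text_feats):
    """Zero-shot classification of pooled segment features against prompted
    class embeddings; both inputs are assumed L2-normalized."""
    return segment_feats @ text_feats.T  # (N_segments, N_classes) similarities
```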
UnLoc: A Unified Framework for Video Localization Tasks
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos remains relatively unexplored.
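A rough sketch of what a unified, query-conditioned localization head might look like in this spirit: frame tokens from an untrimmed video are fused with a text query embedding, and a single head predicts per-frame relevance plus boundary offsets, so one interface covers tasks such as moment retrieval and temporal action localization. All names, dimensions, and the additive fusion scheme are assumptions for illustration, not the paper's actual model.

```python
# Sketch: a unified query-conditioned localization head over untrimmed video.
import torch
import torch.nn as nn

class UnifiedLocalizationHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.relevance = nn.Linear(dim, 1)  # per-frame relevance to the query
        self.boundary = nn.Linear(dim, 2)   # offsets to segment start / end

    def forward(self, frame_feats, text_feat):
        # frame_feats: (B, T, D) untrimmed video tokens; text_feat: (B, D) query
        tokens = frame_feats + text_feat.unsqueeze(1)  # simple additive fusion
        fused = self.fuse(tokens)                      # (B, T, D) joint tokens
        return self.relevance(fused).squeeze(-1), self.boundary(fused)
```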