Understanding such videos is challenging: it requires precisely localizing each step and generating the corresponding textual instructions.
The textual narratives forge connections between seen and unseen classes, alleviating the labeled-data bottleneck that has limited progress in this domain.
The goal of this work is to understand the way actions are performed in videos.
Ranked #2 on Video-Adverb Retrieval on HowTo100M Adverbs
A key step to acquire this skill is to identify what part of the object affords each action, which is called affordance grounding.
In practice, a given video can contain multiple valid positive annotations for the same action.
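One common way to accommodate several valid positives per video is to spread the objective's probability mass over all of them instead of a single ground-truth entry. The sketch below illustrates that idea as a multi-positive contrastive loss in PyTorch; the function name, shapes, and averaging choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def multi_positive_nce(sim, pos_mask, temperature=0.07):
    """Contrastive loss where each query may have several valid positives.

    sim      -- (B, C) similarity matrix between B videos and C candidates
    pos_mask -- (B, C) boolean mask, True wherever a candidate is a valid
                positive annotation for that video (possibly several per row)
    """
    logits = sim / temperature
    log_prob = F.log_softmax(logits, dim=1)          # (B, C)
    # Average the log-probability over all valid positives per video.
    pos_log_prob = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -pos_log_prob.mean()
```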
We refer to this task as Procedure Segmentation and Summarization (PSS).
We propose to learn what makes a good video for action recognition and select only high-quality samples for augmentation.
Ranked #2 on Few Shot Action Recognition on HMDB51
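A minimal sketch of the selection idea described above: score candidate clips with some quality model and keep only the top fraction for augmentation. The `scorer` interface and `keep_frac` parameter are hypothetical stand-ins, not the method's actual components.

```python
import torch

@torch.no_grad()
def select_for_augmentation(clips, scorer, keep_frac=0.5):
    """Keep only the highest-quality clips for augmentation.

    clips  -- (N, T, C, H, W) batch of candidate videos
    scorer -- any model mapping a clip batch to per-clip quality scores;
              assumed here to return an (N,) tensor (hypothetical interface)
    """
    scores = scorer(clips)                 # (N,) quality score per clip
    k = max(1, int(keep_frac * len(clips)))
    top = scores.topk(k).indices           # indices of the best clips
    return clips[top]
```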
We address the problem of capturing temporal information for video classification in 2D networks, without increasing their computational cost.
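One well-known way to give a 2D network temporal context at essentially zero extra compute is to shift a small fraction of channels along the time axis, as in the Temporal Shift Module (TSM). Whether or not that is the mechanism used here, the sketch below shows the basic operation: shifted channels mix information from neighbouring frames before the next 2D convolution.

```python
import torch

def temporal_shift(x, shift_frac=0.125):
    """Shift a fraction of channels one step along the time axis.

    x -- (N, T, C, H, W) features from a 2D backbone applied per frame.
    """
    n, t, c, h, w = x.shape
    fold = int(c * shift_frac)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out
```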
We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation.
By integrating SGC and GPA, we propose the Adaptive Superpixel-guided Network (ASGNet), a lightweight model that adapts to variations in object scale and shape.
Ranked #55 on Few-Shot Semantic Segmentation on COCO-20i (5-shot)
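The allocation step in a prototype-based few-shot segmenter like the one above can be pictured as a nearest-prototype lookup: each query location selects the support prototype it is most similar to. Below is a minimal cosine-similarity sketch of that idea; the actual GPA module differs in its details, so treat the names and shapes as illustrative.

```python
import torch
import torch.nn.functional as F

def guided_prototype_allocation(query_feat, prototypes):
    """For every query location, pick the most similar support prototype.

    query_feat -- (C, H, W) query feature map
    prototypes -- (K, C) prototypes, e.g. from clustering support features
    Returns a (C, H, W) map where each position holds its chosen prototype.
    """
    c, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(c, -1), dim=0)   # (C, HW), unit columns
    p = F.normalize(prototypes, dim=1)                  # (K, C), unit rows
    sim = p @ q                                         # (K, HW) cosine similarities
    idx = sim.argmax(0)                                 # best prototype per pixel
    return prototypes[idx].t().reshape(c, h, w)
```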
The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes.
Ranked #2 on Zero-Shot Action Recognition on Olympics
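The standard zero-shot recipe behind this formulation maps videos and class labels into a shared semantic space and assigns each video to its nearest class embedding, so an unseen class needs only a label embedding, not training videos. A minimal sketch, assuming both sets of embeddings are precomputed:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_emb, class_embs):
    """Assign each video to the nearest class embedding.

    video_emb  -- (B, D) video features mapped into the semantic space
    class_embs -- (K, D) embeddings of (possibly unseen) class names,
                  e.g. word vectors of the labels
    """
    v = F.normalize(video_emb, dim=1)
    c = F.normalize(class_embs, dim=1)
    return (v @ c.t()).argmax(dim=1)   # index of the best-matching class
```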
In this work, however, we focus on the more standard short, trimmed action recognition problem.
Ranked #3 on Action Recognition on UCF101
We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time.
The workshop was held in conjunction with the International Conference on Learning Representations (ICLR) 2020.
However, it has been observed that in current video datasets, action classes can often be recognized from a single frame, without any temporal information.
FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities.
Ranked #24 on Action Recognition on UCF101
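A hedged sketch of the FASTER idea: run an expensive model on a subset of clips, a cheap model on the rest, and learn to fuse the per-clip outputs into a video-level prediction. The GRU fusion and the `period` schedule below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    """Aggregate per-clip features from models of different complexity."""

    def __init__(self, feat_dim, num_classes, period=4):
        super().__init__()
        self.period = period
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips, expensive, cheap):
        # clips: list of T clip batches; every `period`-th clip goes through
        # the expensive model, the rest through the cheap one.
        feats = [expensive(c) if i % self.period == 0 else cheap(c)
                 for i, c in enumerate(clips)]       # T x (B, feat_dim)
        seq = torch.stack(feats, dim=1)              # (B, T, feat_dim)
        fused, _ = self.gru(seq)                     # fuse across clips
        return self.head(fused[:, -1])               # video-level logits
```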
Motion has been shown to be useful for video understanding, where it is typically represented by optical flow.
Ranked #1 on Action Recognition on UCF-101
Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better.
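In practice, flow for action recognition is usually precomputed with a dense method and stacked as input to a temporal stream. The sketch below uses OpenCV's classical Farneback method as a stand-in; it is one common choice, not necessarily the flow method examined in this work.

```python
import cv2
import numpy as np

def flow_stack(frames):
    """Compute dense optical flow between consecutive grayscale frames.

    frames -- list of HxW uint8 grayscale images
    Returns a (T-1, H, W, 2) array of (dx, dy) fields, the usual input
    to the temporal stream of a two-stream action recognition network.
    """
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
    return np.stack(flows)
```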
Existing algorithms typically focus either on recovering motion and structure under the assumption of a purely static world, or on estimating optical flow for general unconstrained scenes.
Ranked #10 on Optical Flow Estimation on Sintel-clean
Existing optical flow methods make generic, spatially homogeneous assumptions about the spatial structure of the flow.