To address these problems, we present a new boundary-aware cascade network by introducing two novel components.
Ranked #6 on Action Segmentation on GTEA
Our challenge includes two tasks: video structuring in the temporal dimension and multi-modal video classification.
Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Dual Matching Network (DMN), to directly model the relations between language queries and video moments in a joint embedding space.
Spatio-temporal action detection is an important and challenging problem in video understanding.