To address this problem, we propose a multi-attention network consisting of a dual-path dual-attention module and a query-based cross-modal Transformer module.
Ranked #5 on Referring Expression Segmentation on A2D Sentences
This report presents the algorithm used in the submission to the Generic Event Boundary Detection (GEBD) Challenge at CVPR 2022.
To capture the temporal context of each frame, we design the structure context transformer (SC-Transformer) by re-partitioning the input frame sequence.
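One way to picture re-partitioning a frame sequence so that every frame carries its own temporal context is a sliding window centred on each frame. The sketch below is only an illustration under stated assumptions: the function name `repartition_with_context`, the `half_window` parameter, and the edge handling (repeating the first or last frame) are hypothetical and not taken from the report, which does not describe the partitioning scheme here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repartition_with_context(frames: Sequence[T], half_window: int) -> List[List[T]]:
    """Return one local window per frame, centred on that frame.

    Assumption: frames near the start or end are padded by repeating the
    first or last frame, so every window has length 2 * half_window + 1.
    """
    if half_window < 0:
        raise ValueError("half_window must be non-negative")
    if not frames:
        return []

    last = len(frames) - 1
    windows: List[List[T]] = []
    for t in range(len(frames)):
        # Clamp out-of-range indices to the valid range [0, last].
        window = [
            frames[min(max(i, 0), last)]
            for i in range(t - half_window, t + half_window + 1)
        ]
        windows.append(window)
    return windows


# Toy usage: strings stand in for per-frame features.
windows = repartition_with_context(["f0", "f1", "f2", "f3"], half_window=1)
```

Each window could then be passed to a sequence model that scores its centre frame, so the per-frame prediction depends only on nearby frames rather than on the whole video.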
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
In this work, we design a Cascaded Temporal Attention Network (CASTANET) for GEBD, which consists of three parts: the backbone network, the temporal attention module, and the classification module.
Ranked #1 on Boundary Detection on Kinetics-400
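The three-part layout described for CASTANET (backbone, temporal attention, classification) can be sketched as a simple pipeline. This is a minimal, standard-library illustration and not the authors' architecture: the identity `backbone`, the single layer of plain scaled dot-product self-attention (no cascade), the logistic per-frame classifier, and all function names and parameters are assumptions made for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def backbone(frame: Vector) -> Vector:
    # Hypothetical stand-in for a feature extractor: returns a copy of the input.
    return list(frame)


def temporal_attention(features: List[Vector]) -> List[Vector]:
    # Plain scaled dot-product self-attention across time (assumed, not cascaded).
    dim = len(features[0])
    attended: List[Vector] = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return attended


def classify(features: List[Vector], weight: Vector, bias: float) -> List[float]:
    # Per-frame logistic score in (0, 1); weight and bias are illustrative values.
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weight, f)) + bias)))
        for f in features
    ]


def detect(frames: List[Vector], weight: Vector, bias: float,
           threshold: float = 0.5) -> Tuple[List[int], List[float]]:
    # Run the three stages in order and keep frames scoring above the threshold.
    features = [backbone(f) for f in frames]
    attended = temporal_attention(features)
    scores = classify(attended, weight, bias)
    return [t for t, s in enumerate(scores) if s > threshold], scores


frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
indices, scores = detect(frames, weight=[-1.0, 1.0], bias=0.0)
```

Keeping the three stages as separate functions mirrors the modular description in the abstract and makes it easy to swap any one stage for a learned component.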