Cross-Stage Transformer for Video Learning

29 Sep 2021  ·  Yuanze Lin, Xun Guo, Yan Lu ·

Transformer networks have proved effective at modeling long-range dependencies in video learning. However, videos contain rich contextual information in both the spatial and temporal dimensions, e.g., scenes and temporal reasoning. In traditional transformer networks, stacked transformer blocks work in a sequential and independent way, which may lead to inefficient propagation of such contextual information. To address this problem, we propose a cross-stage transformer paradigm, which allows self-attentions and features from different blocks to be fused. By inserting the proposed cross-stage mechanism into existing spatial and temporal transformer blocks, we build a separable transformer network for video learning based on the ViT structure, in which self-attentions and features are progressively aggregated from one block to the next. Extensive experiments show that our approach outperforms existing ViT-based video transformers using the same pre-training dataset on the mainstream action recognition benchmarks Kinetics-400 (81.8% Top-1 accuracy) and Kinetics-600 (84.0% Top-1 accuracy). Owing to the effectiveness of the cross-stage transformer, our method achieves performance comparable to other ViT-based approaches at a much lower inference cost (e.g., 8.6% of ViViT's FLOPs). As an independent module, it can be conveniently added to other video transformer frameworks.
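The abstract does not spell out the fusion equations, so the following PyTorch sketch is only an illustration of the general idea rather than the paper's actual implementation: a hypothetical gated blend of the previous block's attention map with the current one, plus a gated cross-stage feature skip. The class names (`CrossStageAttention`, `CrossStageBlock`) and the gates `alpha`/`beta` are assumptions introduced for illustration.

```python
# Minimal sketch of cross-stage attention/feature fusion (hypothetical; the
# abstract gives no exact formulation). Assumes the cross-stage mechanism
# mixes the previous block's attention map into the current attention with a
# learnable gate, and adds a gated copy of features from an earlier stage.
import torch
import torch.nn as nn


class CrossStageAttention(nn.Module):
    """Self-attention that optionally fuses the attention map of an earlier block."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Gate controlling how much of the previous stage's attention is reused (assumption).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, prev_attn=None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        if prev_attn is not None:
            # Cross-stage fusion: blend in the attention map propagated from the previous block.
            gate = torch.sigmoid(self.alpha)
            attn = (1 - gate) * attn + gate * prev_attn
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn                    # return attn so the next block can reuse it


class CrossStageBlock(nn.Module):
    """Transformer block that passes its attention map and features to later blocks."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = CrossStageAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )
        # Gate controlling how much of the earlier stage's features is reused (assumption).
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x, prev_attn=None, prev_feat=None):
        attn_out, attn = self.attn(self.norm1(x), prev_attn)
        x = x + attn_out
        if prev_feat is not None:
            # Cross-stage feature fusion: add gated features from an earlier block.
            x = x + torch.sigmoid(self.beta) * prev_feat
        x = x + self.mlp(self.norm2(x))
        return x, attn


# Usage: chain blocks so attention maps and features flow across stages.
if __name__ == "__main__":
    blocks = nn.ModuleList([CrossStageBlock(dim=192) for _ in range(4)])
    tokens = torch.randn(2, 197, 192)                  # (batch, patches + cls, dim)
    attn, feat = None, None
    for blk in blocks:
        new_tokens, attn = blk(tokens, prev_attn=attn, prev_feat=feat)
        # Keep the current block's input as the cross-stage feature for the next block,
        # i.e., each block also sees features from two stages back.
        feat, tokens = tokens, new_tokens
    print(tokens.shape)                                # torch.Size([2, 197, 192])
```

Per the abstract, the same mechanism could be inserted into both the spatial and the temporal blocks of a separable ViT-style video transformer; the sketch above covers only a single generic token sequence.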
