AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

30 May 2019 · Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova

Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to a third dimension (using a limited number of space-time modules such as 3D convolutions) or by introducing a handcrafted two-stream design to capture both appearance and motion in videos...


Evaluation results from the paper


 SOTA for Action Classification on Charades (using extra training data)

Task                          Dataset          Model        Metric name     Metric value  Global rank
Action Classification         Charades         AssembleNet  MAP             51.6          #1
Action Recognition In Videos  Charades         AssembleNet  MAP             51.6          #1
Action Classification         Moments in Time  AssembleNet  Top 1 Accuracy  31.02%        #3
Action Classification         Moments in Time  AssembleNet  Top 5 Accuracy  57.38%        #2