AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

ICLR 2020 · Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova

Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using a two-stream design to capture both appearance and motion in videos...


Evaluation Results from the Paper


 SOTA for Action Classification on Charades (using extra training data)

Task                             Dataset                  Model        Metric name     Metric value  Global rank
Action Classification            Charades                 AssembleNet  MAP             51.6          # 1
Action Recognition In Videos     Charades                 AssembleNet  MAP             56.6          # 1
Action Classification            Moments in Time          AssembleNet  Top 1 Accuracy  31.02%        # 3
Action Classification            Moments in Time          AssembleNet  Top 5 Accuracy  57.38%        # 2
Multimodal Activity Recognition  Moments in Time Dataset  AssembleNet  Top-1 (%)       34.27         # 1
Multimodal Activity Recognition  Moments in Time Dataset  AssembleNet  Top-5 (%)       62.71         # 1