AssembleNet++: Assembling Modality Representations via Attention Connections

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality...
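
The peer-attention idea in the abstract can be illustrated with a small NumPy sketch. It assumes channel-wise attention: features from a peer block (or another input modality) are averaged over time and space, mapped through one linear layer, squashed with a sigmoid, and used to rescale the channels of the target block. The function name `peer_attention`, the (T, H, W, C) tensor layout, and the single linear layer are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def peer_attention(target, peer, weight, bias):
    """Rescale the channels of `target` using weights computed from `peer`.

    target: array of shape (T, H, W, C_target), the block being attended.
    peer:   array of shape (T', H', W', C_peer), the block supplying attention.
    weight: array of shape (C_peer, C_target), hypothetical learned projection.
    bias:   array of shape (C_target,), hypothetical learned bias.
    """
    # Global average pool over time and space: one value per peer channel.
    pooled = peer.mean(axis=(0, 1, 2))            # shape (C_peer,)
    # Project to the target's channel count, then squash into (0, 1).
    logits = pooled @ weight + bias               # shape (C_target,)
    attention = 1.0 / (1.0 + np.exp(-logits))
    # Broadcast the per-channel weights over T, H and W.
    return target * attention


# Example with random features; all sizes are arbitrary.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8, 8, 16))   # e.g. appearance features
peer = rng.normal(size=(2, 4, 4, 32))     # e.g. object or motion features
weight = rng.normal(size=(32, 16)) * 0.1
bias = np.zeros(16)

out = peer_attention(target, peer, weight, bias)
print(out.shape)  # (4, 8, 8, 16)
```

Because the peer features are pooled to a single vector, the peer block does not need to match the target's temporal or spatial resolution; only the projection has to map between the two channel counts.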

Results from the Paper


TASK                   DATASET   MODEL                            METRIC NAME  METRIC VALUE  GLOBAL RANK
Action Classification  Charades  AssembleNet++ 50                 MAP          59.8          #2
Action Classification  Charades  AssembleNet++ 50 without object  MAP          54.98         #5
