What Makes Training Multi-Modal Classification Networks Hard?

CVPR 2020 · Weiyao Wang, Du Tran, Matt Feiszli

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network...
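
To make the comparison concrete, below is a minimal PyTorch-style sketch (not from the paper) of the two kinds of models being compared: a single-modal (visual-only) classifier and a naive late-fusion multi-modal classifier trained end-to-end. The module names, feature dimensions, and concatenation fusion are illustrative assumptions, not the architectures or training setup used in the paper.

```python
import torch
import torch.nn as nn

class UniModalClassifier(nn.Module):
    """Single-modal classifier: one feature extractor plus a linear head."""
    def __init__(self, feat_dim=512, num_classes=400):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))

class LateFusionClassifier(nn.Module):
    """Naive multi-modal classifier: per-modality branches, concatenated
    features, and a single shared head trained jointly end-to-end. This is
    the kind of joint model the abstract says can underperform the best
    single-modal model."""
    def __init__(self, video_dim=512, audio_dim=128, num_classes=400):
        super().__init__()
        self.video_branch = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU())
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.head = nn.Linear(256 + 256, num_classes)

    def forward(self, video_feat, audio_feat):
        fused = torch.cat([self.video_branch(video_feat),
                           self.audio_branch(audio_feat)], dim=-1)
        return self.head(fused)

# Toy forward pass with random tensors standing in for clip-level features.
video = torch.randn(8, 512)
audio = torch.randn(8, 128)
print(UniModalClassifier()(video).shape)           # torch.Size([8, 400])
print(LateFusionClassifier()(video, audio).shape)  # torch.Size([8, 400])
```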


Results from the Paper


TASK                    DATASET       MODEL                         METRIC       METRIC VALUE  GLOBAL RANK
Action Classification   Kinetics-400  G-Blend (Sports-1M pretrain)  Accuracy     78.9          #5
Action Classification   Kinetics-400  G-Blend                       Accuracy     77.7          #8
Action Recognition      miniSports    G-Blend                       Video hit@1  62.8          #1
Action Recognition      miniSports    G-Blend                       Video hit@5  85.5          #1
Action Recognition      miniSports    G-Blend                       Clip hit@1   49.7          #1
Action Recognition      Sports-1M     G-Blend                       Video hit@1  74.8          #2
Action Recognition      Sports-1M     G-Blend                       Video hit@5  92.4          #2
