What Makes Training Multi-Modal Classification Networks Hard?

CVPR 2020 · Wei-Yao Wang, Du Tran, Matt Feiszli

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to their increased capacity; second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.


Results from the Paper


 Ranked #1 on Action Recognition on miniSports (Video hit@1 metric)

Task                     Dataset        Model                           Metric        Value   Global Rank
Action Classification    Kinetics-400   G-Blend (Sports-1M pretrain)    Vid acc@1     78.9    #61
Action Classification    Kinetics-400   G-Blend                         Vid acc@1     77.7    #73
Action Recognition       miniSports     G-Blend                         Video hit@1   62.8    #1
Action Recognition       miniSports     G-Blend                         Video hit@5   85.5    #1
Action Recognition       miniSports     G-Blend                         Clip Hit@1    49.7    #1
Action Recognition       Sports-1M      G-Blend                         Video hit@1   74.8    #3
Action Recognition       Sports-1M      G-Blend                         Video hit@5   92.4    #3

Methods


No methods listed for this paper.