M&M Mix: A Multimodal Multiview Transformer Ensemble

20 Jun 2022 · Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on action classes on the test set, which is 4.1% higher than last year's winning entry.
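The abstract does not say how the ensemble combines its members' predictions, so the following is only a minimal sketch of one common choice: averaging per-model class probabilities and scoring the result with Top-1 accuracy. The function names, the toy logits, and the averaging rule itself are assumptions made for illustration, not the authors' method.

```python
import math

def softmax(logits):
    """Convert one row of raw scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(per_model_logits):
    """Average class probabilities across models for a single clip.

    per_model_logits: list of logit rows, one per ensemble member.
    Assumption: uniform averaging; the report may weight members differently.
    """
    probs = [softmax(row) for row in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]

def top1_accuracy(predictions, labels):
    """Fraction of clips whose highest-scoring class equals the label."""
    correct = sum(
        1 for p, y in zip(predictions, labels)
        if max(range(len(p)), key=p.__getitem__) == y
    )
    return correct / len(labels)

# Toy data (hypothetical): 2 models, 3 clips, 3 classes.
model_a = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.9, 1.0, 0.2]]
model_b = [[1.8, 0.4, 0.3], [0.1, 1.0, 1.2], [2.5, 0.2, 0.1]]
labels = [0, 1, 0]

ensembled = [ensemble_probs([a, b]) for a, b in zip(model_a, model_b)]
accuracy = top1_accuracy(ensembled, labels)
print(f"Ensemble Top-1 accuracy: {accuracy:.3f}")
```

In this toy setup each model misclassifies a different clip, while the averaged probabilities classify all three correctly, which is the usual motivation for ensembling models that differ in backbone size or input modality.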


Results from the Paper


Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Task                Dataset            Model          Metric Name  Metric Value  Global Rank
Action Recognition  EPIC-KITCHENS-100  M&M (WTS 60M)  Action@1     53.6          #2
Action Recognition  EPIC-KITCHENS-100  M&M (WTS 60M)  Verb@1       72.0          #4
Action Recognition  EPIC-KITCHENS-100  M&M (WTS 60M)  Noun@1       66.3          #1

Methods