Supervised Multimodal Bitransformers for Classifying Images and Text

6 Sep 2019  ·  Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine ·

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.

PDF Abstract

Results from the Paper

 Ranked #1 on Natural Language Inference on V-SNLI (using extra training data)

     Get a GitHub badge
Task Dataset Model Metric Name Metric Value Global Rank Uses Extra
Training Data
Result Benchmark
Natural Language Inference V-SNLI MMBT Accuracy 90.5 # 1