TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Natural Language Inference	V-SNLI	MMBT	Accuracy	90.5	#1

Supervised Multimodal Bitransformers for Classifying Images and Text

6 Sep 2019 · Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
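
The abstract does not spell out how the two encoders are fused, so the following is only a minimal sketch of one plausible reading: image features are linearly projected into the text token embedding space and concatenated with the text embeddings into a single sequence that a bidirectional transformer could then consume. Every name, dimension, and value here (`linear`, `fuse_inputs`, `D_IMAGE`, `D_MODEL`, the segment vectors) is a hypothetical placeholder, and the transformer itself is omitted.

```python
# Illustrative sketch only: assumes fusion by projecting image features into
# the token embedding space and concatenating them with text embeddings.
# All names and sizes are hypothetical, not taken from the paper.

def linear(vec, weight):
    """Project vec (length d_in) with weight (d_in x d_out) to length d_out."""
    d_out = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_out)]


def fuse_inputs(text_embeddings, image_features, projection,
                text_segment, image_segment):
    """Return one sequence: projected image features, then text embeddings.

    A segment vector is added to each position so a downstream encoder
    can tell the two modalities apart.
    """
    fused = []
    for feat in image_features:
        projected = linear(feat, projection)
        fused.append([p + s for p, s in zip(projected, image_segment)])
    for emb in text_embeddings:
        fused.append([e + s for e, s in zip(emb, text_segment)])
    return fused


D_IMAGE, D_MODEL = 4, 3  # toy sizes chosen for readability
projection = [[0.1 * (i + j) for j in range(D_MODEL)] for i in range(D_IMAGE)]
image_features = [[1.0] * D_IMAGE, [0.5] * D_IMAGE]     # two image embeddings
text_embeddings = [[0.0, 1.0, 2.0] for _ in range(3)]   # three text tokens
text_segment = [0.0] * D_MODEL
image_segment = [1.0] * D_MODEL

fused = fuse_inputs(text_embeddings, image_features, projection,
                    text_segment, image_segment)
print(len(fused), len(fused[0]))  # prints: 5 3
```

In a real system the fused sequence would be passed to a pretrained transformer encoder and a classification head; which encoder, how many image embeddings, and how positions are encoded are all design choices this sketch leaves open.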
