GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

WS 2018 · Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks.


Evaluation Results from the Paper


TASK                         DATASET   MODEL                     METRIC                   VALUE  GLOBAL RANK
Natural Language Inference   MultiNLI  Multi-task BiLSTM + Attn  Matched accuracy (%)     72.2   #15
Natural Language Inference   MultiNLI  Multi-task BiLSTM + Attn  Mismatched accuracy (%)  72.1   #14
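
The matched and mismatched scores above are plain classification accuracies on MultiNLI's two development sets: matched examples come from the same genres seen in training, mismatched examples from held-out genres. As a minimal sketch of how such scores are computed, the snippet below uses a hypothetical `accuracy` helper and toy label lists (not from the paper); only the metric definition itself is taken from the benchmark.

```python
from typing import Sequence

def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of examples where the predicted label equals the gold label."""
    assert len(gold) == len(pred), "gold and predicted labels must align"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy predictions for illustration; MultiNLI labels are one of
# {"entailment", "neutral", "contradiction"}.
matched_gold = ["entailment", "neutral", "contradiction", "entailment"]
matched_pred = ["entailment", "neutral", "neutral", "entailment"]

# Mismatched dev examples come from genres not seen during training.
mismatched_gold = ["neutral", "contradiction", "entailment", "neutral"]
mismatched_pred = ["neutral", "contradiction", "entailment", "contradiction"]

print(f"Matched accuracy:    {100 * accuracy(matched_gold, matched_pred):.1f}%")
print(f"Mismatched accuracy: {100 * accuracy(mismatched_gold, mismatched_pred):.1f}%")
```

Reporting the two splits separately is deliberate: comparing matched against mismatched accuracy indicates how well a model generalizes across text genres rather than within them.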