We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Because of this moving target, new models are often still evaluated on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. To address this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and will evolve the challenge alongside models. This paper describes the data for a shared task at our ACL 2021 Workshop, in which we invite the entire NLG community to participate.

ACL (GEM) 2021

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Text Simplification | ASSET | T5 | METEOR | 0.581 | # 1 |
| Text Simplification | ASSET | BART | METEOR | 0.560 | # 2 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | TGen | METEOR (Validation set) | 0.391 | # 2 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | T5 | METEOR (Validation set) | 0.369 | # 4 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | LSTM | METEOR (Validation set) | 0.394 | # 1 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | BART | METEOR (Validation set) | 0.373 | # 3 |
| Text Generation | CommonGen | T5 | METEOR | 0.291 | # 2 |
| Text Generation | CommonGen | BART | METEOR | 0.301 | # 1 |
| Text Generation | Czech restaurant information | TGen++ | METEOR | 0.167 | # 1 |
| Text Generation | Czech restaurant information | TGen+ | METEOR | 0.151 | # 3 |
| Text Generation | Czech restaurant information | TGen | METEOR | 0.152 | # 2 |
| Text Generation | DART | BART | METEOR | 0.107 | # 3 |
| Text Generation | DART | T5 | METEOR | 0.115 | # 2 |
| Extreme Summarization | GEM-XSum | PEGASUS | ROUGE-2 | 23.2 | # 1 |
| Extreme Summarization | GEM-XSum | PEGASUS | Parameters | 568 M | # 1 |
| Abstractive Text Summarization | MLSUM de | mBART | METEOR | 0.437 | # 1 |
| Abstractive Text Summarization | MLSUM es | mBART | METEOR | 0.210 | # 1 |
| Task-Oriented Dialogue Systems | SGD | BART | METEOR | 0.089 | # 2 |
| Task-Oriented Dialogue Systems | SGD | T5 | METEOR | 0.331 | # 1 |
| Data-to-Text Generation | ToTTo | T5 | METEOR | 0.363 | # 1 |
| Text Simplification | TurkCorpus | T5 | METEOR | 0.649 | # 1 |
| Text Simplification | TurkCorpus | BART | METEOR | 0.556 | # 2 |
| Data-to-Text Generation | WebNLG en | mT5 | METEOR | 0.287 | # 2 |
| Data-to-Text Generation | WebNLG en | mBART | METEOR | 0.462 | # 1 |
| Data-to-Text Generation | WebNLG ru | mT5 | METEOR | 0.180 | # 2 |
| Data-to-Text Generation | WebNLG ru | mBART | METEOR | 0.613 | # 1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (es->en) | mBART | METEOR | 0.178 | # 2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (es->en) | mBART+ | METEOR | 0.196 | # 1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (ru->en) | mBART | METEOR | 0.153 | # 2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (ru->en) | mBART+ | METEOR | 0.174 | # 1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (tr->en) | mBART+ | METEOR | 0.204 | # 1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (tr->en) | mBART | METEOR | 0.164 | # 2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (vi->en) | mBART+ | METEOR | 0.183 | # 1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (vi->en) | mBART | METEOR | 0.150 | # 2 |
| Extreme Summarization | XSum | PEGASUS | METEOR | 0.216 | # 1 |
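The scores above are dominated by two n-gram overlap metrics: ROUGE-2 (bigram-overlap F1) and METEOR (a recall-weighted unigram F-mean scaled by a fragmentation penalty). As a rough illustration of what these numbers measure, here is a minimal sketch of both in plain Python. This is not the official scoring code: the real METEOR also uses stemming, synonym matching, and an optimized alignment, which this exact-match version omits, so its values will differ from the table.

```python
from collections import Counter

def rouge_2(reference: str, hypothesis: str) -> float:
    """Simplified ROUGE-2: F1 over bigram overlap of whitespace tokens."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    ref, hyp = bigrams(reference.split()), bigrams(hypothesis.split())
    overlap = sum((ref & hyp).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def simple_meteor(reference: str, hypothesis: str) -> float:
    """Exact-match METEOR sketch: F-mean weighted 9:1 toward recall,
    times a fragmentation penalty (no stemming/synonyms)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Greedy left-to-right alignment of exact unigram matches.
    used = [False] * len(ref)
    alignment = []
    for h_i, h_tok in enumerate(hyp):
        for r_i, r_tok in enumerate(ref):
            if not used[r_i] and h_tok == r_tok:
                used[r_i] = True
                alignment.append((h_i, r_i))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A "chunk" is a maximal run of matches contiguous in both strings;
    # fewer chunks means better word order, hence a smaller penalty.
    chunks = 1
    for (h1, r1), (h2, r2) in zip(alignment, alignment[1:]):
        if h2 != h1 + 1 or r2 != r1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

Note that even a perfect hypothesis scores slightly below 1.0 under METEOR, since a single chunk still incurs a small penalty of 0.5 / matches³.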
