We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Because of this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the shared task we are organizing at our ACL 2021 Workshop, and we invite the entire NLG community to participate.


Results from the Paper


| Task | Dataset | Model | Metric | Value | Rank |
| --- | --- | --- | --- | --- | --- |
| Text Simplification | ASSET | T5 | METEOR | 0.581 | #1 |
| Text Simplification | ASSET | BART | METEOR | 0.560 | #2 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | LSTM | METEOR | 0.394 | #1 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | TGen | METEOR | 0.391 | #2 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | BART | METEOR | 0.373 | #3 |
| Data-to-Text Generation | Cleaned E2E NLG Challenge | T5 | METEOR | 0.369 | #4 |
| Text Generation | CommonGen | BART | METEOR | 0.301 | #1 |
| Text Generation | CommonGen | T5 | METEOR | 0.291 | #2 |
| Text Generation | Czech restaurant information | TGen++ | METEOR | 0.167 | #1 |
| Text Generation | Czech restaurant information | TGen | METEOR | 0.152 | #2 |
| Text Generation | Czech restaurant information | TGen+ | METEOR | 0.151 | #3 |
| Text Generation | DART | T5 | METEOR | 0.115 | #1 |
| Text Generation | DART | BART | METEOR | 0.107 | #2 |
| Abstractive Text Summarization | MLSUM de | mBART | METEOR | 0.437 | #1 |
| Abstractive Text Summarization | MLSUM es | mBART | METEOR | 0.210 | #1 |
| Task-Oriented Dialogue Systems | SGD | T5 | METEOR | 0.331 | #1 |
| Task-Oriented Dialogue Systems | SGD | BART | METEOR | 0.089 | #2 |
| Data-to-Text Generation | ToTTo | T5 | METEOR | 0.363 | #1 |
| Text Simplification | TurkCorpus | T5 | METEOR | 0.649 | #1 |
| Text Simplification | TurkCorpus | BART | METEOR | 0.556 | #2 |
| Data-to-Text Generation | WebNLG en | mBART | METEOR | 0.462 | #1 |
| Data-to-Text Generation | WebNLG en | mT5 | METEOR | 0.287 | #2 |
| Data-to-Text Generation | WebNLG ru | mBART | METEOR | 0.613 | #1 |
| Data-to-Text Generation | WebNLG ru | mT5 | METEOR | 0.180 | #2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (es->en) | mBART+ | METEOR | 0.196 | #1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (es->en) | mBART | METEOR | 0.178 | #2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (ru->en) | mBART+ | METEOR | 0.174 | #1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (ru->en) | mBART | METEOR | 0.153 | #2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (tr->en) | mBART+ | METEOR | 0.204 | #1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (tr->en) | mBART | METEOR | 0.164 | #2 |
| Cross-Lingual Abstractive Summarization | WikiLingua (vi->en) | mBART+ | METEOR | 0.183 | #1 |
| Cross-Lingual Abstractive Summarization | WikiLingua (vi->en) | mBART | METEOR | 0.150 | #2 |
| Extreme Summarization | XSum | PEGASUS | METEOR | 0.216 | #1 |
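All of the results above are reported with METEOR, which scores a hypothesis against a reference by combining unigram precision and recall (weighted toward recall) with a fragmentation penalty for out-of-order matches. As a rough illustration only, the sketch below implements a simplified, exact-match-only variant of this scoring; the full METEOR metric additionally matches stems, synonyms, and paraphrases, so the function name and the exact numbers here are illustrative, not the official implementation.

```python
def meteor_exact(reference, hypothesis):
    """Simplified METEOR with exact surface matches only (illustrative sketch)."""
    # Greedy alignment: map each hypothesis token to the first
    # unused reference token with the same surface form.
    used = [False] * len(reference)
    alignment = []  # (hypothesis_index, reference_index) pairs
    for i, tok in enumerate(hypothesis):
        for j, ref_tok in enumerate(reference):
            if not used[j] and tok == ref_tok:
                used[j] = True
                alignment.append((i, j))
                break

    m = len(alignment)  # number of matched unigrams
    if m == 0:
        return 0.0

    precision = m / len(hypothesis)
    recall = m / len(reference)
    # Harmonic mean weighted 9:1 toward recall, as in METEOR.
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # Chunks: maximal runs of matches that are contiguous and in the
    # same order in both sentences; fewer chunks = better word order.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3

    return f_mean * (1 - penalty)


ref = "the cat sat on the mat".split()
print(meteor_exact(ref, "the cat sat on the mat".split()))  # near 1.0
print(meteor_exact(ref, "mat the on sat cat the".split()))  # heavily penalized order
```

Note how the scrambled hypothesis keeps perfect precision and recall but loses half its score to the fragmentation penalty; this sensitivity to word order is one reason METEOR is preferred over pure overlap metrics for generation tasks.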
