TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Text Classification	MuLD (Character Type)	T5	F1	54.01	# 2
Text Classification	MuLD (Character Type)	Longformer	F1	82.58	# 1
Question Answering	MuLD (HotpotQA)	Longformer	BLEU-1	30.38	# 1
Question Answering	MuLD (HotpotQA)	Longformer	BLEU-4	16.76	# 1
Question Answering	MuLD (HotpotQA)	Longformer	Rouge-L	30.49	# 1
Question Answering	MuLD (HotpotQA)	Longformer	METEOR	4.98	# 1
Question Answering	MuLD (HotpotQA)	T5	BLEU-1	28.11	# 2
Question Answering	MuLD (HotpotQA)	T5	BLEU-4	13.63	# 2
Question Answering	MuLD (HotpotQA)	T5	Rouge-L	27.61	# 2
Question Answering	MuLD (HotpotQA)	T5	METEOR	4.46	# 2
Question Answering	MuLD (NarrativeQA)	T5	BLEU-1	17.67	# 2
Question Answering	MuLD (NarrativeQA)	T5	BLEU-4	55	# 2
Question Answering	MuLD (NarrativeQA)	T5	Rouge-L	19.03	# 2
Question Answering	MuLD (NarrativeQA)	T5	METEOR	3.36	# 2
Question Answering	MuLD (NarrativeQA)	Longformer	BLEU-1	19.84	# 1
Question Answering	MuLD (NarrativeQA)	Longformer	BLEU-4	62	# 1
Question Answering	MuLD (NarrativeQA)	Longformer	Rouge-L	22.09	# 1
Question Answering	MuLD (NarrativeQA)	Longformer	METEOR	4.52	# 1
Translation	MuLD (OpenSubtitles)	T5	BLEU-1	34.07	# 1
Translation	MuLD (OpenSubtitles)	T5	BLEU-4	1.63	# 2
Translation	MuLD (OpenSubtitles)	T5	Rouge-L	35.35	# 1
Translation	MuLD (OpenSubtitles)	T5	METEOR	38.53	# 1
Translation	MuLD (OpenSubtitles)	Longformer	BLEU-1	22.74	# 2
Translation	MuLD (OpenSubtitles)	Longformer	BLEU-4	20	# 1
Translation	MuLD (OpenSubtitles)	Longformer	Rouge-L	22.17	# 2
Translation	MuLD (OpenSubtitles)	Longformer	METEOR	22.95	# 2
Style change detection	MuLD (Style Change)	T5	F1	26.49	# 2
Style change detection	MuLD (Style Change)	Longformer	F1	28.17	# 1
Summarization	MuLD (VLSP)	T5	BLEU-1	28.85	# 2
Summarization	MuLD (VLSP)	T5	BLEU-4	84	# 1
Summarization	MuLD (VLSP)	T5	Rouge-L	16.55	# 2
Summarization	MuLD (VLSP)	T5	METEOR	7.98	# 2
Summarization	MuLD (VLSP)	Longformer	BLEU-1	46.74	# 1
Summarization	MuLD (VLSP)	Longformer	BLEU-4	3.05	# 2
Summarization	MuLD (VLSP)	Longformer	Rouge-L	19.52	# 1
Summarization	MuLD (VLSP)	Longformer	METEOR	9.58	# 1
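
The Global Rank column in the table above matches a descending sort of the metric value within each dataset and metric pair. The following is a minimal sketch of that ordering, not the site's actual ranking code: the function name `global_ranks` and the tuple layout are hypothetical, and the sample rows are copied from the table.

```python
from collections import defaultdict

# Sample rows copied from the table above: (dataset, model, metric, value).
ROWS = [
    ("MuLD (OpenSubtitles)", "T5", "BLEU-1", 34.07),
    ("MuLD (OpenSubtitles)", "Longformer", "BLEU-1", 22.74),
    ("MuLD (VLSP)", "T5", "Rouge-L", 16.55),
    ("MuLD (VLSP)", "Longformer", "Rouge-L", 19.52),
]


def global_ranks(rows):
    """Rank models within each (dataset, metric) pair, highest value first.

    Assumption: a higher metric value is better, which holds for the
    F1, BLEU, Rouge-L and METEOR scores listed in the table.
    """
    groups = defaultdict(list)
    for dataset, model, metric, value in rows:
        groups[(dataset, metric)].append((value, model))

    ranks = {}
    for (dataset, metric), entries in groups.items():
        # Sort by metric value only, largest first.
        ordered = sorted(entries, key=lambda entry: entry[0], reverse=True)
        for position, (_, model) in enumerate(ordered, start=1):
            ranks[(dataset, metric, model)] = position
    return ranks


if __name__ == "__main__":
    for key, rank in sorted(global_ranks(ROWS).items()):
        print(key, f"# {rank}")
```

On the sample rows this reproduces the ranks shown in the table: T5 is first for BLEU-1 on OpenSubtitles, and Longformer is first for Rouge-L on VLSP.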

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/muld-the-multitask-long-document-benchmark/text-classification-on-muld-character-type)](https://paperswithcode.com/sota/text-classification-on-muld-character-type?p=muld-the-multitask-long-document-benchmark)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/muld-the-multitask-long-document-benchmark/question-answering-on-muld-hotpotqa)](https://paperswithcode.com/sota/question-answering-on-muld-hotpotqa?p=muld-the-multitask-long-document-benchmark)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/muld-the-multitask-long-document-benchmark/question-answering-on-muld-narrativeqa)](https://paperswithcode.com/sota/question-answering-on-muld-narrativeqa?p=muld-the-multitask-long-document-benchmark)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/muld-the-multitask-long-document-benchmark/translation-on-muld-opensubtitles)](https://paperswithcode.com/sota/translation-on-muld-opensubtitles?p=muld-the-multitask-long-document-benchmark)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/muld-the-multitask-long-document-benchmark/style-change-detection-on-muld-style-change)](https://paperswithcode.com/sota/style-change-detection-on-muld-style-change?p=muld-the-multitask-long-document-benchmark)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/muld-the-multitask-long-document-benchmark/summarization-on-muld-vlsp)](https://paperswithcode.com/sota/summarization-on-muld-vlsp?p=muld-the-multitask-long-document-benchmark)`
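
The six badge snippets above share one URL pattern and differ only in the task slug. The sketch below rebuilds a snippet from the two slugs. The helper name `pwc_badge_markdown` is hypothetical, and the pattern is inferred from the snippets listed here rather than from any documented API.

```python
BADGE_ENDPOINT = "https://img.shields.io/endpoint.svg"
PWC_BASE = "https://paperswithcode.com"


def pwc_badge_markdown(paper_slug: str, task_slug: str) -> str:
    """Build the Markdown badge snippet for one paper and one task.

    Assumption: the URL layout matches the snippets listed above.
    """
    # Endpoint that the badge image is rendered from.
    badge_source = f"{PWC_BASE}/badge/{paper_slug}/{task_slug}"
    image_url = f"{BADGE_ENDPOINT}?url={badge_source}"
    # Leaderboard page the badge links to.
    target_url = f"{PWC_BASE}/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({image_url})]({target_url})"


if __name__ == "__main__":
    print(
        pwc_badge_markdown(
            "muld-the-multitask-long-document-benchmark",
            "summarization-on-muld-vlsp",
        )
    )
```

Called with the paper slug and `summarization-on-muld-vlsp`, this produces the same string as the last snippet in the list.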

MuLD: The Multitask Long Document Benchmark

LREC 2022 · G Thomas Hudson, Noura Al Moubayed

The impressive progress in NLP techniques has been driven by the development of multi-task benchmarks such as GLUE and SuperGLUE. While these benchmarks focus on tasks for one or two input sentences, there has been exciting work in designing efficient techniques for processing much longer inputs. In this paper, we present MuLD: a new long document benchmark consisting only of documents over 10,000 tokens. By modifying existing NLP tasks, we create a diverse benchmark which requires models to successfully model long-term dependencies in the text. We evaluate how existing models perform, and find that our benchmark is much more challenging than its 'short document' equivalents. Furthermore, by evaluating both regular and efficient transformers, we show that models with increased context length are better able to solve the tasks presented, suggesting that future improvements in these models are vital for solving similar long document problems. We release the data and code for baselines to encourage further research on efficient NLP models.

Code

ghomashudson/muld official

Tasks

Question Answering

Style change detection

Summarization

Text Classification

Translation

Datasets

Introduced in the Paper:

MuLD

Used in the Paper:

NarrativeQA

Results from the Paper

Ranked #1 on Translation on MuLD (OpenSubtitles)

