SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

19 Nov 2021  ·  Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.


Datasets


Introduced in the Paper:

SLUE

Used in the Paper:

VoxCeleb1, ATIS, TED-LIUM 3, ASR-GLUE

Results from the Paper


All results are on the SLUE benchmark. Each metric value is followed by its leaderboard rank in parentheses, reproduced as listed on the source page; "Text model" gives the text classifier/tagger used by pipeline systems (N/A or "-" for end-to-end systems).

Named Entity Recognition (SLUE)

Model                                     | F1 (%)     | label-F1 (%) | Text model
------------------------------------------|------------|--------------|-----------
W2V2-L-LL60K (pipeline approach, uses LM) | 69.6 (#1)  | 82.2 (#1)    | DeBERTa-L
W2V2-B-LS960 (pipeline approach, uses LM) | 68.0 (#2)  | 79.8 (#2)    | DeBERTa-L
W2V2-L-LL60K (e2e approach, uses LM)      | 64.8 (#4)  | 73.3 (#5)    | N/A
W2V2-B-LS960 (e2e approach, uses LM)      | 63.4 (#5)  | 71.7 (#6)    | N/A
HuBERT-B-LS960 (e2e approach, uses LM)    | 61.9 (#6)  | 70.3 (#7)    | N/A
W2V2-B-VP100K (e2e approach, uses LM)     | 61.8 (#7)  | 69.8 (#8)    | N/A
W2V2-L-LL60K (pipeline approach)          | 57.8 (#8)  | 78.8 (#3)    | DeBERTa-L
W2V2-L-LL60K (e2e approach)               | 50.9 (#9)  | 64.7 (#9)    | -
W2V2-B-LS960 (e2e approach)               | 50.2 (#10) | 64.0 (#10)   | -
HuBERT-B-LS960 (e2e approach)             | 49.8 (#11) | 62.9 (#11)   | -
W2V2-B-LS960 (pipeline approach)          | 49.5 (#12) | 74.2 (#4)    | DeBERTa-L
W2V2-B-VP100K (e2e approach)              | 47.9 (#13) | 60.8 (#12)   | -

Sentiment Analysis (SLUE)

Model                                     | Recall (%) | F1 (%)    | Text model
------------------------------------------|------------|-----------|-----------
W2V2-L-LL60K (pipeline approach, uses LM) | 60.4 (#1)  | 63.3 (#1) | DeBERTa-L
W2V2-L-LL60K (pipeline approach)          | 60.2 (#2)  | 63.3 (#1) | DeBERTa-L
W2V2-B-LS960 (pipeline approach, uses LM) | 60.0 (#3)  | 62.9 (#3) | DeBERTa-L
W2V2-B-LS960 (pipeline approach)          | 59.0 (#4)  | 61.8 (#4) | DeBERTa-L
W2V2-L-LL60K (e2e approach)               | 49.2 (#5)  | 48.5 (#5) | N/A
HuBERT-B-LS960 (e2e approach)             | 47.5 (#6)  | 48.0 (#6) | N/A
W2V2-B-LS960 (e2e approach)               | 46.0 (#7)  | 46.6 (#7) | N/A
W2V2-B-VP100K (e2e approach)              | 38.7 (#8)  | 38.4 (#8) | N/A

Speech Recognition (SLUE), word error rate (WER, %)

Model                          | VoxPopuli (Dev) | VoxPopuli (Test) | VoxCeleb (Dev) | VoxCeleb (Test)
-------------------------------|-----------------|------------------|----------------|----------------
W2V2-L-LL60K (+ TED-LIUM 3 LM) | 9.1 (#8)        | 9.3 (#8)         | 9.1 (#8)       | 10.8 (#8)
W2V2-L-LL60K (+ in-domain LM)  | 12.0 (#6)       | 12.5 (#5)        | 11.8 (#6)      | 13.8 (#6)
W2V2-B-LS960 (+ TED-LIUM 3 LM) | 12.0 (#6)       | 12.2 (#6)        | 13.2 (#5)      | 15.8 (#5)
W2V2-L-LL60K                   | 14.0 (#5)       | 12.1 (#7)        | 11.0 (#7)      | 13.5 (#7)
W2V2-B-LS960 (+ in-domain LM)  | 14.6 (#4)       | 15.2 (#4)        | 15.2 (#4)      | 18.2 (#4)
W2V2-B-LS960                   | 17.2 (#3)       | 17.9 (#3)        | 17.2 (#3)      | 20.5 (#3)
HuBERT-B-LS960                 | 18.6 (#2)       | 19.1 (#2)        | 19.6 (#2)      | 21.2 (#2)
W2V2-B-VP100K                  | 21.6 (#1)       | 22.4 (#1)        | 29.9 (#1)      | 33.4 (#1)

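The speech recognition table reports word error rate: the word-level edit distance (substitutions, insertions, deletions) between the hypothesis and reference transcripts, divided by the number of reference words. A self-contained sketch of the standard computation (whitespace tokenization is a simplifying assumption; SLUE's scoring also applies text normalization not shown here):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance over the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)
```

For instance, wer("the cat sat", "the cat sat down") is 1/3: one inserted word against three reference words. Multiply by 100 to match the percentage figures in the table.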