SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

19 Nov 2021 · Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
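
A recurring contrast in the results below is pipeline systems (ASR followed by a text model on the transcript) versus end-to-end (e2e) systems fine-tuned to predict labels directly from speech. Here is a minimal sketch of the pipeline idea for NER, using off-the-shelf Hugging Face checkpoints as illustrative stand-ins; the paper's baselines instead fine-tune wav2vec 2.0 variants for ASR and use DeBERTa-L as the text model, and the audio path is hypothetical:

```python
# Pipeline approach (sketch): transcribe speech, then run a text NER model.
# Checkpoints and audio path are illustrative stand-ins, not the paper's setup.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
ner = pipeline("token-classification",
               model="dslim/bert-base-NER", aggregation_strategy="simple")

transcript = asr("utterance.wav")["text"]  # speech -> text
# A cased NER model may need true-cased text; this CTC model emits uppercase.
entities = ner(transcript)                 # text -> (tag, span) predictions
print(transcript, entities)
```

An e2e system instead trains the speech encoder to emit the labels (e.g., entity-tagged transcripts) directly, which is why the e2e rows in the tables below list no text model.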


Datasets

Introduced in the Paper: SLUE

Used in the Paper: VoxCeleb1, ATIS, ASR-GLUE
Results

All results below are on the SLUE dataset; "#n" gives the model's global rank on the leaderboard for that metric.

Sentiment Analysis (SLUE)

Model                                     | Text model | Recall (%) | F1 (%)
W2V2-L-LL60K (pipeline approach, uses LM) | DeBERTa-L  | 60.4 (#1)  | 63.3 (#1)
W2V2-L-LL60K (pipeline approach)          | DeBERTa-L  | 60.2 (#2)  | 63.3 (#1)
W2V2-B-LS960 (pipeline approach, uses LM) | DeBERTa-L  | 60.0 (#3)  | 62.9 (#3)
W2V2-B-LS960 (pipeline approach)          | DeBERTa-L  | 59.0 (#4)  | 61.8 (#4)
W2V2-L-LL60K (e2e approach)               | N/A        | 49.2 (#5)  | 48.5 (#5)
HuBERT-B-LS960 (e2e approach)             | N/A        | 47.5 (#6)  | 48.0 (#6)
W2V2-B-LS960 (e2e approach)               | N/A        | 46.0 (#7)  | 46.6 (#7)
W2V2-B-VP100K (e2e approach)              | N/A        | 38.7 (#8)  | 38.4 (#8)
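
Recall and F1 for sentiment analysis are averaged across the sentiment classes (macro-averaging, per the paper), so frequent classes do not dominate the score. A minimal scoring sketch with scikit-learn, assuming gold and predicted labels aligned by utterance; the toy labels are ours:

```python
# Macro-averaged recall and F1 for sentiment classification (sketch).
# The toy label lists below are illustrative, not SLUE data.
from sklearn.metrics import f1_score, recall_score

gold = ["positive", "neutral", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]

recall = recall_score(gold, pred, average="macro")  # mean of per-class recalls
f1 = f1_score(gold, pred, average="macro")          # mean of per-class F1s
print(f"recall={100 * recall:.1f}%  f1={100 * f1:.1f}%")
```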
Named Entity Recognition (NER) (SLUE)

Model                                     | Text model | F1 (%)     | label-F1 (%)
W2V2-L-LL60K (pipeline approach, uses LM) | DeBERTa-L  | 69.6 (#1)  | 82.2 (#1)
W2V2-B-LS960 (pipeline approach, uses LM) | DeBERTa-L  | 68.0 (#2)  | 79.8 (#2)
W2V2-L-LL60K (e2e approach, uses LM)      | N/A        | 64.8 (#4)  | 73.3 (#5)
W2V2-B-LS960 (e2e approach, uses LM)      | N/A        | 63.4 (#5)  | 71.7 (#6)
HuBERT-B-LS960 (e2e approach, uses LM)    | N/A        | 61.9 (#6)  | 70.3 (#7)
W2V2-B-VP100K (e2e approach, uses LM)     | N/A        | 61.8 (#7)  | 69.8 (#8)
W2V2-L-LL60K (pipeline approach)          | DeBERTa-L  | 57.8 (#8)  | 78.8 (#3)
W2V2-L-LL60K (e2e approach)               | N/A        | 50.9 (#9)  | 64.7 (#9)
W2V2-B-LS960 (e2e approach)               | N/A        | 50.2 (#10) | 64.0 (#10)
HuBERT-B-LS960 (e2e approach)             | N/A        | 49.8 (#11) | 62.9 (#11)
W2V2-B-LS960 (pipeline approach)          | DeBERTa-L  | 49.5 (#12) | 74.2 (#4)
W2V2-B-VP100K (e2e approach)              | N/A        | 47.9 (#13) | 60.8 (#12)
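
In SLUE, spoken NER is scored over unordered sets of predicted (tag, phrase) pairs per utterance: F1 counts a pair correct only if both the entity tag and the decoded phrase match the reference, while label-F1 matches the tag alone, so it is insensitive to ASR spelling errors inside an entity. A minimal micro-averaged sketch under that reading (the helper function and toy data are ours):

```python
from collections import Counter

def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over per-utterance multisets of items."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        g, p = Counter(gold), Counter(pred)
        overlap = sum((g & p).values())   # items matched in both multisets
        tp += overlap
        fp += sum(p.values()) - overlap   # predicted but not in gold
        fn += sum(g.values()) - overlap   # gold but not predicted
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy utterance with one ASR spelling error inside an entity.
gold = [[("PLACE", "paris"), ("PERSON", "macron")]]
pred = [[("PLACE", "paris"), ("PERSON", "macrons")]]

f1 = micro_f1(gold, pred)                               # exact (tag, phrase) match
label_f1 = micro_f1([[t for t, _ in u] for u in gold],
                    [[t for t, _ in u] for u in pred])  # tag-only match
print(f"F1={f1:.2f}  label-F1={label_f1:.2f}")
```

In the toy example, the ASR error ("macrons" for "macron") costs F1 but not label-F1, mirroring the gap between the two columns above.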
Speech Recognition (SLUE), word error rate (%)

Model                          | VoxPopuli (Dev) | VoxPopuli (Test) | VoxCeleb (Dev) | VoxCeleb (Test)
W2V2-L-LL60K (+ TED-LIUM 3 LM) | 9.1 (#1)        | 9.3 (#1)         | 9.1 (#1)       | 10.8 (#1)
W2V2-L-LL60K (+ in-domain LM)  | 12.0 (#2)       | 12.5 (#4)        | 11.8 (#3)      | 13.8 (#3)
W2V2-B-LS960 (+ TED-LIUM 3 LM) | 12.0 (#2)       | 12.2 (#3)        | 13.2 (#4)      | 15.8 (#4)
W2V2-L-LL60K                   | 14.0 (#4)       | 12.1 (#2)        | 11.0 (#2)      | 13.5 (#2)
W2V2-B-LS960 (+ in-domain LM)  | 14.6 (#5)       | 15.2 (#5)        | 15.2 (#5)      | 18.2 (#5)
W2V2-B-LS960                   | 17.2 (#6)       | 17.9 (#6)        | 17.2 (#6)      | 20.5 (#6)
HuBERT-B-LS960                 | 18.6 (#7)       | 19.1 (#7)        | 19.6 (#7)      | 21.2 (#7)
W2V2-B-VP100K                  | 21.6 (#8)       | 22.4 (#8)        | 29.9 (#8)      | 33.4 (#8)
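
The speech recognition scores are word error rates (WER, %): the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single rolling row of the DP table.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(d[j] + 1,                          # deletion
                      d[j - 1] + 1,                      # insertion
                      prev + (ref[i - 1] != hyp[j - 1])) # substitution / match
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

# One substitution over six reference words -> 16.7%.
print(f"{100 * wer('the cat sat on the mat', 'the cat sat on a mat'):.1f}%")
```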
