TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Multimodal Sentiment Analysis	CMU-MOSEI	SPECTRA	Accuracy	87.34	# 3
Multimodal Sentiment Analysis	CMU-MOSI	SPECTRA	Acc-2	87.5	# 1
Emotion Recognition in Conversation	IEMOCAP	SPECTRA	Accuracy	67.94	# 15
Multimodal Intent Recognition	MIntRec	SPECTRA	Accuracy (20 classes)	73.48	# 3
Multimodal Sentiment Analysis	MOSI	SPECTRA	Accuracy	87.50	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/speech-text-dialog-pre-training-for-spoken/multimodal-sentiment-analysis-on-cmu-mosi)](https://paperswithcode.com/sota/multimodal-sentiment-analysis-on-cmu-mosi?p=speech-text-dialog-pre-training-for-spoken)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/speech-text-dialog-pre-training-for-spoken/multimodal-sentiment-analysis-on-mosi)](https://paperswithcode.com/sota/multimodal-sentiment-analysis-on-mosi?p=speech-text-dialog-pre-training-for-spoken)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/speech-text-dialog-pre-training-for-spoken/multimodal-sentiment-analysis-on-cmu-mosei-1)](https://paperswithcode.com/sota/multimodal-sentiment-analysis-on-cmu-mosei-1?p=speech-text-dialog-pre-training-for-spoken)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/speech-text-dialog-pre-training-for-spoken/multimodal-intent-recognition-on-mintrec)](https://paperswithcode.com/sota/multimodal-intent-recognition-on-mintrec?p=speech-text-dialog-pre-training-for-spoken)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/speech-text-dialog-pre-training-for-spoken/emotion-recognition-in-conversation-on)](https://paperswithcode.com/sota/emotion-recognition-in-conversation-on?p=speech-text-dialog-pre-training-for-spoken)`

Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

19 May 2023 · Tianshu Yu, Haoyu Gao, Ting-En Lin, Min Yang, Yuchuan Wu, Wentao Ma, Chao Wang, Fei Huang, Yongbin Li

Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are tailored to one or two specific tasks and fail to cover a wide range of speech-text tasks. In addition, existing speech-text pre-training methods do not exploit the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. Concretely, to account for the temporal nature of the speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.
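The two pre-training tasks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the target normalization, the loss, or the negative-sampling scheme, so the duration-normalized targets, the L1 loss, and the random negatives drawn from other dialogs are all assumptions, and every function name and the `(token, start_sec, end_sec)` alignment format are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical word-level alignment record: (token, start_sec, end_sec).
Word = Tuple[str, float, float]
Span = Tuple[float, float]


def temporal_position_targets(words: Sequence[Word], duration: float) -> List[Span]:
    """Scale each word's start/end time to [0, 1] by the utterance duration.

    Normalizing by duration is an assumption made for this sketch.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    return [(start / duration, end / duration) for _, start, end in words]


def l1_alignment_loss(pred: Sequence[Span], target: Sequence[Span]) -> float:
    """Mean absolute error over all start/end values (an assumed loss choice)."""
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(abs(ps - ts) + abs(pe - te)
                for (ps, pe), (ts, te) in zip(pred, target))
    return total / (2 * len(target))


def response_selection_pairs(dialogs: Sequence[Sequence[str]],
                             rng: random.Random) -> List[Tuple[List[str], str, int]]:
    """Build (context, candidate, label) triples for response selection.

    Each context is paired with its true next turn (label 1) and with a
    random turn taken from a different dialog (label 0). Sampling negatives
    this way is an assumption made for this sketch.
    """
    if len(dialogs) < 2:
        raise ValueError("need at least two dialogs to sample negatives")
    pairs = []
    for i, turns in enumerate(dialogs):
        others = [j for j in range(len(dialogs)) if j != i]
        for t in range(1, len(turns)):
            context = list(turns[:t])
            pairs.append((context, turns[t], 1))
            negative = rng.choice(dialogs[rng.choice(others)])
            pairs.append((context, negative, 0))
    return pairs
```

In a real setup the predicted spans would come from a model head over the text tokens and the dialog turns would carry both audio and transcript; plain floats and strings stand in for them here.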

Code

alibabaresearch/damo-convai official

958

Tasks

Emotion Recognition in Conversation

Multimodal Intent Recognition

Multimodal Sentiment Analysis

Datasets

IEMOCAP

CMU-MOSEI

Multimodal Opinion-level Sentiment Intensity

CMU-MOSI

MIntRec

Results from the Paper

Ranked #1 on Multimodal Sentiment Analysis on MOSI

