Multimodal Intent Recognition
8 papers with code • 3 benchmarks • 3 datasets
Intent recognition on multimodal content.
Image source: "MIntRec: A New Dataset for Multimodal Intent Recognition"
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
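As a concrete reference point, the snippet below loads a pre-trained BERT encoder and extracts a sentence-level representation for an utterance. It is a minimal sketch assuming the Hugging Face `transformers` library, PyTorch, and the `bert-base-uncased` checkpoint; none of these specifics are prescribed by the papers listed here.

```python
# Minimal sketch: encode an utterance with pre-trained BERT
# (assumes Hugging Face `transformers` and PyTorch are installed).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Could you book a table for two?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The hidden state of the [CLS] token is a common sentence-level feature.
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, 768)
print(cls_embedding.shape)
```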
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks.
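Since model size is the variable at issue here, one quick way to inspect the trade-off is to compare parameter counts across pre-trained checkpoints. A small sketch, assuming the Hugging Face `bert-base-uncased` and `albert-base-v2` checkpoints, chosen only for illustration:

```python
# Compare parameter counts of two pre-trained encoders
# (assumes the Hugging Face `transformers` library).
from transformers import AutoModel

for name in ["bert-base-uncased", "albert-base-v2"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```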
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
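The pre-train-then-fine-tune recipe described above maps directly onto intent recognition: a pre-trained encoder is topped with a classification head and trained on labeled intent data. A hedged sketch, assuming Hugging Face `transformers` and PyTorch; the label count, utterance, and intent id are placeholders, not values from any paper on this page.

```python
# Sketch: fine-tuning a pre-trained encoder for intent classification
# (assumes Hugging Face `transformers` and PyTorch; the label count
# is a placeholder to be set to the dataset's intent inventory).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_INTENTS = 20  # placeholder; use your dataset's number of intent classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_INTENTS
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step on a single (utterance, intent) pair.
batch = tokenizer("Could you turn down the music?", return_tensors="pt")
labels = torch.tensor([3])  # hypothetical intent id

outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()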
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
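To give a feel for how such jointly pre-trained vision-and-language models are used, the sketch below scores an image against candidate intent descriptions with CLIP. CLIP is only a stand-in chosen because its API is widely available; it is not necessarily the VLP model used by any paper on this page, and the image path and candidate texts are made up.

```python
# Sketch: scoring an image against text with a pre-trained vision-language
# model (assumes `transformers`, `Pillow`, and a local image file).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # placeholder path
texts = ["asking for help", "giving a compliment", "making a complaint"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape: (1, 3)
probs = logits.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```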
This corpus is the largest multi-modal conversation dataset by number of dialogues, exceeding previous datasets by 88x.
In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model.
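SPECTRA's actual pre-training objective is specified in the paper; purely as a generic illustration of explicit cross-modal alignment, the sketch below computes a symmetric contrastive (InfoNCE-style) loss between paired speech and text embeddings. The tensor shapes and temperature are assumptions, not SPECTRA's specification.

```python
# Generic contrastive alignment sketch between speech and text embeddings
# (PyTorch; this is NOT SPECTRA's exact objective, only an illustration).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """speech_emb, text_emb: (batch, dim) embeddings of paired segments."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0))            # matched pairs lie on the diagonal
    # Symmetric loss: align speech-to-text and text-to-speech.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```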
The model utilizes a combination of several fundamental experts to accommodate multiple dialogue-related tasks, and can be pre-trained using limited dialogue data and extensive non-dialogue multi-modal data.
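To make "a combination of several fundamental experts" concrete, here is a minimal gated mixture-of-experts layer in PyTorch. It is a textbook sketch rather than the architecture of any specific paper above; the layer sizes and number of experts are arbitrary.

```python
# Minimal mixture-of-experts sketch (PyTorch): a generic illustration of
# combining several expert sub-networks via a learned gate, not any
# specific paper's architecture.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                       # x: (batch, dim)
        weights = self.gate(x).softmax(dim=-1)  # (batch, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

layer = MixtureOfExperts()
print(layer(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```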