PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts

24 May 2023 · Yunshui Li, Binyuan Hui, Zhichao Yin, Min Yang, Fei Huang, Yongbin Li

Perceiving multi-modal information and conducting dialogues with humans is a long-term goal of artificial intelligence. Pre-training is commonly regarded as an effective approach for multi-modal dialogue. However, because multi-modal dialogue data is limited, research on multi-modal dialogue pre-training remains scarce. Another challenge arises from the broad scope of multi-modal dialogue, which involves various modalities and tasks. Moreover, new forms of tasks may arise at unpredictable points in the future. Hence, multi-modal dialogue models must be flexible enough to adapt to such scenarios. This paper proposes PaCE, a unified, structured, compositional multi-modal dialogue pre-training framework. It combines several fundamental experts to accommodate multiple dialogue-related tasks and can be pre-trained using limited dialogue data and extensive non-dialogue multi-modal data. Furthermore, we propose a progressive training method in which old experts can assist new experts, facilitating the expansion of their capabilities. Experimental results demonstrate that PaCE achieves state-of-the-art results on eight multi-modal dialogue benchmarks.


Results from the Paper


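Task                           Dataset     Model  Metric                    Value  Global Rank
Text Retrieval                 Image-Chat  PaCE   R@1                       51.9   #1
Text Retrieval                 Image-Chat  PaCE   R@5                       76.8   #1
Text Retrieval                 Image-Chat  PaCE   Sum(R@1,5)                128.7  #1
Response Generation            MMConv      PaCE   Inform                    34.5   #1
Response Generation            MMConv      PaCE   Success                   13.9   #1
Response Generation            MMConv      PaCE   BLEU                      22     #1
Response Generation            MMConv      PaCE   Comb.                     44.7   #1
Dialogue State Tracking        MMConv      PaCE   Categorical Accuracy      92.2   #1
Dialogue State Tracking        MMConv      PaCE   Non-Categorical Accuracy  43.4   #1
Dialogue State Tracking        MMConv      PaCE   Overall                   39.2   #1
Multimodal Intent Recognition  MMDialog    PaCE   F1                        77.6   #1
Multimodal Intent Recognition  PhotoChat   PaCE   F1                        63.8   #1
Multimodal Intent Recognition  PhotoChat   PaCE   Precision                 63.3   #1
Multimodal Intent Recognition  PhotoChat   PaCE   Recall                    68     #1
Image Retrieval                PhotoChat   PaCE   R@1                       15.2   #1
Image Retrieval                PhotoChat   PaCE   R@5                       36.7   #1
Image Retrieval                PhotoChat   PaCE   R@10                      49.6   #1
Image Retrieval                PhotoChat   PaCE   Sum(R@1,5,10)             101.5  #1
Dialogue State Tracking        SIMMC2.0    PaCE   Slot F1                   87.0   #2
Dialogue State Tracking        SIMMC2.0    PaCE   Act F1                    97.1   #1
Response Generation            SIMMC2.0    PaCE   BLEU                      34.1   #1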
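
The retrieval rows report recall at several cutoffs (R@1, R@5, R@10) plus their sum. As a minimal sketch of how such numbers are usually computed, assuming each query has exactly one gold candidate and a ranked candidate list, the snippet below defines a recall-at-k helper and checks that the reported Sum entries equal the sum of the individual recall values. The function names and the toy rankings are hypothetical illustrations, not taken from the paper or its evaluation code.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Percentage of queries whose gold candidate appears in the top-k of its ranking."""
    hits = sum(1 for ranking, g in zip(rankings, gold) if g in ranking[:k])
    return 100.0 * hits / len(gold)


def sum_recall(rankings: Sequence[Sequence[int]], gold: Sequence[int], ks: Sequence[int]) -> float:
    """Sum of recall@k over the given cutoffs, e.g. Sum(R@1,5)."""
    return sum(recall_at_k(rankings, gold, k) for k in ks)


# Toy data (hypothetical): four queries, each with four ranked candidate ids.
toy_rankings = [[3, 1, 2, 0], [0, 2, 1, 3], [2, 3, 0, 1], [1, 0, 3, 2]]
toy_gold = [3, 1, 0, 2]

toy_r1 = recall_at_k(toy_rankings, toy_gold, 1)
toy_r3 = recall_at_k(toy_rankings, toy_gold, 3)
toy_sum = sum_recall(toy_rankings, toy_gold, [1, 3])

# Consistency check on the values reported in the table above.
image_chat_sum = round(51.9 + 76.8, 1)         # Sum(R@1,5) on Image-Chat
photochat_sum = round(15.2 + 36.7 + 49.6, 1)   # Sum(R@1,5,10) on PhotoChat
```

Under this reading, the Sum rows carry no information beyond the individual recall values; they are a convenience for ranking systems with a single number.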
