Multimodal Transformer for Unaligned Multimodal Language Sequences

ACL 2019 · Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges exist in modeling such multimodal human language time-series data: 1) inherent non-alignment of the data, due to variable sampling rates across the sequences from each modality; and 2) long-range dependencies between elements across modalities...
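The paper addresses these challenges with directional pairwise crossmodal attention, in which one modality's sequence queries another's without requiring the two to be aligned or of equal length. Below is a minimal NumPy sketch of that idea under illustrative assumptions (single head, toy projection matrices, random data); it is not the authors' implementation, only a demonstration of how unaligned sequences can attend to each other:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target, source, w_q, w_k, w_v):
    """Directional crossmodal attention sketch: the target modality
    supplies the queries and the source modality supplies keys/values,
    so the two sequences need not be aligned or of equal length
    (e.g. word tokens attending to audio frames)."""
    q = target @ w_q                            # (len_t, d)
    k = source @ w_k                            # (len_s, d)
    v = source @ w_v                            # (len_s, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (len_t, len_s)
    weights = softmax(scores, axis=-1)          # each query row sums to 1
    return weights @ v, weights

# Toy example: 6 text tokens attend to 10 audio frames (unaligned lengths).
d = 8
text = rng.standard_normal((6, d))
audio = rng.standard_normal((10, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = crossmodal_attention(text, audio, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (6, 8) (6, 10)
```

Because the attention matrix is (target length × source length), no resampling or forced word-level alignment step is needed before fusing the modalities.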


Results from the Paper


TASK                           DATASET  MODEL  METRIC    VALUE  GLOBAL RANK
Multimodal Sentiment Analysis  MOSI     MulT   Accuracy  83%    #2
Multimodal Sentiment Analysis  MOSI     MulT   F1 score  82.8   #1
