TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Response Generation	SIMMC2.0	MTN	BLEU	21.7	# 4
Dialogue State Tracking	SIMMC2.0	MTN	Slot F1	76.7	# 5
Dialogue State Tracking	SIMMC2.0	MTN	Act F1	93.4	# 5

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-transformer-networks-for-end-to/response-generation-on-simmc2-0)](https://paperswithcode.com/sota/response-generation-on-simmc2-0?p=multimodal-transformer-networks-for-end-to)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-transformer-networks-for-end-to/dialogue-state-tracking-on-simmc2-0)](https://paperswithcode.com/sota/dialogue-state-tracking-on-simmc2-0?p=multimodal-transformer-networks-for-end-to)`

Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

ACL 2019 · Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C. H. Hoi

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than developing traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective at capturing complex long-term dependencies (such as those in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure that simulates token-level decoding to improve the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task and obtains promising performance. We implemented our models using PyTorch, and the code is released at https://github.com/henryhungle/MTN.
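
As a rough illustration of what "query-aware" features can mean, the sketch below applies plain scaled dot-product attention, using the text query vectors as attention queries and per-frame video feature vectors as keys and values. It is not the authors' implementation (the released code is in PyTorch). The function name `query_aware_features` and the toy vectors are hypothetical, and learned projections, multiple heads, and the auto-encoder component are all omitted as simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_aware_features(query_tokens, frame_features):
    """Return one attention-weighted mix of frame vectors per query token.

    Hypothetical helper: scaled dot-product attention with no learned weights.
    """
    dim = len(frame_features[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in query_tokens:
        # Similarity between this query token and every video frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, frame)) / scale
            for frame in frame_features
        ]
        weights = softmax(scores)
        # Weighted sum of frame vectors: frames matching the query dominate.
        mixed = [
            sum(w * frame[d] for w, frame in zip(weights, frame_features))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Toy data (assumed): three 2-d "frame" vectors and two 2-d "query token" vectors.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[2.0, 0.0], [0.0, 2.0]]
features = query_aware_features(query, frames)
```

In this toy setup, the first query token leans toward frames with a large first component and the second toward frames with a large second component, so the same video yields different features depending on the query.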

Code

henryhungle/MTN official

Tasks

Dialogue State Tracking

Response Generation

Datasets

SIMMC2.0

Results from the Paper

Ranked #4 on Response Generation on SIMMC2.0

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
