Multimodal Machine Translation

35 papers with code • 3 benchmarks • 5 datasets

Multimodal machine translation is the task of performing machine translation with multiple data sources, for example, translating the sentence "a bird is flying over water" together with an image of a bird over water into German text.

(Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
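
The basic setup can be illustrated with a minimal PyTorch sketch, purely as an illustrative assumption rather than any specific published system: source tokens and pre-extracted image features are encoded together, and a decoder produces tokens in the target language. All module names, dimensions, and the dummy inputs below are made up for the example.

```python
# Toy sketch of a multimodal translation model: fuse text and image features,
# then decode the target-language sentence. Illustrative only.
import torch
import torch.nn as nn

class ToyMultimodalTranslator(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, d_model=256, img_feat_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Project pre-extracted image features (e.g. from a CNN) into the model dimension.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, img_feats, tgt_tokens):
        # Fuse modalities by concatenating projected image regions with the token
        # sequence, so the encoder attends jointly over text and visual context.
        text = self.src_embed(src_tokens)                  # (B, S, d_model)
        image = self.img_proj(img_feats)                   # (B, R, d_model)
        memory_input = torch.cat([text, image], dim=1)     # (B, S+R, d_model)
        decoded = self.transformer(memory_input, self.tgt_embed(tgt_tokens))
        return self.out(decoded)                           # (B, T, tgt_vocab)

# Dummy batch: 2 sentences, 7 source tokens, 4 image regions, 6 target tokens.
model = ToyMultimodalTranslator()
logits = model(
    src_tokens=torch.randint(0, 1000, (2, 7)),
    img_feats=torch.randn(2, 4, 512),
    tgt_tokens=torch.randint(0, 1000, (2, 6)),
)
print(logits.shape)  # torch.Size([2, 6, 1000])
```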

Most implemented papers

BERTGEN: Multi-task Generation through BERT

ImperialNLP/BertGen ACL 2021

We present BERTGEN, a novel generative, decoder-only model that extends BERT by fusing the multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively.

Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

jiaodali/vision-matters-when-it-should EMNLP 2021

Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.

VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation

ku-nlp/visa LREC 2022

Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information largely ineffective for generating appropriate translations.

MSCTD: A Multimodal Sentiment Chat Translation Dataset

xl2248/msctd ACL 2022

In this work, we introduce a new task named Multimodal Chat Translation (MCT), aiming to generate more accurate translations with the help of the associated dialogue history and visual context.

Neural Machine Translation with Phrase-Level Universal Visual Representations

ictnlp/pluvr ACL 2022

Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require a paired input of source sentence and image, which makes them suffer from a shortage of sentence-image pairs.

VALHALLA: Visual Hallucination for Machine Translation

jerryyli/valhalla-nmt CVPR 2022

In particular, given a source sentence, an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are used to obtain the target translation.
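
A hedged sketch of that two-stage flow, illustrative only and not the authors' implementation, might look as follows: a small transformer predicts discrete visual codes from the source text, their embeddings are concatenated with the text, and a standard encoder-decoder translates from the fused sequence. All module names and sizes are assumptions.

```python
# Two-stage sketch: (1) "hallucinate" discrete visual tokens from the source text,
# (2) translate from the concatenation of text and hallucinated visual embeddings,
# so no real image is needed at test time. Illustrative only.
import torch
import torch.nn as nn

class ToyHallucinationMMT(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, visual_codebook=512, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        # Stage 1: predict a sequence of discrete visual codes from the text.
        self.hallucinator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.to_visual_code = nn.Linear(d_model, visual_codebook)
        self.visual_embed = nn.Embedding(visual_codebook, d_model)
        # Stage 2: translate from the combined text + hallucinated visual sequence.
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.translator = nn.Transformer(d_model=d_model, nhead=4,
                                         num_encoder_layers=2, num_decoder_layers=2,
                                         batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        text = self.src_embed(src_tokens)                       # (B, S, d)
        # One discrete visual code per source position (greedy argmax here for brevity;
        # the description above predicts the codes autoregressively).
        code_logits = self.to_visual_code(self.hallucinator(text))
        visual = self.visual_embed(code_logits.argmax(dim=-1))  # (B, S, d)
        fused = torch.cat([text, visual], dim=1)                # (B, 2S, d)
        decoded = self.translator(fused, self.tgt_embed(tgt_tokens))
        return self.out(decoded)

model = ToyHallucinationMMT()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```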

Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation

pengr/ikd-mmt 10 Oct 2022

In this work, we introduce IKD-MMT, a novel MMT framework that supports an image-free inference phase via an inversion knowledge distillation scheme.
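
As a rough, generic illustration of distilling visual knowledge into a text-only pathway, not the paper's exact scheme, a text-driven feature generator can be trained to match the output of a frozen image encoder, so that the generated features stand in for the missing image at inference time. Every module, name, and size below is an illustrative assumption.

```python
# Generic distillation sketch: during training, a text-driven generator is pulled
# toward real image features; at inference, its output replaces the image.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, img_feat_dim = 256, 512

# Frozen "teacher" that produces image features during training only.
teacher_image_encoder = nn.Linear(2048, img_feat_dim).eval()
for p in teacher_image_encoder.parameters():
    p.requires_grad_(False)

# Student: maps a pooled text representation to a pseudo image feature.
text_to_image = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                              nn.Linear(d_model, img_feat_dim))

# Training step: the distillation loss pulls the text-generated feature toward the
# teacher's image feature (a translation loss would be added on top in practice).
pooled_text = torch.randn(8, d_model)   # stand-in for an encoded source sentence
raw_image = torch.randn(8, 2048)        # stand-in for CNN image features
with torch.no_grad():
    target_feat = teacher_image_encoder(raw_image)
distill_loss = F.mse_loss(text_to_image(pooled_text), target_feat)
distill_loss.backward()

# Inference: no image is available, so the generated feature replaces it.
pseudo_image_feat = text_to_image(torch.randn(1, d_model))
print(pseudo_image_feat.shape)  # torch.Size([1, 512])
```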

Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination

scofield7419/ummt-vsh 20 May 2023

In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained on pairs of source text and image but tested with only source-text inputs.

BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation

deeplearnxmu/bigvideo-vmt 23 May 2023

We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous, which contains ambiguous words, and Unambiguous, in which the textual context is self-contained for translation.

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

devaansh100/cliptrans ICCV 2023

Meanwhile, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, both of which have shown exceptional generalisation ability.