Multimodal Machine Translation

35 papers with code • 3 benchmarks • 5 datasets

Multimodal machine translation is the task of performing machine translation with multiple data sources, for example, translating the sentence "a bird is flying over water" together with an image of a bird over water into German text.

(Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
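
The basic setup can be illustrated with a minimal PyTorch sketch, purely as an illustrative assumption rather than any specific published system: source tokens and pre-extracted image features are encoded together, and a decoder produces tokens in the target language. All module names, dimensions, and the dummy inputs below are made up for the example.

```python
# Toy sketch of a multimodal translation model: fuse text and image features,
# then decode the target-language sentence. Illustrative only.
import torch
import torch.nn as nn

class ToyMultimodalTranslator(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, d_model=256, img_feat_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Project pre-extracted image features (e.g. from a CNN) into the model dimension.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, img_feats, tgt_tokens):
        # Fuse modalities by concatenating projected image regions with the token
        # sequence, so the encoder attends jointly over text and visual context.
        text = self.src_embed(src_tokens)                  # (B, S, d_model)
        image = self.img_proj(img_feats)                   # (B, R, d_model)
        memory_input = torch.cat([text, image], dim=1)     # (B, S+R, d_model)
        decoded = self.transformer(memory_input, self.tgt_embed(tgt_tokens))
        return self.out(decoded)                           # (B, T, tgt_vocab)

# Dummy batch: 2 sentences, 7 source tokens, 4 image regions, 6 target tokens.
model = ToyMultimodalTranslator()
logits = model(
    src_tokens=torch.randint(0, 1000, (2, 7)),
    img_feats=torch.randn(2, 4, 512),
    tgt_tokens=torch.randint(0, 1000, (2, 6)),
)
print(logits.shape)  # torch.Size([2, 6, 1000])
```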

Most implemented papers

BERTGEN: Multi-task Generation through BERT

ImperialNLP/BertGen ACL 2021

We present BERTGEN, a novel generative, decoder-only model that extends BERT by fusing the multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively.

Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

jiaodali/vision-matters-when-it-should EMNLP 2021

Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available.

VISA: An Ambiguous Subtitles Dataset for Visual Scene-Aware Machine Translation

ku-nlp/visa LREC 2022

Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information largely ineffective for generating appropriate translations.

MSCTD: A Multimodal Sentiment Chat Translation Dataset

xl2248/msctd ACL 2022

In this work, we introduce a new task named Multimodal Chat Translation (MCT), aiming to generate more accurate translations with the help of the associated dialogue history and visual context.

Neural Machine Translation with Phrase-Level Universal Visual Representations

ictnlp/pluvr ACL 2022

Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require a paired input of source sentence and image, which makes them suffer from a shortage of sentence-image pairs.

VALHALLA: Visual Hallucination for Machine Translation

jerryyli/valhalla-nmt CVPR 2022

In particular, given a source sentence, an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are used to obtain the target translation.
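
A hedged sketch of that two-stage flow, illustrative only and not the authors' implementation, might look as follows: a small transformer predicts discrete visual codes from the source text, their embeddings are concatenated with the text, and a standard encoder-decoder translates from the fused sequence. All module names and sizes are assumptions.

```python
# Two-stage sketch: (1) "hallucinate" discrete visual tokens from the source text,
# (2) translate from the concatenation of text and hallucinated visual embeddings,
# so no real image is needed at test time. Illustrative only.
import torch
import torch.nn as nn

class ToyHallucinationMMT(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, visual_codebook=512, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        # Stage 1: predict a sequence of discrete visual codes from the text.
        self.hallucinator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.to_visual_code = nn.Linear(d_model, visual_codebook)
        self.visual_embed = nn.Embedding(visual_codebook, d_model)
        # Stage 2: translate from the combined text + hallucinated visual sequence.
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.translator = nn.Transformer(d_model=d_model, nhead=4,
                                         num_encoder_layers=2, num_decoder_layers=2,
                                         batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        text = self.src_embed(src_tokens)                       # (B, S, d)
        # One discrete visual code per source position (greedy argmax here for brevity;
        # the description above predicts the codes autoregressively).
        code_logits = self.to_visual_code(self.hallucinator(text))
        visual = self.visual_embed(code_logits.argmax(dim=-1))  # (B, S, d)
        fused = torch.cat([text, visual], dim=1)                # (B, 2S, d)
        decoded = self.translator(fused, self.tgt_embed(tgt_tokens))
        return self.out(decoded)

model = ToyHallucinationMMT()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```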

Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation

pengr/ikd-mmt 10 Oct 2022

In this work, we introduce IKD-MMT, a novel MMT framework that supports an image-free inference phase via an inversion knowledge distillation scheme.
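
As a rough, generic illustration of distilling visual knowledge into a text-only pathway, not the paper's exact scheme, a text-driven feature generator can be trained to match the output of a frozen image encoder, so that the generated features stand in for the missing image at inference time. Every module, name, and size below is an illustrative assumption.

```python
# Generic distillation sketch: during training, a text-driven generator is pulled
# toward real image features; at inference, its output replaces the image.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, img_feat_dim = 256, 512

# Frozen "teacher" that produces image features during training only.
teacher_image_encoder = nn.Linear(2048, img_feat_dim).eval()
for p in teacher_image_encoder.parameters():
    p.requires_grad_(False)

# Student: maps a pooled text representation to a pseudo image feature.
text_to_image = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                              nn.Linear(d_model, img_feat_dim))

# Training step: the distillation loss pulls the text-generated feature toward the
# teacher's image feature (a translation loss would be added on top in practice).
pooled_text = torch.randn(8, d_model)   # stand-in for an encoded source sentence
raw_image = torch.randn(8, 2048)        # stand-in for CNN image features
with torch.no_grad():
    target_feat = teacher_image_encoder(raw_image)
distill_loss = F.mse_loss(text_to_image(pooled_text), target_feat)
distill_loss.backward()

# Inference: no image is available, so the generated feature replaces it.
pseudo_image_feat = text_to_image(torch.randn(1, d_model))
print(pseudo_image_feat.shape)  # torch.Size([1, 512])
```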

Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination

scofield7419/ummt-vsh 20 May 2023

In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained on pairs of source text and image but tested with only source-text inputs.

BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation

deeplearnxmu/bigvideo-vmt 23 May 2023

We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous, which contains ambiguous words, and Unambiguous, in which the textual context is self-contained for translation.

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

devaansh100/cliptrans ICCV 2023

Meanwhile, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, both of which have shown exceptional generalisation ability.