Multimodal Machine Translation

34 papers with code • 3 benchmarks • 5 datasets

Multimodal machine translation (MMT) is the task of translating text with the help of one or more additional data sources, such as an image. For example, the English sentence "a bird is flying over water" is translated into German together with an image of a bird over water.

(Image credit: Findings of the Third Shared Task on Multimodal Machine Translation)
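One common motivation for the extra modality is lexical ambiguity: a source word may have several target-language senses, and the image can indicate which one applies. The sketch below illustrates only that idea. It is not a translation system and is not taken from any of the listed papers. The `MMTInput` container, the `SENSES` lexicon, and the assumption that an image has already been reduced to a set of object labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class MMTInput:
    """Hypothetical container for one multimodal translation input."""
    source_text: str
    # Assumption: the image has already been reduced to object labels.
    image_tags: FrozenSet[str]


# Toy English-to-German sense inventory, keyed by a visual cue.
SENSES = {
    "bank": {"river": "Ufer", "building": "Bank"},
}
# Fallback gloss used when no visual cue matches.
DEFAULT = {"bank": "Bank"}


def disambiguate(word: str, image_tags: FrozenSet[str]) -> str:
    """Pick a German gloss for `word`, using image tags when a cue matches."""
    cues = SENSES.get(word)
    if cues is None:
        return word  # not in the toy lexicon; returned unchanged
    for cue, gloss in cues.items():
        if cue in image_tags:
            return gloss
    return DEFAULT[word]


example = MMTInput("bank", frozenset({"river", "water"}))
print(disambiguate(example.source_text, example.image_tags))
```

With the river image tags the sketch selects "Ufer" (river bank); with no matching tag it falls back to the text-only default. Real MMT models learn this interaction from image or video features rather than from a hand-written lookup.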

Benchmarks

These leaderboards track progress in Multimodal Machine Translation.

Dataset	Best Model
Multi30K	ERNIE-UniX2
Hindi Visual Genome (Test Set)	ViTA
Hindi Visual Genome (Challenge Set)	ViTA

Libraries

Use these libraries to find Multimodal Machine Translation models and implementations.

facebookresearch/seamless_communica… • 2 papers • 10,169

lium-lst/nmtpy • 2 papers • 126

Latest papers

Seamless: Multilingual Expressive and Streaming Speech Translation

facebookresearch/seamless_communication • 8 Dec 2023

In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.

Video-Helpful Multimodal Machine Translation

ku-nlp/video-helpful-mmt • 31 Oct 2023

In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation.

Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs

libeineu/mmt-vqa • 26 Oct 2023

This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete.

Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation

ictnlp/sammt • 20 Oct 2023

Multimodal machine translation (MMT) simultaneously takes the source sentence and a relevant image as input for translation.

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

devaansh100/cliptrans • ICCV 2023

Simultaneously, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, which have shown exceptional generalisation ability.

29 Aug 2023

BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation

deeplearnxmu/bigvideo-vmt • 23 May 2023

We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous with the presence of ambiguous words, and Unambiguous in which the text context is self-contained for translation.

Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination

scofield7419/ummt-vsh • 20 May 2023

In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained with source-text image pairs, and tested with only source-text inputs.

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

matthieufp/vgamt • 20 Dec 2022

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.

Distill the Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation

pengr/ikd-mmt • 10 Oct 2022

Thus, in this work, we introduce IKD-MMT, a novel MMT framework to support the image-free inference phase via an inversion knowledge distillation scheme.

VALHALLA: Visual Hallucination for Machine Translation

jerryyli/valhalla-nmt • CVPR 2022

In particular, given a source sentence, an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are utilized to obtain the target translation.

31 May 2022
