5 Feb 2021
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
In this paper, we propose a new approach to learning multimodal multilingual embeddings that match images with their relevant captions in two languages.