Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

1 Jun 2022 · Yan Zeng, Wangchunshu Zhou, Ao Luo, Ziming Cheng, Xinsong Zhang

In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by the key observation that cross-lingual and cross-modal pre-training share the same goal: aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework treats both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that, while conceptually simpler, CCLM significantly outperforms the prior state of the art with an average absolute improvement of over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained model to surpass the translate-test performance of representative English vision-language models via zero-shot cross-lingual transfer.
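To make the contrastive half of the objective concrete, here is a minimal pure-Python sketch of a symmetric InfoNCE-style loss over paired "views" (image/caption or sentence/translation embeddings). The function names, toy embeddings, and temperature value are illustrative assumptions, not taken from the paper; the actual CCLM model operates on encoder outputs, not raw vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    # guard against zero vectors with a fallback of 1.0
    return math.sqrt(dot(u, u)) or 1.0

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired views.

    view_a[i] and view_b[i] embed the same object (e.g. an image
    and its caption, or a sentence and its translation); every
    other pairing in the batch serves as a negative.
    """
    n = len(view_a)
    # temperature-scaled cosine similarity matrix
    sims = [[cosine(a, b) / temperature for b in view_b] for a in view_a]
    loss = 0.0
    for i in range(n):
        # a -> b direction: row i should peak at column i
        row = sims[i]
        log_z = math.log(sum(math.exp(s) for s in row))
        loss += -(row[i] - log_z)
        # b -> a direction: column i should peak at row i
        col = [sims[j][i] for j in range(n)]
        log_z = math.log(sum(math.exp(s) for s in col))
        loss += -(col[i] - log_z)
    return loss / (2 * n)
```

Minimizing this loss is one standard way to maximize a lower bound on the mutual information between the two views: correctly paired views are pushed together while mismatched pairs in the batch are pushed apart. The conditional masked language modeling objective is applied alongside it on the fused representation of both views.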

| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | CCLM (base, 3M) | Accuracy (%) | 65.91±0.40 | #3 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | CCLM (base, 4M) | Accuracy (%) | 67.17±0.42 | #2 |
| Zero-Shot Cross-Lingual Visual Reasoning | MaRVL | CCLM-X2VLM-large | Accuracy (%) | 74.83 | #1 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | CCLM (base, 4M) | Recall@1 (%) | 73.46±0.09 | #2 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | CCLM (base, 3M) | Recall@1 (%) | 65.37±0.10 | #3 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | CCLM (base, 3M) | Recall@1 (%) | 67.35±0.31 | #3 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | CCLM (base, 4M) | Recall@1 (%) | 76.56±0.14 | #2 |
| Zero-Shot Cross-Lingual Text-to-Image Retrieval | xFlickr&CO | CCLM-X2VLM-large | Recall@1 (%) | 83.78 | #1 |
| Zero-Shot Cross-Lingual Image-to-Text Retrieval | xFlickr&CO | CCLM-X2VLM-large | Recall@1 (%) | 83.46 | #1 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | CCLM (base, 4M) | Accuracy (%) | 46.24±0.21 | #2 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | CCLM (base, 3M) | Accuracy (%) | 42.36±0.68 | #4 |
| Zero-Shot Cross-Lingual Visual Question Answering | xGQA | CCLM-X2VLM-large | Accuracy (%) | 56.25 | #1 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | CCLM (base, 4M) | Accuracy (%) | 73.32±0.24 | #4 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | CCLM (base, 3M) | Accuracy (%) | 74.64±0.69 | #3 |
| Zero-Shot Cross-Lingual Visual Natural Language Inference | XVNLI | CCLM-X2VLM-large | Accuracy (%) | 78.95 | #1 |
