TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Key Information Extraction	CORD	LayoutLMv3 Large	F1	97.46	# 2
Named Entity Recognition (NER)	CORD-r	LayoutLMv3	F1	82.72	# 3
Key Information Extraction	EPHOIE	LayoutLMv3	Average F1	99.21	# 1
Document AI	EPHOIE	LayoutLMv3	Average F1	99.21	# 1
Relation Extraction	FUNSD	LayoutLMv3 Large	F1	80.35	# 2
Semantic entity labeling	FUNSD	LayoutLMv3 Large	F1	92.08	# 5
Named Entity Recognition (NER)	FUNSD-r	LayoutLMv3	F1	78.77	# 2
Document Layout Analysis	PubLayNet val	LayoutLMv3-B	Text	0.945	# 5
Document Layout Analysis	PubLayNet val	LayoutLMv3-B	Title	0.906	# 5
Document Layout Analysis	PubLayNet val	LayoutLMv3-B	List	0.955	# 5
Document Layout Analysis	PubLayNet val	LayoutLMv3-B	Table	0.979	# 3
Document Layout Analysis	PubLayNet val	LayoutLMv3-B	Figure	0.970	# 4
Document Layout Analysis	PubLayNet val	LayoutLMv3-B	Overall	0.951	# 5
Document Image Classification	RVL-CDIP	LayoutLMv3 Large	Accuracy	95.93%	# 4
Document Image Classification	RVL-CDIP	LayoutLMv3 Large	Parameters	368M	# 29
Document Image Classification	RVL-CDIP	LayoutLMv3 Base	Accuracy	95.44%	# 9
Document Image Classification	RVL-CDIP	LayoutLMv3 Base	Parameters	133M	# 20

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

18 Apr 2022 · Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
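
The word-patch alignment objective mentioned in the abstract can be illustrated with a small sketch of how training labels might be derived. This is not the authors' implementation: the function names, the pixel-box format, the 224-pixel image with 16-pixel patches, and the rule that a word counts as unaligned when any patch it overlaps is masked are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def patches_for_box(box: Box, image_size: int, patch_size: int) -> Set[int]:
    """Return the row-major indices of the grid patches a word box overlaps."""
    per_side = image_size // patch_size
    x0, y0, x1, y1 = box
    # Convert the pixel range to a patch row/column range, clamped to the grid.
    col0 = max(0, x0 // patch_size)
    row0 = max(0, y0 // patch_size)
    col1 = min(per_side - 1, (x1 - 1) // patch_size)
    row1 = min(per_side - 1, (y1 - 1) // patch_size)
    return {
        r * per_side + c
        for r in range(row0, row1 + 1)
        for c in range(col0, col1 + 1)
    }


def word_patch_alignment_labels(
    boxes: List[Box],
    masked_patches: Set[int],
    image_size: int = 224,  # assumed input resolution
    patch_size: int = 16,   # assumed patch size
) -> List[int]:
    """Label each word 1 (aligned) if none of its patches is masked, else 0."""
    labels = []
    for box in boxes:
        covered = patches_for_box(box, image_size, patch_size)
        labels.append(0 if covered & masked_patches else 1)
    return labels


# Three words on a 14 x 14 patch grid; only patch 2 is masked.
boxes = [(0, 0, 16, 16), (16, 0, 48, 16), (0, 16, 10, 30)]
labels = word_patch_alignment_labels(boxes, masked_patches={2})
print(labels)  # [1, 0, 1]
```

In this toy case the second word spans patches 1 and 2, so masking patch 2 marks it as unaligned, while the other two words keep the aligned label. A classifier head trained on such labels would be one way to realize the objective the abstract describes.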

Code

microsoft/unilm (official), 18,328 stars

huggingface/transformers, 125,059 stars

Tasks

Document AI

Document Image Classification

Document Layout Analysis

Image Classification

Key Information Extraction

Language Modelling

Masked Language Modeling

Named Entity Recognition (NER)

Question Answering

Relation Extraction

Representation Learning

Semantic entity labeling

Visual Question Answering

Visual Question Answering (VQA)

Datasets

FUNSD

DocVQA

PubLayNet

RVL-CDIP

CORD

EPHOIE

CORD-r

FUNSD-r

Results from the Paper

Ranked #1 on Key Information Extraction on EPHOIE

