LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

18 Apr 2022 · Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
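
The word-patch alignment objective described in the abstract can be illustrated with a minimal sketch of how its targets might be built. This is not the authors' implementation: the 224-pixel image, the 16-pixel patches, the rule that a word belongs to the patch containing its box center, and the names `patch_index` and `wpa_labels` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical word box format: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[float, float, float, float]


def patch_index(box: Box, image_size: int = 224, patch_size: int = 16) -> int:
    """Return the row-major index of the patch containing the box center.

    Assumption: a word is tied to a single patch through its center point.
    """
    per_row = image_size // patch_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so that points on the right or bottom edge stay inside the grid.
    col = min(int(cx // patch_size), per_row - 1)
    row = min(int(cy // patch_size), per_row - 1)
    return row * per_row + col


def wpa_labels(
    word_boxes: List[Box],
    masked_patches: Set[int],
    image_size: int = 224,
    patch_size: int = 16,
) -> List[int]:
    """Binary target per word: 1 if its image patch is masked, else 0."""
    return [
        int(patch_index(box, image_size, patch_size) in masked_patches)
        for box in word_boxes
    ]


# Three words on a 14 x 14 patch grid; only patch 29 is masked.
boxes = [(0, 0, 10, 10), (20, 40, 30, 50), (220, 220, 224, 224)]
labels = wpa_labels(boxes, masked_patches={29})
```

In this sketch a classifier head on each text token would be trained against `labels`. How the actual model treats words spanning several patches, or words whose text tokens are themselves masked, is not stated in the abstract and is left out here.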

Results from the Paper


Task                            Dataset        Model             Metric Name  Metric Value  Global Rank
Key Information Extraction      CORD           LayoutLMv3 Large  F1           97.46         #2
Named Entity Recognition (NER)  CORD-r         LayoutLMv3        F1           82.72         #3
Key Information Extraction      EPHOIE         LayoutLMv3        Average F1   99.21         #1
Document AI                     EPHOIE         LayoutLMv3        Average F1   99.21         #1
Relation Extraction             FUNSD          LayoutLMv3 Large  F1           80.35         #2
Semantic entity labeling        FUNSD          LayoutLMv3 Large  F1           92.08         #5
Named Entity Recognition (NER)  FUNSD-r        LayoutLMv3        F1           78.77         #2
Document Layout Analysis        PubLayNet val  LayoutLMv3-B      Text         0.945         #5
                                                                 Title        0.906         #5
                                                                 List         0.955         #5
                                                                 Table        0.979         #3
                                                                 Figure       0.970         #4
                                                                 Overall      0.951         #5
Document Image Classification   RVL-CDIP       LayoutLMv3 Large  Accuracy     95.93%        #4
                                                                 Parameters   368M          #29
Document Image Classification   RVL-CDIP       LayoutLMv3 Base   Accuracy     95.44%        #9
                                                                 Parameters   133M          #20
