LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture the cross-modality interaction in the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). The model and code are publicly available at https://aka.ms/layoutlmv2.
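
The spatial-aware self-attention mentioned in the abstract can be illustrated with a small sketch. It assumes the mechanism adds relative-position bias terms (one for the 1D token offset, one each for the horizontal and vertical offsets between bounding-box corners) to the raw attention scores before the softmax. The function names, the dictionary-based bias tables, and the toy values are hypothetical and chosen for readability; a trained model would learn these biases per attention head and would typically bucket the offsets.

```python
import math


def add_spatial_bias(scores, positions, boxes, bias_1d, bias_x, bias_y):
    """Return attention scores with relative-position biases added.

    scores:    n x n list of raw attention scores (query i, key j)
    positions: 1D token index of each token
    boxes:     (x, y) top-left corner of each token's bounding box
    bias_*:    hypothetical lookup tables mapping a relative offset to a
               scalar bias; unseen offsets contribute 0.0
    """
    n = len(scores)
    biased = []
    for i in range(n):
        row = []
        for j in range(n):
            d_seq = positions[j] - positions[i]  # 1D relative offset
            d_x = boxes[j][0] - boxes[i][0]      # horizontal offset
            d_y = boxes[j][1] - boxes[i][1]      # vertical offset
            row.append(
                scores[i][j]
                + bias_1d.get(d_seq, 0.0)
                + bias_x.get(d_x, 0.0)
                + bias_y.get(d_y, 0.0)
            )
        biased.append(row)
    return biased


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two tokens on the same text line, 10 units apart.
raw = [[0.0, 0.0], [0.0, 0.0]]
biased = add_spatial_bias(
    raw,
    positions=[0, 1],
    boxes=[(0, 0), (10, 0)],
    bias_1d={1: 0.5, -1: -0.5},
    bias_x={10: 1.0, -10: -1.0},
    bias_y={0: 0.25},
)
weights = [softmax(row) for row in biased]
```

In this toy setup the first token attends more strongly to its right-hand neighbour than to itself, purely because of the layout biases, since the raw scores are all zero.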

Venue: ACL 2021

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Key information extraction | CORD | LayoutLMv2-BASE | F1 | 94.95 | #4 |
| Key information extraction | CORD | LayoutLMv2-LARGE | F1 | 96.01 | #3 |
| Visual Question Answering (VQA) | DocVQA test | LayoutLMv2 | ANLS | 0.867 | #4 |
| Semantic entity labeling | FUNSD | LayoutLMv2-BASE | F1 | 82.76 | #8 |
| Semantic entity labeling | FUNSD | LayoutLMv2-LARGE | F1 | 84.2 | #7 |
| Key information extraction | Kleister NDA | LayoutLMv2-BASE | F1 | 83.3 | #2 |
| Key information extraction | Kleister NDA | LayoutLMv2-LARGE | F1 | 85.2 | #1 |
| Document Image Classification | RVL-CDIP | LayoutLMv2-LARGE | Accuracy | 95.64 | #4 |
| Document Image Classification | RVL-CDIP | LayoutLMv2-BASE | Accuracy | 95.25 | #8 |
| Document Image Classification | RVL-CDIP | LayoutLMv2-BASE | Parameters | 200M | #20 |
| Key information extraction | SROIE | LayoutLMv2-LARGE (excluding OCR mismatch) | F1 | 97.81 | #1 |
| Key information extraction | SROIE | LayoutLMv2-BASE | F1 | 96.25 | #3 |
| Key information extraction | SROIE | LayoutLMv2-LARGE | F1 | 96.61 | #2 |
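
The DocVQA result is reported as ANLS (Average Normalized Levenshtein Similarity). The sketch below shows one common way to compute it, assuming the usual definition: for each question, take the best similarity between the prediction and any reference answer, where similarity is one minus the length-normalized edit distance, zeroed out when the normalized distance reaches a threshold of 0.5. The function names, the lowercasing and whitespace stripping, and the example strings are assumptions for illustration, not the official evaluation script.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings
    references:  list of lists of acceptable answers, one list per question
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers that are too far off score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# Hypothetical predictions: one exact match, one unrelated answer.
example = anls(["Invoice", "xyz"], [["invoice"], ["abc"]])
```

The threshold keeps near misses, such as small OCR errors, from being scored as completely wrong while still giving no credit to unrelated answers.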

Methods