LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Multimodal pre-training with text, layout, and image has recently achieved state-of-the-art (SOTA) performance on visually-rich document understanding tasks, demonstrating the great potential of joint learning across modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding that aims to bridge language barriers in visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce XFUND, a multilingual form understanding benchmark dataset that includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), with key-value pairs manually labeled for each language. Experimental results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Key-value Pair Extraction | RFUND-EN | LayoutXLM_base | key-value pair F1 | 53.98 | #9 |
Document Image Classification | RVL-CDIP | LayoutXLM | Accuracy | 95.21% | #13 |
Key-value Pair Extraction | SIBR | LayoutXLM | key-value pair F1 | 70.45 | #6 |
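
The "key-value pair F1" metric in the table is not defined on this page. As a rough illustration only, the sketch below assumes a common convention: a predicted (key, value) pair counts as correct only if it exactly matches a gold pair, and F1 is the harmonic mean of precision and recall over those pairs. The function name `pair_f1` and the sample pairs are hypothetical and are not taken from the LayoutXLM, RFUND-EN, or SIBR evaluation code, which may normalize or match pairs differently.

```python
from typing import Iterable, Set, Tuple

# A key-value pair is represented as (key_text, value_text).
# This representation is an assumption made for illustration.
Pair = Tuple[str, str]


def pair_f1(predicted: Iterable[Pair], gold: Iterable[Pair]) -> float:
    """Exact-match F1 over sets of (key, value) pairs."""
    pred_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)

    # With no predictions or no gold pairs, F1 is defined here as 0.
    if not pred_set or not gold_set:
        return 0.0

    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0

    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical form: three gold pairs, four predictions, two of them correct.
gold_pairs = [("Name", "Alice"), ("Date", "2021-04-18"), ("Total", "42")]
pred_pairs = [("Name", "Alice"), ("Date", "2021-04-18"), ("Total", "24"), ("Tax", "3")]

print(f"{pair_f1(pred_pairs, gold_pairs):.3f}")  # prints 0.571
```

In this example precision is 2/4 and recall is 2/3, giving F1 = 4/7. Scores such as 53.98 in the table would correspond to this value multiplied by 100, assuming the benchmark reports percentages.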