TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	DocVQA test	UDOP (aux)	ANLS	0.878	# 10
Visual Question Answering (VQA)	DocVQA test	UDOP	ANLS	0.847	# 16
Visual Question Answering (VQA)	InfographicVQA	UDOP (aux)	ANLS	63.0	# 5
Visual Question Answering (VQA)	InfographicVQA	UDOP	ANLS	47.4	# 15

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unifying-vision-text-and-layout-for-universal/visual-question-answering-vqa-on)](https://paperswithcode.com/sota/visual-question-answering-vqa-on?p=unifying-vision-text-and-layout-for-universal)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/unifying-vision-text-and-layout-for-universal/visual-question-answering-on-docvqa-test)](https://paperswithcode.com/sota/visual-question-answering-on-docvqa-test?p=unifying-vision-text-and-layout-for-universal)`

Unifying Vision, Text, and Layout for Universal Document Processing

CVPR 2023 · Zineng Tang, ZiYi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained both on large-scale unlabeled document corpora, using innovative self-supervised objectives, and on diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state of the art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark.

Code

microsoft/i-code (official, 1,639 stars)

microsoft/udop (official, 230 stars)

Tasks

Document AI

document understanding

Image Reconstruction

Visual Question Answering (VQA)

Datasets

FUNSD

DocVQA

RVL-CDIP

TabFact

CORD

InfographicVQA

Results from the Paper

Ranked #5 on Visual Question Answering (VQA) on InfographicVQA (using extra training data)
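
The results in this listing are reported as ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how such a score can be computed, here is a minimal sketch, not the official evaluation script. It assumes the commonly used threshold of 0.5 and case-insensitive, whitespace-trimmed matching; the function names `levenshtein` and `anls` are hypothetical and chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best thresholded similarity.

    predictions: list of predicted answer strings, one per question.
    references:  list of lists of accepted ground-truth answers.
    threshold:   assumed cutoff; normalized distances at or above it score 0.
    """
    scores = []
    for pred, answers in zip(predictions, references):
        pred_norm = pred.strip().lower()
        best = 0.0
        for answer in answers:
            answer_norm = answer.strip().lower()
            longest = max(len(pred_norm), len(answer_norm))
            distance = levenshtein(pred_norm, answer_norm) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    preds = ["helo", "xyz"]
    refs = [["hello", "hi"], ["hello"]]
    print(anls(preds, refs))
```

This sketch returns a value between 0 and 1. The DocVQA rows (0.878, 0.847) appear to use that scale, while the InfographicVQA rows (63.0, 47.4) appear to be the same kind of score expressed as a percentage.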

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
