Multimodal pre-training with text, layout, and image has recently achieved state-of-the-art performance on visually rich document understanding tasks, demonstrating the great potential of joint learning across different modalities.
Our empirical analysis shows that our diffusion-based approach performs comparably to or outperforms previous methods for layout generation across various document datasets.
Despite several successes in document understanding, the practical task of long document understanding remains largely under-explored due to computational challenges and the difficulty of efficiently absorbing long multimodal inputs.
We study the problem of recognizing structured text, i.e., text that follows certain formats, and propose to improve the recognition accuracy of structured text by specifying regular expressions (regexes) for biasing.
Text recognition is a long-standing research problem in document digitization.
Ranked #3 on Handwritten Text Recognition on IAM
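The regex-biasing idea above can be illustrated as a simple rescoring of recognition hypotheses. This is a minimal sketch, not the paper's actual method: the function name, the `(text, log_score)` hypothesis format, and the additive-bonus scheme are all assumptions for illustration.

```python
import re

def rescore_with_regex(hypotheses, pattern, bonus=2.0):
    """Boost the score of recognition hypotheses that match a format regex.

    `hypotheses` is a list of (text, log_score) pairs, e.g. from a
    beam-search decoder; `pattern` describes the expected structure,
    such as a date. The additive bonus is an illustrative choice.
    """
    compiled = re.compile(pattern)
    rescored = []
    for text, score in hypotheses:
        if compiled.fullmatch(text):
            score += bonus  # bias toward format-conforming strings
        rescored.append((text, score))
    # Return the highest-scoring hypothesis after biasing.
    return max(rescored, key=lambda pair: pair[1])[0]

# Example: a date field where the raw recognizer confuses 'O' and '0'.
candidates = [("2O21-03-15", -1.0), ("2021-03-15", -1.5)]
best = rescore_with_regex(candidates, r"\d{4}-\d{2}-\d{2}")
# best == "2021-03-15": the regex bonus outweighs the small score gap
```

In this toy example the visually likelier but malformed string loses to the slightly lower-scored candidate that satisfies the date regex, which is the intended effect of format biasing.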
In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding.
Ranked #13 on Document Image Classification on RVL-CDIP
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents.
Ranked #1 on Key Information Extraction on SROIE
Due to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts absolute accuracy on the TextVQA dataset by +5.4% compared with a non-TAP baseline.
It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video.
no code implementations • 1 Mar 2020 • David Pickup, Xianfang Sun, Paul L. Rosin, Ralph R. Martin, Z Cheng, Zhouhui Lian, Masaki Aono, A. Ben Hamza, A Bronstein, M Bronstein, S Bu, Umberto Castellani, S Cheng, Valeria Garro, Andrea Giachetti, Afzal Godil, Luca Isaia, J. Han, Henry Johan, L Lai, Bo Li, C. Li, Haisheng Li, Roee Litman, X. Liu, Z Liu, Yijuan Lu, L. Sun, G Tam, Atsushi Tatsuma, J. Ye
In addition, further participants have also taken part, and we provide extra analysis of the retrieval results.
RGB-Thermal (RGB-T) object tracking has received increasing attention because thermal information strongly complements visible data.