no code implementations • 13 Apr 2023 • Yihao Ding, Siwen Luo, Hyunsuk Chung, Soyeon Caren Han
Document-based Visual Question Answering examines the understanding of document images conditioned on natural language questions.
no code implementations • 16 Dec 2022 • Feiqi Cao, Siwen Luo, Felipe Nunez, Zean Wen, Josiah Poon, Caren Han
To explicitly teach the relations between the two modalities, we propose and integrate two attention modules: a scene graph-based semantic relation-aware attention and a positional relation-aware attention.
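As a minimal sketch (not the paper's released code), the snippet below assumes precomputed object features plus pairwise bias matrices derived from scene-graph edges and from relative box positions, and shows how relation-aware attention could fold such biases into standard scaled dot-product attention; all names and shapes are illustrative.

# Minimal sketch (not the paper's code): relation-aware attention modules.
# Assumptions: object features are (N, d) tensors; a semantic bias matrix from
# scene-graph edges and a positional bias from box geometry are precomputed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Scaled dot-product attention biased by a pairwise relation matrix."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, relation_bias):
        # x: (N, dim) features; relation_bias: (N, N) scores from either
        # scene-graph edges (semantic) or relative box positions (positional).
        attn = (self.q(x) @ self.k(x).T) * self.scale + relation_bias
        return F.softmax(attn, dim=-1) @ self.v(x)

# Hypothetical usage: combine the two relation-aware views of the same features.
feats = torch.randn(36, 256)             # 36 detected objects
semantic_bias = torch.randn(36, 36)      # from scene-graph edges
positional_bias = torch.randn(36, 36)    # from relative box positions
semantic_attn = RelationAwareAttention(256)
positional_attn = RelationAwareAttention(256)
fused = semantic_attn(feats, semantic_bias) + positional_attn(feats, positional_bias)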
no code implementations • 29 Nov 2022 • Zhihao Zhang, Siwen Luo, Junyi Chen, Sijia Lai, Siqu Long, Hyunsuk Chung, Soyeon Caren Han
We propose PiggyBack, a Visual Question Answering platform that allows users to easily apply state-of-the-art visual-language pretrained models.
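PiggyBack itself is a platform, not reproduced here; as an illustration of the kind of pretrained visual-language model such a platform wraps, the following uses the publicly available ViLT VQA checkpoint from Hugging Face transformers (this is not the PiggyBack code).

# Illustration only: querying a pretrained visual-language VQA model.
from PIL import Image
import requests
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are in the picture?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])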
1 code implementation • COLING 2022 • Siwen Luo, Yihao Ding, Siqu Long, Josiah Poon, Soyeon Caren Han
Recognizing the layout of unstructured digital documents is crucial when parsing the documents into a structured, machine-readable format for downstream applications.
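As a minimal, hypothetical sketch of that downstream step (not the paper's model), the following orders detected layout regions, whose boxes and labels are assumed to come from an upstream layout detector, into a simple reading order and emits a machine-readable record.

# Minimal sketch: turning detected layout regions into structured output.
from dataclasses import dataclass

@dataclass
class Region:
    label: str   # e.g. "title", "paragraph", "table", "figure"
    box: tuple   # (x0, y0, x1, y1) in page coordinates
    text: str

def to_structured(regions):
    # Simple top-to-bottom, left-to-right reading order, one dict per region.
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    return [{"type": r.label, "bbox": r.box, "text": r.text} for r in ordered]

page = [
    Region("paragraph", (50, 200, 550, 400), "Body text..."),
    Region("title", (50, 80, 550, 140), "Document Layout Analysis"),
]
print(to_structured(page))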
no code implementations • 20 Mar 2021 • Siwen Luo, Hamish Ivison, Caren Han, Josiah Poon
As the use of deep learning techniques has grown across various fields over the past decade, complaints about the opaqueness of black-box models have increased, resulting in a greater focus on transparency in deep learning models.
no code implementations • 20 Feb 2021 • Siwen Luo, Mengting Wu, Yiwen Gong, Wanying Zhou, Josiah Poon
The main contributions of this paper are the Financial Documents dataset with table-area annotations, a detection model with superior performance, and a rule-based layout segmentation technique for extracting tabular data from PDF files.
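The paper's own pipeline is not reproduced here; as a rough illustration of rule-based tabular extraction from a PDF, the following uses pdfplumber with ruling-line based settings (the file name is hypothetical).

# Illustration only: rule-based table extraction with pdfplumber.
import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:   # hypothetical file
    page = pdf.pages[0]
    # Rule: use ruling lines to locate cell boundaries.
    tables = page.extract_tables({"vertical_strategy": "lines",
                                  "horizontal_strategy": "lines"})
    for table in tables:
        for row in table:
            print(row)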
1 code implementation • COLING 2020 • Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
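VICTR's own scene-graph parsing step is not reproduced here; as a rough stand-in, the following pulls subject-relation-object triples out of a text prompt with spaCy's dependency parse (assumes the en_core_web_sm model is installed).

# Rough stand-in for scene-graph parsing of a text prompt (not VICTR itself).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A brown dog is chasing a red ball on the grass")

triples = []
for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "pobj", "attr")]
        for s in subjects:
            for o in objects:
                triples.append((s.text, token.lemma_, o.text))

print(triples)   # e.g. [('dog', 'chase', 'ball')]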
Ranked #24 on Text-to-Image Generation on COCO (Inception score metric)
1 code implementation • 27 Jul 2020 • Siwen Luo, Soyeon Caren Han, Kaiyuan Sun, Josiah Poon
Visual question answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of both images and questions, but also a sound step-by-step reasoning process that leads to the correct answer.