1 code implementation • 28 Mar 2024 • Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang
Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions.
no code implementations • CVPR 2023 • Zhibo Yang, Rujiao Long, Pengfei Wang, Sibo Song, Humen Zhong, Wenqing Cheng, Xiang Bai, Cong Yao
As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging in that they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common.
2 code implementations • CVPR 2022 • Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang Bai, Cong Yao
In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language.
no code implementations • 22 Aug 2018 • Sibo Song, Ngai-Man Cheung, Vijay Chandrasekhar, Bappaditya Mandal
Specifically, using frame-level features, DATP regresses importance of different temporal segments and generates weights for them.
no code implementations • 6 Aug 2018 • Sibo Song, Yueru Chen, Ngai-Man Cheung, C. -C. Jay Kuo
Therefore, we propose a Saak transform based preprocessing method with three steps: 1) transforming an input image to a joint spatial-spectral representation via the forward Saak transform, 2) apply filtering to its high-frequency components, and, 3) reconstructing the image via the inverse Saak transform.
1 code implementation • 17 Jun 2017 • Zhe Wang, Kingsley Kuan, Mathieu Ravaut, Gaurav Manek, Sibo Song, Yuan Fang, Seokhwan Kim, Nancy Chen, Luis Fernando D'Haro, Luu Anh Tuan, Hongyuan Zhu, Zeng Zeng, Ngai Man Cheung, Georgios Piliouras, Jie Lin, Vijay Chandrasekhar
Beyond that, we extend the original competition by including text information in the classification, making this a truly multi-modal approach with vision, audio and text.
no code implementations • 8 Jan 2017 • Yiren Zhou, Sibo Song, Ngai-Man Cheung
Image blur and image noise are common distortions during image acquisition.
no code implementations • 25 Jan 2016 • Sibo Song, Ngai-Man Cheung, Vijay Chandrasekhar, Bappaditya Mandal, Jie Lin
With the increasing availability of wearable devices, research on egocentric activity recognition has received much attention recently.