no code implementations • 22 May 2025 • Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm.
no code implementations • 27 Dec 2024 • Enze Xie, Jiaho Lyu, Daiqing Wu, Huawen Shen, Yu Zhou
Specifically, leveraging some existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module.
1 code implementation • 17 Dec 2024 • Yan Zhang, Gangyan Zeng, Huawen Shen, Daiqing Wu, Yu Zhou, Can Ma
Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning over textual and visual information in a given video.
no code implementations • 9 Jul 2024 • Daiqing Wu, Dongbao Yang, Huawen Shen, Can Ma, Yu Zhou
In the semantics completion module, we complement image and text representations with the semantics of the OCR text embedded in the image, helping bridge the sentiment gap.
no code implementations • 24 Mar 2022 • Chengyang Fang, Gangyan Zeng, Yu Zhou, Daiqing Wu, Can Ma, Dayong Hu, Weiping Wang
Texts in scene images convey critical information for scene understanding and reasoning.
Optical Character Recognition (OCR)