no code implementations • 21 May 2025 • Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang
Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos.
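The excerpt above names the two design pressures for video VLMs but not a mechanism. As a generic, minimal sketch (not this paper's architecture; all module names and dimensions are illustrative), cross-frame dependencies are often captured by running self-attention over per-frame features:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over per-frame features to capture cross-frame dependencies."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim), one pooled feature per frame
        attended, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return self.norm(frame_feats + attended)  # residual + norm

# Usage: 16 sampled frames, each encoded to a 768-d feature
feats = torch.randn(2, 16, 768)
out = TemporalAttention()(feats)  # (2, 16, 768)
```

Balancing short and long videos is then typically a matter of how many frames are sampled and pooled before this step, a choice the paper itself would specify.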
no code implementations • 14 Apr 2025 • Lingyu Kong, Nima Shoghi, Guoxiang Hu, Pan Li, Victor Fung
Conversely, this also results in high data requirements for these models, hindering their application to data-sparse problems, which are common in this domain.
1 code implementation • 23 Jan 2025 • Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu
The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation.
1 code implementation • 9 Jan 2025 • Zuyao You, Lingyu Kong, Lingchen Meng, Zuxuan Wu
Foreground segmentation is a fundamental task in computer vision, encompassing a variety of sub-tasks.
Ranked #1 on Salient Object Detection on PASCAL-S
1 code implementation • 3 Sep 2024 • Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang
As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks.
1 code implementation • 23 May 2024 • Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang
Modern LVLMs still struggle with fine-grained document understanding, such as OCR, translation, or captioning of user-specified regions of interest, and with tasks that require the context of an entire page or even multiple pages.
1 code implementation • 15 Apr 2024 • Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang
To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information.
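To make "structural extraction" concrete: the goal is to turn a chart image into a machine-readable record of its components. The schema below is invented purely for illustration (OneChart's actual output format is defined in its paper and repository):

```python
# Hypothetical target record for a bar chart; field names are illustrative,
# not OneChart's actual schema.
chart_record = {
    "title": "Quarterly revenue",
    "x_axis": ["Q1", "Q2", "Q3", "Q4"],
    "series": {"Product A": [1.2, 1.5, 1.4, 1.9]},
}
```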
no code implementations • 23 Jan 2024 • Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang
In Vary-toy, we introduce an improved vision vocabulary, allowing the model not only to retain all of Vary's features but also to exhibit greater generality.
Ranked #213 on Visual Question Answering on MM-Vet
1 code implementation • 11 Dec 2023 • Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang
Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs.
Ranked #148 on Visual Question Answering on MM-Vet
no code implementations • 30 Nov 2023 • En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them.
Ranked #164 on Visual Question Answering on MM-Vet