Search Results for author: Lingyu Kong

Found 10 papers, 6 papers with code

Clapper: Compact Learning and Video Representation in VLMs

no code implementations • 21 May 2025 • Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang

Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos.
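
As a rough illustration of the token-budget problem this raises (not the Clapper method itself), the sketch below pools frame tokens along the temporal axis so long videos do not overwhelm the language model's context; the shapes and the pooling choice are assumptions for illustration.

```python
# Minimal sketch, assuming a generic video VLM front end: keep the visual
# token count bounded while preserving coarse temporal structure.
import torch
import torch.nn.functional as F

def compress_video_tokens(frame_tokens: torch.Tensor, max_frames: int = 16) -> torch.Tensor:
    """frame_tokens: (T, N, D) patch tokens for T frames.

    Short videos (T <= max_frames) pass through unchanged; long videos are
    average-pooled along time so the LLM always sees at most max_frames
    frame slots.
    """
    T, N, D = frame_tokens.shape
    if T <= max_frames:
        return frame_tokens
    # (T, N, D) -> (1, N*D, T) so we can 1-D pool over the temporal axis.
    x = frame_tokens.permute(1, 2, 0).reshape(1, N * D, T)
    x = F.adaptive_avg_pool1d(x, max_frames)          # (1, N*D, max_frames)
    return x.reshape(N, D, max_frames).permute(2, 0, 1)

# Toy usage: a "long" 64-frame clip with 196 patch tokens of width 1024 each.
tokens = torch.randn(64, 196, 1024)
print(compress_video_tokens(tokens).shape)  # torch.Size([16, 196, 1024])
```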

Video Understanding

MatterTune: An Integrated, User-Friendly Platform for Fine-Tuning Atomistic Foundation Models to Accelerate Materials Simulation and Discovery

no code implementations • 14 Apr 2025 • Lingyu Kong, Nima Shoghi, Guoxiang Hu, Pan Li, Victor Fung

Conversely, this also results in high data requirements for these models, hindering their application to data-sparse problems, which are common in this domain.

Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

1 code implementation • 23 Jan 2025 • Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu

The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation.
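
To make the setting concrete, here is a hedged sketch of what a pixel-level captioning sample could look like in memory, with each panoptic segment paired to its own caption; the field names and values are illustrative assumptions, not the actual Pix2Cap-COCO schema.

```python
# Illustrative sketch only: a plausible in-memory layout for pixel-level
# captioning data, where every segment carries its own fine-grained caption.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class SegmentCaption:
    segment_id: int
    category: str
    mask: np.ndarray          # boolean (H, W) mask for this segment
    caption: str              # description of just this region

@dataclass
class PixelCaptionSample:
    image_path: str
    segments: List[SegmentCaption] = field(default_factory=list)

# Toy example: two segments in a 4x6 image, each with its own caption.
h, w = 4, 6
sample = PixelCaptionSample(
    image_path="example_image.jpg",  # hypothetical file name
    segments=[
        SegmentCaption(1, "bear", np.zeros((h, w), bool), "a brown bear lying on grass"),
        SegmentCaption(2, "grass", np.ones((h, w), bool), "a patch of short green grass"),
    ],
)
print(len(sample.segments), sample.segments[0].caption)
```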

Panoptic Segmentation Text Generation

Focus Anywhere for Fine-grained Multi-page Document Understanding

1 code implementation • 23 May 2024 • Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR, translation, or captioning for regions of interest specified by the user, tasks that require the context of the entire page or even multiple pages.
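
For intuition, a minimal sketch of how a region-focused instruction might be serialized for such a model is shown below; the prompt template and the normalized-box convention are assumptions for illustration, not the format used in the paper.

```python
# Sketch of building a "focus on this region" instruction for a document LVLM.
# The template and the [0, 1000) box normalization are assumptions, borrowed
# from common grounding-style prompts.
def focus_prompt(task: str, box, page_index: int = 0) -> str:
    """Build an instruction asking the model to work only on one region."""
    x1, y1, x2, y2 = box
    region = f"<box>({x1},{y1}),({x2},{y2})</box>"
    tasks = {
        "ocr": f"OCR the text inside {region} on page {page_index}.",
        "translate": f"Translate the text inside {region} on page {page_index} into English.",
        "caption": f"Describe the content inside {region} on page {page_index}.",
    }
    return tasks[task]

print(focus_prompt("ocr", (120, 80, 760, 240)))
```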

Document Understanding Optical Character Recognition (OCR)

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

1 code implementation • 15 Apr 2024 • Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information.
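
A minimal sketch of the auxiliary-token idea as described in the title: a small head decodes the hidden state of one special token into the chart's numeric values so the textual extraction can be cross-checked. The head architecture and dimensions below are assumptions, not the released OneChart code.

```python
# Sketch, assuming a decoder hidden size of 1024 and at most 32 chart values:
# an auxiliary head regresses numbers from the auxiliary token's hidden state.
import torch
import torch.nn as nn

class AuxNumberHead(nn.Module):
    def __init__(self, hidden_dim: int = 1024, max_values: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, max_values),  # one regressed number per slot
        )

    def forward(self, aux_hidden: torch.Tensor) -> torch.Tensor:
        # aux_hidden: (batch, hidden_dim) hidden state of the auxiliary token.
        return self.mlp(aux_hidden)

# Toy usage: pretend these hidden states came out of a chart model's decoder.
head = AuxNumberHead()
aux_hidden = torch.randn(2, 1024)
values = head(aux_hidden)         # (2, 32) predicted chart values
print(values.shape)
```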

Decoder

Small Language Model Meets with Reinforced Vision Vocabulary

no code implementations • 23 Jan 2024 • Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang

In Vary-toy, we introduce an improved vision vocabulary, allowing the model not only to retain all the features of Vary but also to gain greater generality.
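
The sketch below illustrates the general idea of pairing a new vision vocabulary with the original CLIP branch and projecting both into the language model's embedding space; all dimensions and the projection design are assumptions, not the Vary-toy implementation.

```python
# Sketch, assuming two vision encoders (CLIP plus a new "vision vocabulary")
# whose tokens are projected and concatenated before entering a small LLM.
import torch
import torch.nn as nn

class DualVocabProjector(nn.Module):
    def __init__(self, clip_dim: int = 1024, new_vocab_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, llm_dim)
        self.proj_new = nn.Linear(new_vocab_dim, llm_dim)

    def forward(self, clip_tokens: torch.Tensor, new_vocab_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (B, N1, clip_dim); new_vocab_tokens: (B, N2, new_vocab_dim)
        merged = torch.cat(
            [self.proj_clip(clip_tokens), self.proj_new(new_vocab_tokens)], dim=1
        )
        return merged  # (B, N1 + N2, llm_dim) visual tokens for the LLM

proj = DualVocabProjector()
out = proj(torch.randn(1, 256, 1024), torch.randn(1, 256, 768))
print(out.shape)  # torch.Size([1, 512, 2048])
```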

Language Modeling Language Modelling +5

Merlin: Empowering Multimodal LLMs with Foresight Minds

no code implementations • 30 Nov 2023 • En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao

Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them.
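
As a hedged illustration, a Foresight Instruction-Tuning (FIT) style sample might be laid out as below, with a trajectory-prediction turn followed by a future-reasoning turn; the keys and box format are assumptions, not Merlin's actual data schema.

```python
# Illustrative sketch of a two-stage instruction sample: first track an
# object across observed frames, then reason about what happens next.
fit_sample = {
    "frames": ["frame_00.jpg", "frame_01.jpg", "frame_02.jpg"],
    "conversation": [
        {
            "role": "user",
            "content": "Track the person in the red jacket and give their "
                       "bounding box in each frame.",
        },
        {
            "role": "assistant",
            "content": "[[120,40,180,220], [150,42,210,225], [185,45,245,230]]",
        },
        {
            "role": "user",
            "content": "Based on that trajectory, what is the person likely "
                       "to do next?",
        },
        {
            "role": "assistant",
            "content": "They are moving steadily toward the crosswalk, so "
                       "they will most likely cross the street.",
        },
    ],
}
print(len(fit_sample["conversation"]))
```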

Visual Question Answering
