Search Results for author: Can Huang

Found 24 papers, 19 papers with code

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

1 code implementation · 12 Dec 2024 · Han Wang, Yuxiang Nie, YongJie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang

The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field.

Computational Efficiency
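A note on the mechanism: the exact Dynamic-VLM compression scheme is in the paper; below is only a minimal baseline sketch, assuming adaptive average pooling of a clip's visual tokens down to a fixed budget (the shapes and the 1024-token budget are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def compress_visual_tokens(frame_tokens: torch.Tensor, budget: int) -> torch.Tensor:
        # frame_tokens: (num_tokens, dim) visual tokens for one video clip.
        # adaptive_avg_pool1d expects (batch, channels, length), so transpose first.
        x = frame_tokens.t().unsqueeze(0)        # (1, dim, num_tokens)
        x = F.adaptive_avg_pool1d(x, budget)     # (1, dim, budget)
        return x.squeeze(0).t()                  # (budget, dim)

    # e.g. 64 frames x 256 tokens each, pooled down to a 1024-token budget
    compressed = compress_visual_tokens(torch.randn(64 * 256, 1024), budget=1024)

Pooling trades fine-grained spatio-temporal detail for a bounded context length; dynamic schemes adjust that budget per video rather than fixing it.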

Grounding Natural Language to SQL Translation with Data-Based Self-Explanations

1 code implementation · 5 Nov 2024 · Yuankai Fan, Tonghui Ren, Can Huang, Zhenying He, X. Sean Wang

Natural Language Interfaces for Databases empower non-technical users to interact with data using natural language (NL).

Translation

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

1 code implementation · 15 Oct 2024 · Bin Shan, Xiang Fei, Wei Shi, An-Lan Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, Can Huang

The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs) due to their widespread applications.

Fairness · Scene Text Recognition +1

UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function

no code implementations · 27 Aug 2024 · Zhichao Wang, Bin Bi, Can Huang, Shiva Kumar Pentyala, Zixu James Zhu, Sitaram Asur, Na Claire Cheng

DPO proposes a mapping between an optimal policy and a reward, greatly simplifying the training process of RLHF.
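For context, the DPO correspondence mentioned above is standard: with reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$, the implicit reward and the resulting pairwise loss are

    r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

    \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

where $(y_w, y_l)$ are the preferred and dispreferred responses and the partition term $Z(x)$ cancels in the pairwise comparison. UNA's generalized implicit reward extends this correspondence to unify RLHF/PPO, DPO, and KTO; its exact form is given in the paper.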

ParGo: Bridging Vision-Language with Partial and Global Views

1 code implementation · 23 Aug 2024 · An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, Wei-Shi Zheng

This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs).

MME
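As a rough illustration of the partial-plus-global idea (a sketch under assumptions; the module shapes, window count, and shared attention below are not the paper's implementation): global learnable queries attend to all vision tokens, partial queries each attend to one local window, and both sets are concatenated for the LLM.

    import torch
    import torch.nn as nn

    class PartialGlobalProjector(nn.Module):
        # Toy projector: global queries attend to all vision tokens,
        # partial queries each attend to one local window of tokens.
        def __init__(self, dim=1024, n_global=16, n_windows=16, heads=8):
            super().__init__()
            self.global_q = nn.Parameter(torch.randn(n_global, dim))
            self.partial_q = nn.Parameter(torch.randn(n_windows, dim))
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.n_windows = n_windows

        def forward(self, vis):
            # vis: (batch, n_tokens, dim); n_tokens divisible by n_windows
            b, n, d = vis.shape
            gq = self.global_q.unsqueeze(0).expand(b, -1, -1)
            g, _ = self.attn(gq, vis, vis)                    # global view
            win = vis.reshape(b * self.n_windows, n // self.n_windows, d)
            pq = self.partial_q.unsqueeze(0).expand(b, -1, -1)
            pq = pq.reshape(b * self.n_windows, 1, d)
            p, _ = self.attn(pq, win, win)                    # partial views
            return torch.cat([g, p.reshape(b, self.n_windows, d)], dim=1)

    out = PartialGlobalProjector()(torch.randn(2, 256, 1024))  # (2, 32, 1024)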

Harmonizing Visual Text Comprehension and Generation

1 code implementation · 23 Jul 2024 · Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, Yuan Xie

Our work delineates the viability of an integrated approach to multimodal generation within the visual text domain, setting a foundation for subsequent inquiries.

multimodal generation · Reading Comprehension +1

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

1 code implementation · 2 Jul 2024 · Jinghui Lu, Haiyang Yu, Yanjie Wang, YongJie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks.

document understanding · Key Information Extraction +6
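The title's core idea can be sketched in a few lines; the placeholder token format and quantization below are illustrative assumptions (the paper's approach maps each box to a single learned embedding rather than a literal string):

    def interleave_layout_and_text(ocr_items, bins=1000):
        # Serialize OCR output as "<box_x1_y1_x2_y2> text" segments; quantizing
        # the coordinates lets each box cost one placeholder token.
        # ocr_items: list of (text, (x1, y1, x2, y2)) with coords in [0, 1].
        parts = []
        for text, (x1, y1, x2, y2) in ocr_items:
            q = [int(v * (bins - 1)) for v in (x1, y1, x2, y2)]
            parts.append(f"<box_{q[0]}_{q[1]}_{q[2]}_{q[3]}> {text}")
        return " ".join(parts)

    prompt = interleave_layout_and_text([
        ("Invoice No. 1234", (0.10, 0.05, 0.45, 0.09)),
        ("Total: $98.00",    (0.10, 0.80, 0.40, 0.85)),
    ])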

TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

1 code implementation · 3 Jun 2024 · Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, YongJie Ye, Hao Liu, Wengang Zhou, Houqiang Li, Can Huang

In this mechanism, the diverse visual table understanding (VTU) tasks involved and the multi-source visual embeddings are all abstracted as concepts.

Language Modelling · Question Answering +3

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

1 code implementation · 20 May 2024 · Jingqun Tang, Qi Liu, YongJie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang

Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding.

Benchmarking · Question Answering +4

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

no code implementations · 19 Apr 2024 · Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data.

Hallucination · Hallucination Evaluation +2

Elysium: Exploring Object-level Perception in Videos via MLLM

1 code implementation · 25 Mar 2024 · Han Wang, Yanjie Wang, YongJie Ye, Yuxiang Nie, Can Huang

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied.

Object · Referring Expression +4

Metasql: A Generate-then-Rank Framework for Natural Language to SQL Translation

1 code implementation · 27 Feb 2024 · Yuankai Fan, Zhenying He, Tonghui Ren, Can Huang, Yinan Jing, Kai Zhang, X. Sean Wang

While these translation models have greatly improved overall translation accuracy, surpassing 70% on NLIDB benchmarks, the use of auto-regressive decoding to generate a single SQL query may result in sub-optimal outputs, potentially leading to erroneous translations.

Learning-To-Rank · Translation
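In schematic form, a generate-then-rank pipeline looks like the sketch below; the generator and ranker interfaces are hypothetical stand-ins (Metasql's metadata-conditioned generation and learning-to-rank model are described in the paper):

    def generate_then_rank(question, schema, generator, ranker, n_candidates=8):
        # Sample several candidate SQL queries, then return the highest-scoring
        # one instead of trusting a single greedy autoregressive decode.
        candidates = [generator.sample(question, schema) for _ in range(n_candidates)]
        unique = list(dict.fromkeys(candidates))       # dedupe, keep order
        scores = [ranker.score(question, schema, sql) for sql in unique]
        return max(zip(scores, unique), key=lambda pair: pair[0])[1]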

GloTSFormer: Global Video Text Spotting Transformer

1 code implementation · 8 Jan 2024 · Han Wang, Yanjie Wang, Yang Li, Can Huang

In this paper, we propose a novel Global Video Text Spotting Transformer (GloTSFormer) to model the tracking problem as global associations and utilize the Gaussian Wasserstein distance to guide the morphological correlation between frames.

Text Spotting
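The Gaussian Wasserstein distance referenced above has a closed form: modeling two boxes as 2-D Gaussians $\mathcal{N}(m_1, \Sigma_1)$ and $\mathcal{N}(m_2, \Sigma_2)$,

    W_2^2 = \lVert m_1 - m_2 \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\right)

For axis-aligned boxes, one common parameterization (an assumption here, not necessarily GloTSFormer's) takes $m = (c_x, c_y)$ and $\Sigma = \operatorname{diag}(w^2/4,\, h^2/4)$; the covariances then commute and the trace term reduces to $\frac{1}{4}\big((w_1 - w_2)^2 + (h_1 - h_2)^2\big)$.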

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

1 code implementation · CVPR 2024 · Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Can Huang, Hao Liu, Xin Tan, Zhizhong Zhang, Yuan Xie

A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios.

Diversity · In-Context Learning +1

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding

no code implementations · 20 Nov 2023 · Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wengang Zhou, Houqiang Li, Can Huang

This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2,560$\times$2,560 resolution.

document understanding · Language Modeling +3
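To make "frequency domain" concrete, a minimal sketch (the blockwise DCT below, as in JPEG, is one standard transform; DocPedia's actual pipeline and parameters are in the paper):

    import numpy as np
    from scipy.fft import dctn

    def blockwise_dct(image, block=8):
        # 2-D DCT over non-overlapping blocks of a grayscale page image, so
        # high-resolution pixels are re-expressed as frequency coefficients.
        h, w = image.shape
        h, w = h - h % block, w - w % block        # crop to a block multiple
        out = np.empty((h, w))
        for i in range(0, h, block):
            for j in range(0, w, block):
                out[i:i+block, j:j+block] = dctn(
                    image[i:i+block, j:j+block], norm="ortho")
        return out

    coeffs = blockwise_dct(np.random.rand(2560, 2560))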

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

3 code implementations · ICCV 2023 · Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin

To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.

Decoder · Text Detection +1

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

no code implementations · 19 Aug 2023 · Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang

However, existing advanced algorithms fall short of effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored.

Instruction Following · Text Detection +1

SPTS v2: Single-Point Scene Text Spotting

3 code implementations · 4 Jan 2023 · Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, Lianwen Jin

Within the context of our SPTS v2 framework, our experiments suggest a potential preference for single-point representation in scene text spotting when compared to other representations.

Decoder · Text Detection +1

Knowing Where and What: Unified Word Block Pretraining for Document Understanding

1 code implementation · 28 Jul 2022 · Song Tao, Zijian Wang, Tiantian Fan, Canjie Luo, Can Huang

In this paper, we focus on the embedding learning of word blocks containing text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training.

Contrastive Learning · document understanding +4
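A LayoutLM-style sketch of word-block embedding, i.e. summing what (text) and where (layout); UTel's exact embedding scheme and pretraining objectives differ, so the sizes and structure below are assumptions:

    import torch
    import torch.nn as nn

    class WordBlockEmbedding(nn.Module):
        # Each word block = its text embedding plus embeddings of the four
        # quantized bounding-box coordinates.
        def __init__(self, vocab=30522, dim=768, bins=1024):
            super().__init__()
            self.text = nn.Embedding(vocab, dim)
            self.coord = nn.ModuleList(nn.Embedding(bins, dim) for _ in range(4))

        def forward(self, token_ids, boxes):
            # token_ids: (batch, seq); boxes: (batch, seq, 4) ints in [0, bins)
            emb = self.text(token_ids)
            for k in range(4):
                emb = emb + self.coord[k](boxes[..., k])
            return emb

    emb = WordBlockEmbedding()(torch.randint(0, 30522, (2, 16)),
                               torch.randint(0, 1024, (2, 16, 4)))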

M^3VSNet: Unsupervised Multi-metric Multi-view Stereo Network

1 code implementation · 30 Apr 2020 · Baichuan Huang, Hongwei Yi, Can Huang, Yijia He, Jingbin Liu, Xiao Liu

To improve the robustness and completeness of point cloud reconstruction, we propose a novel multi-metric loss function that combines pixel-wise and feature-wise losses to learn the inherent constraints from different perspectives of matching correspondences.

Point cloud reconstruction
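The combination of pixel-wise and feature-wise terms can be sketched as below; the warping step, the feature extractor, and the 1.0/0.1 weights are assumptions, not M^3VSNet's published settings:

    import torch.nn.functional as F

    def multi_metric_loss(ref_img, warped_img, ref_feat, warped_feat,
                          pixel_weight=1.0, feat_weight=0.1):
        # Pixel-wise photometric term between the reference view and a source
        # view warped into it, plus a feature-wise consistency term on deep
        # features of the same pair (channel dim = 1).
        pixel_loss = F.l1_loss(warped_img, ref_img)
        feat_loss = 1.0 - F.cosine_similarity(warped_feat, ref_feat, dim=1).mean()
        return pixel_weight * pixel_loss + feat_weight * feat_loss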

