Exploring the Capabilities of Large Multimodal Models on Dense Text

1 code implementation • 9 May 2024 • Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai

To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs.

Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition

1 code implementation • 21 Feb 2024 • Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai

By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance.

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

1 code implementation • CVPR 2024 • Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai

Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats.

On the Hidden Mystery of OCR in Large Multimodal Models

1 code implementation • 13 May 2023 • Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, XuCheng Yin, Cheng-Lin Liu, Lianwen Jin, Xiang Bai

In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER).

Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data

no code implementations • 10 Feb 2023 • Zhijian Li, Biao Yang, Penghang Yin, Yingyong Qi, Jack Xin

In this paper, we propose a feature affinity (FA) assisted knowledge distillation (KD) method to improve quantization-aware training of deep neural networks (DNN).

Searching Intrinsic Dimensions of Vision Transformers

no code implementations • 16 Apr 2022 • Fanghui Xue, Biao Yang, Yingyong Qi, Jack Xin

It has been shown by many researchers that transformers perform as well as convolutional neural networks in many computer vision tasks.

TPPO: A Novel Trajectory Predictor with Pseudo Oracle

no code implementations • 4 Feb 2020 • Biao Yang, Caizhen He, Pin Wang, Ching-Yao Chan, Xiaofeng Liu, Yang Chen

A latent variable predictor is proposed to estimate latent variable distributions from observed and ground-truth trajectories.

A Novel Graph based Trajectory Predictor with Pseudo Oracle

no code implementations • 2 Feb 2020 • Biao Yang, Guocheng Yan, Pin Wang, Ching-Yao Chan, Xiang Song, Yang Chen

Recent studies focus on modeling pedestrians' motion patterns with recurrent neural networks, capturing social interactions with pooling-based or graph-based methods, and handling future uncertainties by using random Gaussian noise as the latent variable.

