Search Results for author: Can Huang

Found 12 papers, 10 papers with code

Elysium: Exploring Object-level Perception in Videos via MLLM

1 code implementation • 25 Mar 2024 • Han Wang, Yanjie Wang, YongJie Ye, Yuxiang Nie, Can Huang

To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG).

Object • Referring Expression +4

Metasql: A Generate-then-Rank Framework for Natural Language to SQL Translation

1 code implementation • 27 Feb 2024 • Yuankai Fan, Zhenying He, Tonghui Ren, Can Huang, Yinan Jing, Kai Zhang, X. Sean Wang

While these translation models have greatly improved overall translation accuracy, surpassing 70% on NLIDB benchmarks, using auto-regressive decoding to generate a single SQL query may produce sub-optimal outputs, potentially leading to erroneous translations.

Learning-To-Rank • Translation
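The generate-then-rank idea above can be sketched in a few lines: instead of committing to one auto-regressively decoded query, produce several candidates and pick the best under a ranking model. This is a minimal illustration, not Metasql's actual pipeline; `generate_candidates` and `score` are hypothetical stand-ins for the paper's translation and learning-to-rank models.

```python
def generate_candidates(question):
    # Stand-in for a seq2seq NL-to-SQL model returning multiple
    # candidate queries (e.g. via beam search or metadata conditioning).
    return [
        "SELECT name FROM city WHERE population > 1000000",
        "SELECT name FROM city ORDER BY population DESC",
    ]

def score(question, sql):
    # Stand-in for a learned ranker; here a toy heuristic rewarding
    # candidates that mention tokens from the question.
    return sum(tok in sql.lower() for tok in question.lower().split())

def translate(question):
    # Generate-then-rank: enumerate candidates, return the top-ranked SQL.
    candidates = generate_candidates(question)
    return max(candidates, key=lambda sql: score(question, sql))

print(translate("Which city names have population over 1000000?"))
```

With a real ranker, the key design choice is that ranking sees each candidate query as a whole, so global properties of the SQL can be scored that token-by-token decoding cannot anticipate.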

GloTSFormer: Global Video Text Spotting Transformer

1 code implementation • 8 Jan 2024 • Han Wang, Yanjie Wang, Yang Li, Can Huang

In this paper, we propose a novel Global Video Text Spotting Transformer, GloTSFormer, which models the tracking problem as global associations and uses the Gaussian Wasserstein distance to guide the morphological correlation between frames.

Text Spotting
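The Gaussian Wasserstein distance mentioned above has a standard closed form between two Gaussians, which is a common way to compare boxes by treating each as a 2D Gaussian. The sketch below shows that closed form; whether GloTSFormer uses exactly this box-to-Gaussian conversion is an assumption here, and `box_to_gaussian` is illustrative.

```python
import numpy as np

def sqrtm_psd(A):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_wasserstein2(mu1, S1, mu2, S2):
    # Squared 2-Wasserstein distance between N(mu1, S1) and N(mu2, S2):
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    d2 = np.sum((mu1 - mu2) ** 2)
    rS2 = sqrtm_psd(S2)
    cross = sqrtm_psd(rS2 @ S1 @ rS2)
    return d2 + np.trace(S1 + S2 - 2.0 * cross)

def box_to_gaussian(cx, cy, w, h):
    # Illustrative conversion: mean at the box center, diagonal covariance
    # from the half-width and half-height.
    return np.array([cx, cy], float), np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2])

mu_a, S_a = box_to_gaussian(50, 50, 40, 20)
mu_b, S_b = box_to_gaussian(55, 52, 40, 20)
print(gaussian_wasserstein2(mu_a, S_a, mu_b, S_b))
```

Unlike IoU, this distance stays informative when boxes do not overlap at all, which matters when associating text instances across frames with large motion.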

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

1 code implementation • 22 Nov 2023 • Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Can Huang, Hao Liu, Xin Tan, Zhizhong Zhang, Yuan Xie

A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios.

In-Context Learning • Scene Text Recognition

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding

no code implementations • 20 Nov 2023 • Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, Can Huang

This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2,560$\times$2,560 resolution.

document understanding • Language Modelling +2
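Processing document images in the frequency domain typically means transforming pixel patches into transform coefficients before the model sees them. The sketch below shows a 2D DCT-II on an image patch; whether DocPedia uses exactly this JPEG-style DCT is an assumption here, and the patch is synthetic.

```python
import numpy as np

def dct2(block):
    # 2D DCT-II of a square block, computed as C @ X @ C.T where C is the
    # orthonormal DCT transform matrix.
    n = block.shape[0]
    k = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)  # DC row uses the 1/sqrt(n) normalization
    return C @ block @ C.T

patch = np.ones((8, 8))       # a flat 8x8 grayscale patch
coeffs = dct2(patch)
print(coeffs[0, 0])           # for a flat patch, all energy is in the DC term
```

Because most of a document image's energy concentrates in a few low-frequency coefficients, a frequency-domain input lets a model cover high resolutions without a proportional growth in tokens.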

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

3 code implementations • ICCV 2023 • Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin

To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.

Text Detection • Text Spotting

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

no code implementations • 19 Aug 2023 • Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang

However, existing advanced algorithms fail to fully exploit the immense representation capabilities and rich world knowledge inherent in these large pre-trained models, and the beneficial connections among tasks in text-rich scenarios have not been sufficiently explored.

Instruction Following • Text Detection +1

SPTS v2: Single-Point Scene Text Spotting

3 code implementations • 4 Jan 2023 • Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, Lianwen Jin

Within the context of our SPTS v2 framework, our experiments suggest a potential preference for single-point representation in scene text spotting when compared to other representations.

Text Detection • Text Spotting
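The single-point representation above replaces box or polygon annotations with one indicative point per text instance. A minimal sketch of deriving such a point from a polygon annotation, using the vertex centroid; the paper's exact point definition may differ:

```python
def polygon_to_point(polygon):
    # Reduce a text polygon (list of (x, y) vertices) to one point:
    # here, the centroid of the vertices.
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

quad = [(10, 20), (110, 20), (110, 60), (10, 60)]
print(polygon_to_point(quad))  # → (60.0, 40.0)
```

The appeal of a single point is annotation cost and output simplicity: the spotter only has to predict one coordinate pair per instance rather than a full geometric outline.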

Knowing Where and What: Unified Word Block Pretraining for Document Understanding

1 code implementation • 28 Jul 2022 • Song Tao, Zijian Wang, Tiantian Fan, Canjie Luo, Can Huang

In this paper, we focus on embedding learning for word blocks containing both text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training.

Contrastive Learning • document understanding +3
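A word-block embedding of the kind described above combines a text embedding with a layout embedding derived from the block's bounding box. The sketch below follows the common LayoutLM-style recipe of summing token and quantized-coordinate embeddings; the dimensions, lookup tables, and `word_block_embedding` helper are all illustrative, not UTel's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
vocab = {"invoice": 0, "total": 1, "date": 2}
token_table = rng.normal(size=(len(vocab), DIM))   # learned in practice
coord_table = rng.normal(size=(1000, DIM))         # one row per quantized coordinate

def word_block_embedding(word, box):
    # box = (x0, y0, x1, y1) in a 0..999 quantized page coordinate system.
    text_emb = token_table[vocab[word]]
    layout_emb = sum(coord_table[c] for c in box)
    return text_emb + layout_emb

emb = word_block_embedding("total", (120, 40, 260, 70))
print(emb.shape)  # (16,)
```

Summing the two sources gives the model a single vector per word block in which "what the text says" and "where it sits on the page" are jointly available to later transformer layers.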

M^3VSNet: Unsupervised Multi-metric Multi-view Stereo Network

1 code implementation • 30 Apr 2020 • Baichuan Huang, Hongwei Yi, Can Huang, Yijia He, Jingbin Liu, Xiao Liu

To improve the robustness and completeness of point cloud reconstruction, we propose a novel multi-metric loss function that combines pixel-wise and feature-wise losses to learn the inherent constraints from different perspectives of matching correspondences.

Point cloud reconstruction
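The multi-metric loss described above can be sketched as a weighted sum of a pixel-wise photometric term and a feature-wise term computed on feature maps. The weights and the plain L1 terms here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def pixel_loss(img_a, img_b):
    # Mean absolute photometric difference between warped and reference images.
    return np.mean(np.abs(img_a - img_b))

def feature_loss(feat_a, feat_b):
    # Mean absolute difference between feature maps (e.g. from a frozen CNN).
    return np.mean(np.abs(feat_a - feat_b))

def multi_metric_loss(img_a, img_b, feat_a, feat_b, w_pix=0.8, w_feat=0.2):
    # Weighted combination of the two metrics; weights are illustrative.
    return w_pix * pixel_loss(img_a, img_b) + w_feat * feature_loss(feat_a, feat_b)

ref = np.zeros((4, 4))
warped = np.full((4, 4), 0.5)
feat_ref = np.zeros((2, 2, 8))
feat_warped = np.full((2, 2, 8), 0.25)
print(multi_metric_loss(warped, ref, feat_warped, feat_ref))
```

The feature-wise term is what makes such a loss robust where raw photometric matching fails, e.g. on textureless or specular surfaces where pixel intensities are uninformative.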
