1 code implementation • 25 Mar 2024 • Han Wang, Yanjie Wang, YongJie Ye, Yuxiang Nie, Can Huang
To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG).
Ranked #1 on Zero-Shot Single Object Tracking on LaSOT
1 code implementation • 27 Feb 2024 • Yuankai Fan, Zhenying He, Tonghui Ren, Can Huang, Yinan Jing, Kai Zhang, X. Sean Wang
While these translation models have greatly improved the overall translation accuracy, surpassing 70% on NLIDB benchmarks, the use of auto-regressive decoding to generate single SQL queries may result in sub-optimal outputs, potentially leading to erroneous translations.
1 code implementation • 7 Feb 2024 • Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang
In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs).
1 code implementation • 8 Jan 2024 • Han Wang, Yanjie Wang, Yang Li, Can Huang
In this paper, we propose a novel Global Video Text Spotting Transformer GloTSFormer to model the tracking problem as global associations and utilize the Gaussian Wasserstein distance to guide the morphological correlation between frames.
1 code implementation • 22 Nov 2023 • Zhen Zhao, Jingqun Tang, Chunhui Lin, Binghong Wu, Can Huang, Hao liu, Xin Tan, Zhizhong Zhang, Yuan Xie
A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios.
no code implementations • 20 Nov 2023 • Hao Feng, Qi Liu, Hao liu, Wengang Zhou, Houqiang Li, Can Huang
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2, 560$\times$2, 560 resolution.
3 code implementations • ICCV 2023 • Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin
To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
no code implementations • 19 Aug 2023 • Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang
However, existing advanced algorithms are limited to effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored.
3 code implementations • 4 Jan 2023 • Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, Lianwen Jin
Within the context of our SPTS v2 framework, our experiments suggest a potential preference for single-point representation in scene text spotting when compared to other representations.
Ranked #15 on Text Spotting on ICDAR 2015
1 code implementation • 28 Jul 2022 • Song Tao, Zijian Wang, Tiantian Fan, Canjie Luo, Can Huang
In this paper, we focus on the embedding learning of word blocks containing text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training.
1 code implementation • 30 Apr 2020 • Baichuan Huang, Hongwei Yi, Can Huang, Yijia He, Jingbin Liu, Xiao Liu
To improve the robustness and completeness of point cloud reconstruction, we propose a novel multi-metric loss function that combines pixel-wise and feature-wise loss function to learn the inherent constraints from different perspectives of matching correspondences.
1 code implementation • 21 Apr 2020 • Baichuan Huang, Hongwei Yi, Can Huang, Yijia He, Jingbin Liu, Xiao Liu
To improve the robustness and completeness of point cloud reconstruction, we propose a novel multi-metric loss function that combines pixel-wise and feature-wise loss function to learn the inherent constraints from different perspectives of matching correspondences.