1 code implementation • ECCV 2020 • Yukang Wang, Wei Zhou, Tao Jiang, Xiang Bai, Yongchao Xu
In this paper, different from previous methods performing knowledge distillation for densely pairwise relations, we propose a novel intra-class feature variation distillation (IFVD) to transfer the intra-class feature variation (IFV) of the cumbersome model (teacher) to the compact model (student).
1 code implementation • 24 Jan 2025 • Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction.
1 code implementation • 20 Jan 2025 • Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai
Compressed Sensing (CS) is a well-proved theory that drives many recent breakthroughs in these applications.
no code implementations • 2 Jan 2025 • Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories.
1 code implementation • 31 Dec 2024 • Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently.
1 code implementation • 27 Dec 2024 • Xingyu Jiang, Jiangwei Ren, Zizhuo Li, Xin Zhou, Dingkang Liang, Xiang Bai
Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data.
1 code implementation • 5 Dec 2024 • Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language.
1 code implementation • 15 Nov 2024 • Hao Wang, Minghui Liao, Zhouyi Xie, Wenyu Liu, Xiang Bai
To address this issue, we propose a Ranking MIL (RankMIL) approach to adaptively filter those noisy samples.
1 code implementation • 31 Oct 2024 • Zhenbiao Cao, Yuanlei Zheng, Zhihao Fan, Xiaojin Zhang, Wei Chen, Xiang Bai
Text-to-SQL generation aims to translate natural language questions into SQL statements.
2 code implementations • 23 Oct 2024 • Linger Deng, Yuliang Liu, Bohan Li, Dongliang Luo, Liang Wu, Chengquan Zhang, Pengyuan Lyu, Ziyang Zhang, Gang Zhang, Errui Ding, Yingying Zhu, Xiang Bai
Current geometric data generation approaches, which apply preset templates to generate geometric data or use Large Language Models (LLMs) to rephrase questions and answers (Q&A), unavoidably limit data accuracy and diversity.
1 code implementation • 21 Oct 2024 • Yuxuan Cai, Jiangning Zhang, Haoyang He, Xinwei He, Ao Tong, Zhenye Gan, Chengjie Wang, Xiang Bai
The success of Large Language Models (LLM) has led researchers to explore Multimodal Large Language Models (MLLM) for unified visual and linguistic understanding.
1 code implementation • 15 Oct 2024 • Bin Shan, Xiang Fei, Wei Shi, An-Lan Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, Can Huang
The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs) due to their widespread applications.
1 code implementation • 10 Oct 2024 • Dingkang Liang, Tianrui Feng, Xin Zhou, Yumeng Zhang, Zhikang Zou, Xiang Bai
PointGST freezes the pre-trained model and introduces a lightweight, trainable Point Cloud Spectral Adapter (PCSA) to fine-tune parameters in the spectral domain.
Ranked #1 on 3D Point Cloud Classification on ModelNet40 (using extra training data)
3D Parameter-Efficient Fine-Tuning for Classification
3D Point Cloud Classification
1 code implementation • 8 Oct 2024 • Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai
In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
1 code implementation • 1 Sep 2024 • Dingyuan Zhang, Dingkang Liang, Zichang Tan, Xiaoqing Ye, Cheng Zhang, Jingdong Wang, Xiang Bai
Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors to tasks with high real-time requirements like autonomous driving.
no code implementations • 15 Aug 2024 • eiyao Zhao, Zhengshuo Li, Jiahui Zhang, Xiang Bai, Jia Su
To address these problems, this paper first establishes an operational model considering gas pipeline dynamic characteristics under uncertain leakage failures for the NGS and then presents a stochastic IEGS real-time economic dispatch (RTED) model considering both uncertainty propagation and pipeline leakage uncertainty.
no code implementations • 14 Aug 2024 • Tingfeng Huang, Yuxuan Cheng, Jingbo Xia, Rui Yu, Yuxuan Cai, Jinhai Xiang, Xinwei He, Xiang Bai
The reconstruction branch is simply a plain reconstruction network that learns to reconstruct normal samples, while the auxiliary branch aims to produce attention masks to guide the noise perturbation process for normal samples from easy to hard.
1 code implementation • 4 Aug 2024 • Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai
To address this issue, we introduce a Complementary Image Pyramid (CIP), a simple, effective, and plug-and-play solution designed to mitigate semantic discontinuity during high-resolution image processing.
1 code implementation • 31 Jul 2024 • Xudong Xie, Yuzhe Li, Yang Liu, Zhifei Zhang, Zhaowen Wang, Wei Xiong, Xiang Bai
One challenge of the task is that the local stroke shapes of artistic text are changeable with diversity and complexity.
1 code implementation • 25 Jul 2024 • Zhe Liu, Jinghua Hou, Xinyu Wang, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai
To tackle this problem, we simply introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features rather than blindly increasing the number of scanning orders for voxel features.
Ranked #1 on 3D Object Detection on Waymo Open Dataset
1 code implementation • 23 Jul 2024 • Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, Xiang Bai
We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images.
1 code implementation • 15 Jul 2024 • Zhe Liu, Jinghua Hou, Xiaoqing Ye, Tong Wang, Jingdong Wang, Xiang Bai
We argue that the main challenges are twofold: 1) How to obtain the appropriate object queries is challenging due to the high sparsity and uneven distribution of point clouds; 2) How to implement an effective query interaction by exploiting the rich geometric structure of point clouds is not fully explored.
1 code implementation • 15 Jul 2024 • Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai
Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection.
1 code implementation • 3 Jul 2024 • Wei Xu, Chunsheng Shi, Sifan Tu, Xin Zhou, Dingkang Liang, Xiang Bai
We propose UniSeg3D, a unified 3D scene understanding framework that achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary segmentation tasks within a single model.
1 code implementation • 1 Jul 2024 • Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai
Specifically, we observe that objects in aerial images usually exhibit arbitrary orientations, small scales, and aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC).
1 code implementation • 7 Jun 2024 • Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai
The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency.
1 code implementation • 5 Jun 2024 • Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu
Oracle Bone Inscriptions are among the oldest existing forms of writing in the world.
1 code implementation • 2 Jun 2024 • Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu
Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems.
1 code implementation • 21 May 2024 • Hiba Maryam, Ling Fu, Jiajun Song, Tajrian ABM Shafayet, Qidi Luo, Xiang Bai, Yuliang Liu
The development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies is crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content, facilitating better understanding and interaction with Urdu-language visual data.
1 code implementation • 20 May 2024 • Jingqun Tang, Qi Liu, YongJie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao liu, Xiang Bai, Can Huang
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding.
1 code implementation • 19 May 2024 • Fadila Wendigoundi Douamba, Jianjun Song, Ling Fu, Yuliang Liu, Xiang Bai
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
1 code implementation • 9 May 2024 • Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai
To further explore the capabilities of LMM in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs.
1 code implementation • 30 Apr 2024 • Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai
Typically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters.
no code implementations • 19 Apr 2024 • Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao liu, Yuan Xie, Xiang Bai, Can Huang
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data.
4 code implementations • CVPR 2024 • Mingxin Huang, Hongliang Li, Yuliang Liu, Xiang Bai, Lianwen Jin
Subsequently, we introduce a Bridge that connects the locked detector and recognizer through a zero-initialized neural network.
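The effect of zero initialization in such a bridge can be illustrated with a toy sketch. It assumes, purely for illustration, that the bridge adds a linear correction to the locked detector's feature; the snippet above does not specify the actual architecture, and all names here (`zero_init_linear`, `bridge`, `detector_feature`) are hypothetical. With all-zero weights the recognizer initially receives the detector feature unchanged, and the correction only takes effect once training updates the weights.

```python
def zero_init_linear(dim):
    """Return a (weight, bias) pair for a dim x dim linear layer, all zeros."""
    weight = [[0.0] * dim for _ in range(dim)]
    bias = [0.0] * dim
    return weight, bias


def apply_linear(weight, bias, x):
    """Compute weight @ x + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def bridge(detector_feature, weight, bias):
    """Assumed residual form: feature plus a learned linear correction."""
    correction = apply_linear(weight, bias, detector_feature)
    return [f + c for f, c in zip(detector_feature, correction)]


# Feature produced by the locked (frozen) detector; values are made up.
detector_feature = [0.5, -1.0, 2.0]
weight, bias = zero_init_linear(len(detector_feature))

# At initialization the bridge is an identity mapping.
at_init = bridge(detector_feature, weight, bias)

# Simulate one training update to a single weight.
weight[0][0] = 0.1
after_update = bridge(detector_feature, weight, bias)
```

Starting from an identity mapping means the pretrained detector and recognizer are not disturbed at the first training step, which is the usual motivation for zero-initialized connectors.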
1 code implementation • 4 Apr 2024 • Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Fan Wang, Xiang Bai
Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video.
1 code implementation • 28 Mar 2024 • Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang
Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions.
1 code implementation • 21 Mar 2024 • Zheng Zhang, Yeyao Ma, Enming Zhang, Xiang Bai
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges.
Ranked #2 on Referring Expression Segmentation on RefCoCo val (using extra training data)
no code implementations • 14 Mar 2024 • Yuxuan Cai, Xinwei He, Dingkang Liang, Ao Tong, Xiang Bai
Recently, large vision and language models have shown their success when adapting them to many downstream tasks.
1 code implementation • 7 Mar 2024 • Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
1 code implementation • CVPR 2024 • Xin Zhou, Dingkang Liang, Wei Xu, Xingkui Zhu, Yihan Xu, Zhikang Zou, Xiang Bai
To achieve this goal, we freeze the parameters of the default pre-trained models and then propose the Dynamic Adapter, which generates a dynamic scale for each token, considering the token significance to the downstream task.
3D Parameter-Efficient Fine-Tuning for Classification
Transfer Learning
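The Dynamic Adapter idea described in this entry, a per-token scale applied to an adapter update on top of a frozen model, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the bottleneck adapter (down-projection, ReLU, up-projection), the ReLU-gated scalar score, and every name (`dynamic_adapter`, `w_down`, `w_up`, `w_scale`) are hypothetical choices made here.

```python
def linear(weight, x):
    """Compute weight @ x for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x):
    return [max(0.0, v) for v in x]


def dynamic_adapter(tokens, w_down, w_up, w_scale):
    """Add a scaled adapter update to each token.

    The scale is computed per token from the token itself (assumed form:
    ReLU of a dot product), so tokens scored as unimportant get scale 0
    and pass through unchanged.
    """
    outputs = []
    for token in tokens:
        update = linear(w_up, relu(linear(w_down, token)))
        scale = max(0.0, sum(w * v for w, v in zip(w_scale, token)))
        outputs.append([t + scale * u for t, u in zip(token, update)])
    return outputs


# Two 2-dimensional tokens with a 1-dimensional bottleneck; values are made up.
tokens = [[1.0, 2.0], [-1.0, 0.5]]
w_down = [[1.0, 0.0]]        # 2 -> 1
w_up = [[1.0], [1.0]]        # 1 -> 2
w_scale = [1.0, 0.0]         # per-token significance score

adapted = dynamic_adapter(tokens, w_down, w_up, w_scale)
```

In this sketch only `w_down`, `w_up`, and `w_scale` would be trainable; the backbone producing `tokens` stays frozen, which is what makes the scheme parameter-efficient.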
no code implementations • 24 Feb 2024 • Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
2 code implementations • 21 Feb 2024 • Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance.
1 code implementation • 16 Feb 2024 • Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, Xiang Bai
Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs.
no code implementations • 31 Jan 2024 • Wei Chen, Hengxu Lin, Qun Zhang, Xiaojin Zhang, Xiang Bai, Xuanjing Huang, Zhongyu Wei
Emotional Support Conversation aims at reducing the seeker's emotional distress through supportive response.
2 code implementations • 27 Jan 2024 • Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Jinpeng Wan, Haisu Guan, Zhebin Kuang, Lianwen Jin, Xiang Bai, Yuliang Liu
Oracle bone script, one of the earliest known forms of ancient Chinese writing, presents invaluable research materials for scholars studying the humanities and geography of the Shang Dynasty, dating back 3,000 years.
no code implementations • 27 Jan 2024 • Kaixin Xiong, Dingyuan Zhang, Dingkang Liang, Zhe Liu, Hongcheng Yang, Wondimu Dikubab, Jianwei Cheng, Xiang Bai
Monocular 3D Object Detection is an essential task for autonomous driving.
no code implementations • 23 Jan 2024 • Haisu Guan, Jinpeng Wan, Yuliang Liu, Pengjie Wang, Kaile Zhang, Zhebin Kuang, Xinyu Wang, Xiang Bai, Lianwen Jin
We conducted validation and simulated deciphering on the constructed dataset, and the results demonstrate its high efficacy in aiding the study of oracle bone script.
no code implementations • 15 Jan 2024 • Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin
In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition.
1 code implementation • CVPR 2024 • Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang
Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions.
no code implementations • 21 Dec 2023 • Linger Deng, Mingxin Huang, Xudong Xie, Yuliang Liu, Lianwen Jin, Xiang Bai
We demonstrate the accuracy of the generated polygons through extensive experiments: 1) By creating polygons from ground truth points, we achieved an accuracy of 82.0% on ICDAR 2015; 2) In training detectors with polygons generated by our method, we attained 86% of the accuracy relative to training with ground truth (GT); 3) Additionally, the proposed Point2Polygon can be seamlessly integrated to empower single-point spotters to generate polygons.
no code implementations • 16 Dec 2023 • Wei Chen, Gang Zhao, Xiaojin Zhang, Xiang Bai, Xuanjing Huang, Zhongyu Wei
Automatic psychological counseling requires a large amount of professional knowledge, which can be found in online counseling forums.
1 code implementation • CVPR 2024 • Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.
Ranked #1 on Referring Video Object Segmentation on Refer-YouTube-VOS (using extra training data)
Long-tail Video Object Segmentation
Multi-Object Tracking
1 code implementation • 12 Dec 2023 • Dongliang Luo, Yuliang Liu, Rui Yang, Xianjin Liu, Jishen Zeng, Yu Zhou, Xiang Bai
With the surge in realistic text tampering, detecting fraudulent text in images has gained prominence for maintaining information security.
1 code implementation • 28 Nov 2023 • Ling Fu, Zijie Wu, Yingying Zhu, Yuliang Liu, Xiang Bai
We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background.
1 code implementation • CVPR 2024 • Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai
Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats.
Ranked #13 on MMR total on MRR-Benchmark (using extra training data)
1 code implementation • 23 Oct 2023 • Wei Chen, Qiushi Wang, Zefei Long, Xianyin Zhang, Zhongtian Lu, Bingxuan Li, Siyuan Wang, Jiarong Xu, Xiang Bai, Xuanjing Huang, Zhongyu Wei
We propose Multiple Experts Fine-tuning Framework to build a financial large language model (LLM), DISC-FinLLM.
no code implementations • 12 Oct 2023 • Zijie Wu, Chaohui Yu, Zhen Zhu, Fan Wang, Xiang Bai
To utilize the abundant visual priors in off-the-shelf T2I models, a series of methods try to invert an image into a proper embedding that aligns with the semantic space of the T2I model.
1 code implementation • 11 Oct 2023 • Yuxuan Cai, Dingkang Liang, Dongliang Luo, Xinwei He, Xin Yang, Xiang Bai
To alleviate this issue, we present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies across different anomaly detection benchmarks.
no code implementations • 5 Sep 2023 • Xin Zhou, Jinghua Hou, Tingting Yao, Dingkang Liang, Zhe Liu, Zhikang Zou, Xiaoqing Ye, Jianwei Cheng, Xiang Bai
3D object detection is an essential task for achieving autonomous driving.
1 code implementation • 21 Aug 2023 • Wenwen Yu, Yuliang Liu, Xingkui Zhu, Haoyu Cao, Xing Sun, Xiang Bai
Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 5.5% for text detection and spotting tasks, respectively.
no code implementations • 21 Aug 2023 • Zhuang Liu, Ye Yuan, Zhilong Ji, Jingfeng Bai, Xiang Bai
Then we design a semantic aware module (SAM), which projects the visual and classification feature into semantic space.
3 code implementations • ICCV 2023 • Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin
To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
1 code implementation • IEEE Transactions on Image Processing 2023 • Cairong Zhao, Zefan Qu, Xinyang Jiang, Yuanpeng Tu, Xiang Bai
To address these challenges, we propose a novel Content-Adaptive Auto-Occlusion Network (CAAO), that is able to dynamically select the proper occlusion region of an image based on its content and the current training status.
2 code implementations • 8 Jun 2023 • Zelin Liu, Xinggang Wang, Cheng Wang, Wenyu Liu, Xiang Bai
By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack.
Ranked #6 on Multi-Object Tracking on MOT20 (using extra training data)
1 code implementation • 6 Jun 2023 • Wenwen Yu, MingYu Liu, Biao Yang, Enming Zhang, Deqiang Jiang, Xing Sun, Yuliang Liu, Xiang Bai
Text recognition in the wild is a long-standing problem in computer vision.
no code implementations • 5 Jun 2023 • Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, MingYu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, Yuning Du, Shikun Feng, Xiaoguang Hu, Pengyuan Lyu, Kun Yao, Yuechen Yu, Yuliang Liu, Wanxiang Che, Errui Ding, Cheng-Lin Liu, Jiebo Luo, Shuicheng Yan, Min Zhang, Dimosthenis Karatzas, Xing Sun, Jingdong Wang, Xiang Bai
It is hoped that this competition will attract many researchers in the field of CV and NLP, and bring some new thoughts to the field of Document AI.
1 code implementation • 4 Jun 2023 • Dingyuan Zhang, Dingkang Liang, Hongcheng Yang, Zhikang Zou, Xiaoqing Ye, Zhe Liu, Xiang Bai
In the spirit of unleashing the capability of foundation models on vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, has been proposed recently and presents strong zero-shot ability on many downstream 2D tasks.
1 code implementation • 13 May 2023 • Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, XuCheng Yin, Cheng-Lin Liu, Lianwen Jin, Xiang Bai
In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER).
1 code implementation • 12 May 2023 • Zhe Liu, Xiaoqing Ye, Zhikang Zou, Xinwei He, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai
Extensive experiments on the nuScenes dataset demonstrate that our method is much more stable in dealing with challenging cases such as asynchronous sensors, misaligned sensor placement, and degenerated camera images than existing fusion methods.
Ranked #48 on 3D Object Detection on nuScenes
1 code implementation • 12 May 2023 • Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Xiang Bai
We evaluate the existing end-to-end methods for VIE on the proposed dataset and observe that the performance of these methods has a distinguishable drop from SROIE (a widely used English dataset) to our proposed dataset due to the larger variance of layout and entities.
1 code implementation • 5 May 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai
Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i.e., visual representation, while text is omnipresent in human environments and frequently critical to understanding video.
no code implementations • 24 Apr 2023 • Wenwen Yu, MingYu Liu, Mingrui Chen, Ning Lu, Yinlong Wen, Yuliang Liu, Dimosthenis Karatzas, Xiang Bai
To promote research in this area, we organized ICDAR 2023 competition on reading the seal title (ReST), which included two tasks: seal title text detection (Task 1) and end-to-end seal title recognition (Task 2).
no code implementations • 10 Apr 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Mike Zheng Shou, Umapada Pal, Dimosthenis Karatzas, Xiang Bai
In this competition report, we establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in the video with various scenarios.
1 code implementation • CVPR 2023 • Wei Hua, Dingkang Liang, Jingyu Li, Xiaolong Liu, Zhikang Zou, Xiaoqing Ye, Xiang Bai
Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for boosting object detectors, has become an active task in recent years.
2 code implementations • CVPR 2023 • Dingkang Liang, Jiahao Xie, Zhikang Zou, Xiaoqing Ye, Wei Xu, Xiang Bai
To the best of our knowledge, CrowdCLIP is the first to investigate the vision language knowledge to solve the counting problem.
Ranked #1 on Cross-Part Crowd Counting on ShanghaiTech B
no code implementations • CVPR 2023 • Zhibo Yang, Rujiao Long, Pengfei Wang, Sibo Song, Humen Zhong, Wenqing Cheng, Xiang Bai, Cong Yao
As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging in that they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common.
2 code implementations • CVPR 2023 • Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, Xiang Bai
In this paper, we address the problem of detecting 3D objects from multi-view images.
Ranked #9 on 3D Object Detection on nuScenes Camera Only
1 code implementation • CVPR 2023 • Qihao Liu, Junfeng Wu, Yi Jiang, Xiang Bai, Alan Yuille, Song Bai
A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement.
1 code implementation • CVPR 2023 • Wenwen Yu, Yuliang Liu, Wei Hua, Deqiang Jiang, Bo Ren, Xiang Bai
Recently, pretraining approaches based on vision language models have made effective progress in the field of text detection.
3 code implementations • CVPR 2023 • Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, Xiang Bai
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks.
Ranked #5 on Zero Shot Segmentation on Segmentation in the Wild
no code implementations • 4 Jan 2023 • Zhe Liu, Xiaoqing Ye, Xiao Tan, Errui Ding, Xiang Bai
In this paper, we propose a cross-modal distillation method named StereoDistill to narrow the gap between the stereo and LiDAR-based approaches via distilling the stereo detectors from the superior LiDAR model at the response level, which is usually overlooked in 3D object detection distillation.
3 code implementations • 4 Jan 2023 • Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chunhua Shen, Xiang Bai, Lianwen Jin
Within the context of our SPTS v2 framework, our experiments suggest a potential preference for single-point representation in scene text spotting when compared to other representations.
Ranked #15 on Text Spotting on ICDAR 2015
no code implementations • ICCV 2023 • Dingyuan Zhang, Dingkang Liang, Zhikang Zou, Jingyu Li, Xiaoqing Ye, Zhe Liu, Xiao Tan, Xiang Bai
Advanced 3D object detection methods usually rely on large-scale, elaborately labeled datasets to achieve good performance.
no code implementations • 18 Nov 2022 • Junfeng Wu, Yi Jiang, Qihao Liu, Xiang Bai, Song Bai
This technical report describes our 2nd-place solution for the ECCV 2022 YouTube-VIS Long Video Challenge.
1 code implementation • 12 Nov 2022 • Tianyi Shi, Xiaohuan Ding, Wei Zhou, Feng Pan, Zengqiang Yan, Xiang Bai, Xin Yang
Vessel segmentation is crucial in many medical image applications, such as detecting coronary stenoses, retinal vessel diseases and brain aneurysms.
no code implementations • 27 Sep 2022 • HUI ZHANG, Quanming Yao, James T. Kwok, Xiang Bai
We design a domain-specific search space by exploring principles for having good feature extractors.
Neural Architecture Search
Vocal Bursts Intensity Prediction
1 code implementation • 31 Jul 2022 • Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, Xiang Bai
Thirdly, we utilize Transformer to learn the global feature on image-level and model the global relationship of the corner points, with the assistance of a corner-query cross-attention mechanism.
no code implementations • 25 Jul 2022 • Jingqun Tang, Wenming Qian, Luchuan Song, Xiena Dong, Lan Li, Xiang Bai
Text detection and recognition are essential components of a modern OCR system.
3 code implementations • 23 Jul 2022 • Bohan Li, Ye Yuan, Dingkang Liang, Xiao Liu, Zhilong Ji, Jinfeng Bai, Wenyu Liu, Xiang Bai
Recently, most handwritten mathematical expression recognition (HMER) methods adopt the encoder-decoder networks, which directly predict the markup sequences from formula images with the attention mechanism.
2 code implementations • 21 Jul 2022 • Junfeng Wu, Qihao Liu, Yi Jiang, Song Bai, Alan Yuille, Xiang Bai
In recent years, video instance segmentation (VIS) has been largely advanced by offline models, while online models gradually attracted less attention possibly due to their inferior performance.
Ranked #14 on Video Instance Segmentation on YouTube-VIS 2021
1 code implementation • 11 Jul 2022 • Zijie Wu, Zhen Zhu, Junping Du, Xiang Bai
CCPL can preserve the coherence of the content source during style transfer without degrading stylization.
1 code implementation • 1 Jul 2022 • Mingkun Yang, Minghui Liao, Pu Lu, Jing Wang, Shenggao Zhu, Hualin Luo, Qi Tian, Xiang Bai
Inspired by the observation that humans learn to recognize the texts through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method.
2 code implementations • CVPR 2022 • Sibo Song, Jianqiang Wan, Zhibo Yang, Jun Tang, Wenqing Cheng, Xiang Bai, Cong Yao
In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language.
no code implementations • 16 Apr 2022 • Shi Gong, Xiaoqing Ye, Xiao Tan, Jingdong Wang, Errui Ding, Yu Zhou, Xiang Bai
Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving for its powerful spatial representation ability.
1 code implementation • CVPR 2022 • Xiaolong Liu, Song Bai, Xiang Bai
Rather than end-to-end learning, most existing methods adopt a head-only learning paradigm, where the video encoder is pre-trained for action classification, and only the detection head upon the encoder is optimized for TAD.
Ranked #19 on Temporal Action Localization on THUMOS’14
no code implementations • CVPR 2022 • Jingqun Tang, Wenqing Zhang, Hongye Liu, Mingkun Yang, Bo Jiang, Guanglong Hu, Xiang Bai
Different from previous approaches that learn robust deep representations of scene text in a holistic manner, our method performs scene text detection based on a few representative features, which avoids the disturbance by background and reduces the computational cost.
Ranked #24 on Object Detection In Aerial Images on DOTA (using extra training data)
1 code implementation • CVPR 2022 • Hao Wang, Junchao Liao, Tianheng Cheng, Zewen Gao, Hao liu, Bo Ren, Xiang Bai, Wenyu Liu
Recently, the semantics of scene text has been proven to be essential in fine-grained image classification.
no code implementations • 23 Mar 2022 • Wondimu Dikubab, Dingkang Liang, Minghui Liao, Xiang Bai
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages (e.g., Amharic, Tigrinya) in East Africa for more than 120 million people.
2 code implementations • CVPR 2022 • Ye Yuan, Xiao Liu, Wondimu Dikubab, Hui Liu, Zhilong Ji, Zhongqin Wu, Xiang Bai
In this paper, we propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
1 code implementation • 26 Feb 2022 • Dingkang Liang, Wei Xu, Xiang Bai
Crowd localization, predicting head positions, is a more practical and high-level task than simply counting.
5 code implementations • 21 Feb 2022 • Minghui Liao, Zhisheng Zou, Zhaoyi Wan, Cong Yao, Xiang Bai
By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.
Ranked #3 on Scene Text Detection on MSRA-TD500
2 code implementations • 29 Dec 2021 • Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, Xiang Bai
However, semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation processes pixels, while CLIP operates on whole images.
1 code implementation • 21 Dec 2021 • Zhe Liu, Tengteng Huang, Bingling Li, Xiwu Chen, Xi Wang, Xiang Bai
Recently, fusing the LiDAR point cloud and camera image to improve the performance and robustness of 3D object detection has received more and more attention, as these two modalities naturally possess strong complementarity.
2 code implementations • 15 Dec 2021 • Junfeng Wu, Yi Jiang, Song Bai, Wenqing Zhang, Xiang Bai
Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms should be applied to each frame independently.
Ranked #2 on Video Instance Segmentation on HQ-YTVIS
1 code implementation • 15 Dec 2021 • Dezhi Peng, Xinyu Wang, Yuliang Liu, Jiaxin Zhang, Mingxin Huang, Songxuan Lai, Shenggao Zhu, Jing Li, Dahua Lin, Chunhua Shen, Xiang Bai, Lianwen Jin
For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single-point for each instance.
Ranked #3 on Text Spotting on SCUT-CTW1500
1 code implementation • 9 Dec 2021 • Silin Cheng, Xiwu Chen, Xinwei He, Zhe Liu, Xiang Bai
Learning intra-region contexts and inter-region relations are two effective strategies to strengthen feature representations for point cloud analysis.
Ranked #45 on 3D Point Cloud Classification on ModelNet40
1 code implementation • 18 Nov 2021 • Xiang Bai, Hanchen Wang, Liya Ma, Yongchao Xu, Jiefeng Gan, Ziwei Fan, Fan Yang, Ke Ma, Jiehua Yang, Song Bai, Chang Shu, Xinyu Zou, Renhao Huang, Changzheng Zhang, Xiaowu Liu, Dandan Tu, Chuou Xu, Wenqing Zhang, Xi Wang, Anguo Chen, Yu Zeng, Dehua Yang, Ming-Wei Wang, Nagaraj Holalkere, Neil J. Halin, Ihab R. Kamel, Jia Wu, Xuehua Peng, Xiang Wang, Jianbo Shao, Pattanasak Mongkolwat, Jianjun Zhang, Weiyang Liu, Michael Roberts, Zhongzhao Teng, Lucian Beer, Lorena Escudero Sanchez, Evis Sala, Daniel Rubin, Adrian Weller, Joan Lasenby, Chuangsheng Zheng, Jianming Wang, Zhen Li, Carola-Bibiane Schönlieb, Tian Xia
Artificial intelligence (AI) provides a promising substitution for streamlining COVID-19 diagnoses.
no code implementations • 15 Nov 2021 • Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip H. S. Torr, Song Bai
To promote the development of occlusion understanding, we collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario.
1 code implementation • 9 Nov 2021 • Yuzhe Gao, Xing Li, Jiajian Zhang, Yu Zhou, Dian Jin, Jing Wang, Shenggao Zhu, Xiang Bai
We leverage a Siamese Complementary Module to fully exploit the continuity characteristic of the text instances in the temporal dimension, which effectively alleviates the missed detection of the text instances, and hence ensures the completeness of each text trajectory.
1 code implementation • NeurIPS 2021 • Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Stephen Lin, Han Hu, Xiang Bai
We introduce MixTraining, a new training paradigm for object detection that can improve the performance of existing detectors for free.
1 code implementation • 30 Aug 2021 • Gui-Song Xia, Jian Ding, Ming Qian, Nan Xue, Jiaming Han, Xiang Bai, Michael Ying Yang, Shengyang Li, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, Liangpei Zhang, Qiang Zhou, Chao-hui Yu, Kaixuan Hu, Yingjia Bu, Wenming Tan, Zhe Yang, Wei Li, Shang Liu, Jiaxuan Zhao, Tianzhi Ma, Zi-han Gao, Lingqi Wang, Yi Zuo, Licheng Jiao, Chang Meng, Hao Wang, Jiahao Wang, Yiming Hui, Zhuojun Dong, Jie Zhang, Qianyue Bao, Zixiao Zhang, Fang Liu
This report summarizes the results of the Learning to Understand Aerial Images (LUAI) 2021 challenge held at ICCV 2021, which focuses on object detection and semantic segmentation in aerial images.
5 code implementations • 25 Aug 2021 • Dong Wu, Manwen Liao, Weitian Zhang, Xinggang Wang, Xiang Bai, Wenqing Cheng, Wenyu Liu
A panoptic driving perception system is an essential part of autonomous driving.
Ranked #3 on Drivable Area Detection on BDD100K val
1 code implementation • 19 Jul 2021 • Dawei Du, Longyin Wen, Pengfei Zhu, Heng Fan, QinGhua Hu, Haibin Ling, Mubarak Shah, Junwen Pan, Ali Al-Ali, Amr Mohamed, Bakour Imene, Bin Dong, Binyu Zhang, Bouchali Hadia Nesma, Chenfeng Xu, Chenzhen Duan, Ciro Castiello, Corrado Mencar, Dingkang Liang, Florian Krüger, Gennaro Vessio, Giovanna Castellano, Jieru Wang, Junyu Gao, Khalid Abualsaud, Laihui Ding, Lei Zhao, Marco Cianciotta, Muhammad Saqib, Noor Almaadeed, Omar Elharrouss, Pei Lyu, Qi Wang, Shidong Liu, Shuang Qiu, Siyang Pan, Somaya Al-Maadeed, Sultan Daud Khan, Tamer Khattab, Tao Han, Thomas Golda, Wei Xu, Xiang Bai, Xiaoqing Xu, Xuelong Li, Yanyun Zhao, Ye Tian, Yingnan Lin, Yongchao Xu, Yuehan Yao, Zhenyu Xu, Zhijian Zhao, Zhipeng Luo, Zhiwei Wei, Zhiyuan Zhao
Crowd counting on the drone platform is an interesting topic in computer vision, which brings new challenges such as small object inference, background clutter and wide viewpoint.
no code implementations • CVPR 2021 • Jing Wang, Jinhui Tang, Mingkun Yang, Xiang Bai, Jiebo Luo
Under the guidance of the geometrical relationship between OCR tokens, our LSTM-R capitalizes on a newly-devised relation-aware pointer network to select OCR tokens from the scene text for OCR-based image captioning.
1 code implementation • 18 Jun 2021 • Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video.
Ranked #10 on Temporal Action Localization on HACS
8 code implementations • ICCV 2021 • Mengde Xu, Zheng Zhang, Han Hu, JianFeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, Zicheng Liu
This paper presents an end-to-end semi-supervised object detection approach, in contrast to previous more complex multi-stage methods.
Ranked #6 on Semi-Supervised Object Detection on COCO 100% labeled data (using extra training data)
1 code implementation • 19 Apr 2021 • Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, Xiang Bai
Current weakly-supervised counting methods adopt the CNN to regress a total count of the crowd by an image-to-count paradigm.
1 code implementation • CVPR 2021 • Hao Wang, Xiang Bai, Mingkun Yang, Shenggao Zhu, Jing Wang, Wenyu Liu
Such a task is usually realized by matching a query text to the recognized words, outputted by an end-to-end scene text spotter.
no code implementations • CVPR 2021 • Minghang He, Minghui Liao, Zhibo Yang, Humen Zhong, Jun Tang, Wenqing Cheng, Cong Yao, Yongpan Wang, Xiang Bai
Over the past few years, the field of scene text detection has progressed so rapidly that modern text detectors are able to hunt text in various challenging scenarios.
1 code implementation • 22 Mar 2021 • Zhen Zhu, Tengteng Huang, Mengde Xu, Baoguang Shi, Wenqing Cheng, Xiang Bai
This paper proposes a new generative adversarial network for pose transfer, i.e., transferring the pose of a given person to a target pose.
1 code implementation • 18 Mar 2021 • Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shjian Lu, C. V. Jawahar
In this competition, we set up three tasks, namely, Scanned Receipt Text Localisation (Task 1), Scanned Receipt OCR (Task 2) and Key Information Extraction from Scanned Receipts (Task 3).
Key Information Extraction
Optical Character Recognition (OCR)
2 code implementations • 24 Feb 2021 • Jian Ding, Nan Xue, Gui-Song Xia, Xiang Bai, Wen Yang, Michael Ying Yang, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, Liangpei Zhang
In this paper, we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI.
no code implementations • 23 Feb 2021 • Zhiliang Xu, Xiyu Yu, Zhibin Hong, Zhen Zhu, Junyu Han, Jingtuo Liu, Errui Ding, Xiang Bai
By simply employing some existing and easy-obtainable prior information, our method can control, transfer, and edit diverse attributes of faces in the wild.
Ranked #1 on Face Swapping on FaceForensics++ (FID metric)
2 code implementations • 2 Feb 2021 • Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip H. S. Torr, Song Bai
On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario.
Ranked #33 on Video Instance Segmentation on YouTube-VIS validation
1 code implementation • CVPR 2021 • Xiaolong Liu, Yao Hu, Song Bai, Fei Ding, Xiang Bai, Philip H. S. Torr
Current developments in temporal event or action localization usually target actions captured by a single camera.
Ranked #2 on Temporal Action Localization on MUSES
1 code implementation • 14 Dec 2020 • Yang Liu, Zhen Zhu, Xiang Bai
Visible watermarks are widely used in images to protect copyright ownership.
no code implementations • 9 Dec 2020 • Wenqing Zhang, Yang Qiu, Minghui Liao, Rui Zhang, Xiaolin Wei, Xiang Bai
It is a general labeling method for texts with various shapes and requires low labeling costs.
1 code implementation • 26 Sep 2020 • Wei Zhou, Yukang Wang, Jiajia Chu, Jiehua Yang, Xiang Bai, Yongchao Xu
Specifically, we perform domain adaptation on the affinity relationship between adjacent pixels termed affinity space of source and target domain.
no code implementations • 22 Jul 2020 • Wenqing Zhang, Yang Qiu, Song Bai, Rui Zhang, Xiaolin Wei, Xiang Bai
In this paper, we study how to make use of decentralized datasets for training a robust scene text recognizer while keeping them on local devices.
1 code implementation • ECCV 2020 • Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, Xiang Bai
Recent end-to-end trainable methods for scene text spotting, integrating detection and recognition, showed much progress.
Ranked #11 on Text Spotting on Total-Text
1 code implementation • ECCV 2020 • Tengteng Huang, Zhe Liu, Xiwu Chen, Xiang Bai
In this paper, we aim at addressing two critical issues in the 3D detection task, including the exploitation of multiple sensors (namely LiDAR point cloud and camera image), as well as the inconsistency between the localization and classification confidence.
no code implementations • 9 Jul 2020 • Changxu Cheng, Wuheng Xu, Xiang Bai, Bin Feng, Wenyu Liu
Chinese text recognition is more challenging than Latin text recognition due to the large number of fine-grained Chinese characters and the great imbalance over classes, which causes a serious overfitting problem.