no code implementations • 7 Jan 2025 • Haoning Xu, Zhaoqing Li, Zengrui Jin, Huimeng Wang, Youjun Chen, Guinan Li, Mengzhe Geng, Shujie Hu, Jiajun Deng, Xunying Liu
This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into one single model compression stage.
no code implementations • 2 Jan 2025 • Jiajun Deng, Tianyu He, Li Jiang, Tianyu Wang, Feras Dayoub, Ian Reid
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning.
no code implementations • 25 Dec 2024 • Shujie Hu, Xurong Xie, Mengzhe Geng, Jiajun Deng, Zengrui Jin, Tianzi Wang, Mingyu Cui, Guinan Li, Zhaoqing Li, Helen Meng, Xunying Liu
Experiments on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest structured speaker-deficiency adaptation of HuBERT and Wav2vec2-conformer models consistently outperforms baseline SFMs using either: a) no adapters; b) global adapters shared among all speakers; or c) single attribute adapters modelling speaker or deficiency labels alone by statistically significant WER reductions up to 3. 01% and 1. 50% absolute (10. 86% and 6. 94% relative) on the two tasks respectively.
no code implementations • 17 Dec 2024 • Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Houqiang Li, Yanyong Zhang
For the latter, we initially incorporate a radar-guided depth head to refine the transformation from image view to BEV.
no code implementations • 11 Dec 2024 • Bohan Li, Xin Jin, Jiajun Deng, Yasheng Sun, XiaoFeng Wang, Wenjun Zeng
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations.
3D Semantic Occupancy Prediction
LIDAR Semantic Segmentation
+2
1 code implementation • 13 Sep 2024 • Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable.
no code implementations • 20 Jul 2024 • Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Yao Li, Yanyong Zhang
To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view.
no code implementations • 8 Jul 2024 • Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu
The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 8 Jul 2024 • Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann
However, in the video domain, the existing setting, i. e., spatial-temporal video grounding (STVG), is formulated to only detect one pre-existing object in each frame, ignoring the fact that language descriptions can involve none or multiple entities within a video.
no code implementations • 3 Jul 2024 • Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu
Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets.
1 code implementation • 2 Jul 2024 • Bohan Li, Jiajun Deng, Wenyao Zhang, Zhujin Liang, Dalong Du, Xin Jin, Wenjun Zeng
To address this problem, we present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Ranked #11 on
3D Semantic Scene Completion
on SemanticKITTI
no code implementations • 14 Jun 2024 • Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu
We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model.
no code implementations • 14 Jun 2024 • Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu
This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems.
1 code implementation • 9 Apr 2024 • Henan Wang, Hanxin Zhu, Tianyu He, Runsen Feng, Jiajun Deng, Jiang Bian, Zhibo Chen
3D Gaussian Splatting (3DGS) has become an emerging technique with remarkable potential in 3D representation and image rendering.
1 code implementation • 18 Mar 2024 • Sha Zhang, Jiajun Deng, Lei Bai, Houqiang Li, Wanli Ouyang, Yanyong Zhang
We present a hybrid-view-based knowledge distillation framework, termed HVDistill, to guide the feature learning of a point cloud neural network with a pre-trained image network in an unsupervised man- ner.
no code implementations • 18 Mar 2024 • Sha Zhang, Di Huang, Jiajun Deng, Shixiang Tang, Wanli Ouyang, Tong He, Yanyong Zhang
The ability to understand and reason the 3D real world is a crucial milestone towards artificial general intelligence.
no code implementations • 14 Mar 2024 • Jiajun Deng, Sha Zhang, Feras Dayoub, Wanli Ouyang, Yanyong Zhang, Ian Reid
In particular, our PoIFusion follows the paradigm of query-based object detection, formulating object queries as dynamic 3D boxes and generating a set of PoIs based on each query box.
1 code implementation • 29 Feb 2024 • Hao Feng, Wendi Wang, Shaokai Liu, Jiajun Deng, Wengang Zhou, Houqiang Li
In this work, we present DeepEraser, an effective deep network for generic text removal.
no code implementations • 23 Dec 2023 • Ning Wang, Jiajun Deng, Mingbo Jia
The proposed framework (1) allows the semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; (3) yields a general captioning model that can describe arbitrary image regions.
no code implementations • 14 Dec 2023 • Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu
Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity.
1 code implementation • 18 Nov 2023 • Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, Yan Chen
We apply the FlashOCC to diverse occupancy prediction baselines on the challenging Occ3D-nuScenes benchmarks and conduct extensive experiments to validate the effectiveness.
no code implementations • 24 Oct 2023 • Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li
Different from existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD, the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training.
no code implementations • ICCV 2023 • Xuanyu Yi, Jiajun Deng, Qianru Sun, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model.
no code implementations • ICCV 2023 • Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li
To make the best of such rectification cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning.
1 code implementation • ICCV 2023 • Yunyao Mao, Jiajun Deng, Wengang Zhou, Yao Fang, Wanli Ouyang, Houqiang Li
To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints.
Few-Shot Skeleton-Based Action Recognition
motion prediction
+2
1 code implementation • ICCV 2023 • Yufei Yin, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li
These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance.
no code implementations • 6 Jul 2023 • Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date.
no code implementations • 27 Jun 2023 • Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui Jin, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu
Automatic recognition of disordered and elderly speech remains highly challenging tasks to date due to data scarcity.
no code implementations • 26 Jun 2023 • Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu
Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies.
no code implementations • 23 Jun 2023 • Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen, Xunying Liu
Current ASR systems are mainly trained and evaluated at the utterance level.
1 code implementation • CVPR 2023 • Yingjie Wang, Jiajun Deng, Yao Li, Jinshui Hu, Cong Liu, Yu Zhang, Jianmin Ji, Wanli Ouyang, Yanyong Zhang
LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object's 3D shape while Radar provides longer detection ranges as well as velocity hints.
no code implementations • 18 May 2023 • Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu
A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity.
1 code implementation • 18 Apr 2023 • Hao Feng, Shaokai Liu, Jiajun Deng, Wengang Zhou, Houqiang Li
To our best knowledge, this is the first learning-based method for the rectification of unrestricted document images.
Ranked #1 on
Local Distortion
on DocUNet
no code implementations • 28 Feb 2023 • Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng
Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest TDNN and Conformer ASR systems integrated domain adapted wav2vec2. 0 models consistently outperform the standalone wav2vec2. 0 models by statistically significant WER reductions of 8. 22% and 3. 43% absolute (26. 71% and 15. 88% relative) on the two tasks respectively.
1 code implementation • 15 Feb 2023 • Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, Xunying Liu
Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • 21 Jan 2023 • Hao Feng, Keyi Zhou, Wengang Zhou, Yufei Yin, Jiajun Deng, Qi Sun, Houqiang Li
It maintains a single estimate of the contour that is progressively deformed toward the object boundary.
Ranked #1 on
Semantic Contour Prediction
on Sbd val
no code implementations • 13 Jan 2023 • Xiaomeng Chu, Jiajun Deng, Yuan Zhao, Jianmin Ji, Yu Zhang, Houqiang Li, Yanyong Zhang
To this end, we propose OA-BEV, a network that can be plugged into the BEV-based 3D object detection framework to bring out the objects by incorporating object-aware pseudo-3D features and depth features.
1 code implementation • ICCV 2023 • Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu
Although 3D measurements are not available at the inference time of monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions.
1 code implementation • 27 Nov 2022 • Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu
Although 3D measurements are not available at the inference time of monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions.
no code implementations • 3 Nov 2022 • Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu
After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27. 78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57. 31% on the subset of speakers with "Very Low" intelligibility.
1 code implementation • 29 Oct 2022 • Yi Wang, Jiajun Deng, Tianzi Wang, Bo Zheng, Shoukang Hu, Xunying Liu, Helen Meng
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and to delay further progression.
2 code implementations • 15 Oct 2022 • Hao Feng, Wengang Zhou, Jiajun Deng, Yuechen Wang, Houqiang Li
In document image rectification, there exist rich geometric constraints between the distorted image and the ground truth one.
1 code implementation • 26 Aug 2022 • Yunyao Mao, Wengang Zhou, Zhenbo Lu, Jiajun Deng, Houqiang Li
In this work, we formulate the cross-modal interaction as a bidirectional knowledge distillation problem.
3D Action Recognition
Few-Shot Skeleton-Based Action Recognition
+3
no code implementations • 24 Jun 2022 • Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng
A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 23 Jun 2022 • Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng
Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 23 Jun 2022 • Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, Helen Meng
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care to delay further progression.
no code implementations • 15 Jun 2022 • Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Xunying Liu, Helen Meng
Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
1 code implementation • 14 Jun 2022 • Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang
For another, we devise Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers.
no code implementations • 13 May 2022 • Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu
Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech, accurate recognition of dysarthric and elderly speech remains highly challenging tasks to date.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 5 Apr 2022 • Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng
Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+3
1 code implementation • 8 Jan 2022 • Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng
State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+2
no code implementations • 29 Nov 2021 • Hanqi Zhu, Jiajun Deng, Yu Zhang, Jianmin Ji, Qiuyu Mao, Houqiang Li, Yanyong Zhang
However, this approach often suffers from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance.
3 code implementations • 28 Oct 2021 • Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li
The iterative refinements make DocScanner converge to a robust and superior rectification performance, while the lightweight recurrent architecture ensures the running efficiency.
2 code implementations • 25 Oct 2021 • Hao Feng, Yuechen Wang, Wengang Zhou, Jiajun Deng, Houqiang Li
Specifically, DocTr consists of a geometric unwarping transformer and an illumination correction transformer.
1 code implementation • 30 Jul 2021 • Jiajun Deng, Wengang Zhou, Yanyong Zhang, Houqiang Li
To this end, in this work, we regard point clouds as hollow-3D data and propose a new architecture, namely Hallucinated Hollow-3D R-CNN ($\text{H}^2$3D R-CNN), to address the problem of 3D object detection.
1 code implementation • 6 Jul 2021 • Xiaomeng Chu, Jiajun Deng, Yao Li, Zhenxun Yuan, Yanyong Zhang, Jianmin Ji, Yu Zhang
As cameras are increasingly deployed in new application domains such as autonomous driving, performing 3D object detection on monocular images becomes an important task for visual scene understanding.
1 code implementation • 30 Jun 2021 • Yuechen Wang, Jiajun Deng, Wengang Zhou, Houqiang Li
To this end, we introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
no code implementations • 24 Jun 2021 • Yingjie Wang, Qiuyu Mao, Hanqi Zhu, Jiajun Deng, Yu Zhang, Jianmin Ji, Houqiang Li, Yanyong Zhang
In this survey, we first introduce the background of popular sensors used for self-driving, their data properties, and the corresponding object detection algorithms.
2 code implementations • ICCV 2021 • Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image.
Ranked #18 on
Referring Expression Comprehension
on RefCOCO
1 code implementation • 31 Jan 2021 • Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, Hongsheng Li
3D object detection is receiving increasing attention from both industry and academia thanks to its wide applications in various fields.
Ranked #2 on
3D Object Detection
on KITTI Cars Easy val
5 code implementations • 31 Dec 2020 • Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, Houqiang Li
In this paper, we take a slightly different viewpoint -- we find that precise positioning of raw points is not essential for high performance 3D object detection and that the coarse voxel granularity can also offer sufficient detection accuracy.
Ranked #4 on
3D Object Detection
on KITTI Cars Moderate val
1 code implementation • 15 Oct 2020 • Jinhua Zhu, Yingce Xia, Lijun Wu, Jiajun Deng, Wengang Zhou, Tao Qin, Houqiang Li
During inference, the CNN encoder and the policy network are used to take actions, and the Transformer module is discarded.
1 code implementation • 7 Jul 2020 • Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, Tao Mei
Single shot detectors that are potentially faster and simpler than two-stage detectors tend to be more applicable to object detection in videos.
1 code implementation • ECCV 2020 • Tianlang Chen, Jiajun Deng, Jiebo Luo
For each image or text anchor in a training mini-batch, the model is trained to distinguish between a positive and the most confusing negative of the anchor mined from the mini-batch (i. e. online hard negative).
2 code implementations • ICCV 2019 • Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, Tao Mei
In this paper, we introduce a new design to capture the interactions across the objects in spatio-temporal context.