1 code implementation • ICML 2020 • Jianshu Zhang, Jun Du, Yongxin Yang, Yi-Zhe Song, Si Wei, Li-Rong Dai
Recent encoder-decoder approaches typically employ string decoders to convert images into serialized strings for image-to-markup generation.
1 code implementation • 21 Dec 2024 • Jian Zhu, Xin Zou, Lei Liu, Zhangmin Huang, Ying Zhang, Chang Tang, Li-Rong Dai
The reasons for this problem are as follows: 1) The current methods ignore the presence of noise or redundant information in the view; 2) The similarity of contrastive learning comes from the same sample rather than the same cluster in deep multi-view clustering.
1 code implementation • 12 Dec 2023 • Jian Zhu, Yu Cui, Zhangmin Huang, Xingyu Li, Lei Liu, Lingfang Zeng, Li-Rong Dai
Furthermore, an adaptive confidence multi-view network is employed to measure the confidence of each view and then fuse multi-view features through a weighted summation.
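The weighted-summation fusion step described above can be sketched as follows. This is a minimal NumPy illustration; the function name `fuse_views` and the fixed confidence values are assumptions, since in the paper the per-view confidences are estimated by the adaptive confidence network itself:

```python
import numpy as np

def fuse_views(views, confidences):
    """Fuse per-view feature vectors by confidence-weighted summation.

    views: list of (d,) feature arrays, one per view.
    confidences: list of non-negative scalars, one per view.
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                      # normalize confidences into weights
    return sum(wi * v for wi, v in zip(w, views))

# Two 3-d views; the second view is trusted twice as much.
fused = fuse_views([np.array([1.0, 0.0, 2.0]),
                    np.array([3.0, 3.0, 0.0])],
                   confidences=[1.0, 2.0])
```

Normalizing the confidences keeps the fused feature on the same scale as the individual views regardless of how many views are present.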
no code implementations • 21 May 2023 • Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen, Shiliang Zhang, Jie Zhang, Li-Rong Dai
For speech interaction, voice activity detection (VAD) is often used as a front-end.
no code implementations • 21 May 2023 • Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai
In addition, a two-pass decoding strategy is further proposed to fully leverage the contextual modeling ability resulting in a better recognition performance.
no code implementations • 20 May 2023 • Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, Ian McLoughlin
In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD).
no code implementations • 7 Mar 2023 • Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu
In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for audio tagging (AT) task, termed AST-SED.
no code implementations • 1 Nov 2022 • Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai
Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks.
Automatic Speech Recognition (ASR) +4
1 code implementation • 27 Oct 2022 • Qiu-Shi Zhu, Long Zhou, Jie Zhang, Shu-Jie Liu, Yu-Chen Hu, Li-Rong Dai
Self-supervised pre-training methods based on contrastive learning or regression tasks can utilize more unlabeled data to improve the performance of automatic speech recognition (ASR).
Automatic Speech Recognition (ASR) +5
no code implementations • 26 May 2022 • Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang, Li-Rong Dai
Speech enhancement (SE) is usually required as a front end to improve the speech quality in noisy environments, while the enhanced speech might not be optimal for automatic speech recognition (ASR) systems due to speech distortion.
Automatic Speech Recognition (ASR) +2
no code implementations • 5 Apr 2022 • Ye-Qian Du, Jie Zhang, Qiu-Shi Zhu, Li-Rong Dai, Ming-Hui Wu, Xin Fang, Zhou-Wang Yang
Unpaired data has been shown to be beneficial for low-resource automatic speech recognition (ASR), and can be incorporated into the design of hybrid models via multi-task training or language-model-dependent pre-training.
Automatic Speech Recognition (ASR) +3
no code implementations • 15 Feb 2022 • Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai
The proposed approach explores both the complementarity of audio-visual modalities and long-term context dependency using a transformer-based fusion module and a flexible masking strategy.
no code implementations • 22 Jan 2022 • Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang, Li-Rong Dai
By training the network on the acoustic signals of each task separately, we build individual models for the three tasks; their parameters are then averaged to obtain an average model, which serves as the initialization for the BiLSTM model training of each task.
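The parameter-averaging step can be sketched as below. `average_parameters` is an illustrative name, and the toy dicts stand in for the three task-specific networks; only the element-wise mean over matching parameters is taken from the text above:

```python
import numpy as np

def average_parameters(models):
    """Average the parameters of several task-specific models.

    models: list of dicts mapping parameter name -> np.ndarray.
    Returns one dict with the element-wise mean, usable as a shared
    initialization for subsequent per-task training.
    """
    names = models[0].keys()
    return {n: np.mean([m[n] for m in models], axis=0) for n in names}

# Three toy "task models" with a single weight matrix each.
tasks = [{"W": np.full((2, 2), v)} for v in (0.0, 3.0, 6.0)]
init = average_parameters(tasks)
```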
no code implementations • 22 Jan 2022 • Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai
In this work, we therefore first analyze the noise robustness of wav2vec 2.0 via experiments.
Automatic Speech Recognition (ASR) +3
no code implementations • 15 Mar 2021 • Zi-Qiang Zhang, Yan Song, Ming-Hui Wu, Xin Fang, Li-Rong Dai
In this paper, we propose a weakly supervised multilingual representation learning framework, called cross-lingual self-training (XLST).
no code implementations • 28 Dec 2020 • Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Chin-Hui Lee, Bao-Cai Yin
In this paper, we propose a novel deep learning architecture to improve word-level lip-reading.
no code implementations • 21 Sep 2020 • Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
no code implementations • 3 Sep 2020 • Jing-Xuan Zhang, Li-Juan Liu, Yan-Nian Chen, Ya-Jun Hu, Yuan Jiang, Zhen-Hua Ling, Li-Rong Dai
In this paper, we present an ASR-TTS method for voice conversion, which uses the iFLYTEK ASR engine to transcribe the source speech into text and a Transformer TTS model with a WaveNet vocoder to synthesize the converted speech from the decoded text.
Automatic Speech Recognition (ASR) +5
no code implementations • 6 Aug 2020 • Liangfa Wei, Jie Zhang, JunFeng Hou, Li-Rong Dai
The proposed method can sufficiently combine the two streams and weaken the over-reliance on the audio modality.
1 code implementation • 25 Jun 2019 • Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones.
Audio and Speech Processing • Sound
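The recombination described above can be sketched as a small pipeline. `content_enc`, `speaker_enc`, and `decoder` are hypothetical stand-ins for the paper's trained modules; only the swap of speaker representations while keeping linguistic ones is taken from the text:

```python
def convert(src_utt, tgt_utt, content_enc, speaker_enc, decoder):
    """Voice conversion by recombining disentangled representations."""
    linguistic = content_enc(src_utt)    # what is said: from the source
    speaker = speaker_enc(tgt_utt)       # who says it: from the target
    return decoder(linguistic, speaker)  # converted speech features

# Toy stand-ins: an "utterance" is a dict holding its text and speaker.
out = convert({"text": "hello", "spk": "A"},
              {"text": "anything", "spk": "B"},
              content_enc=lambda u: u["text"],
              speaker_enc=lambda u: u["spk"],
              decoder=lambda l, s: {"text": l, "spk": s})
# out carries the source's content with the target's speaker identity.
```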
no code implementations • 21 Jun 2019 • Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, Li-Rong Dai
This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS).
no code implementations • 28 Mar 2019 • Lanhua You, Wu Guo, Li-Rong Dai, Jun Du
In this paper, gating mechanisms are applied in deep neural network (DNN) training for x-vector-based text-independent speaker verification.
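As a rough illustration of a gating mechanism in a feed-forward layer (the exact placement and parameterization used in the paper are not given here, so every name and shape below is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_layer(h, W_t, W_g):
    """A gated feed-forward layer: a sigmoid gate scales the transform.

    h: (d,) input activation; W_t, W_g: (d, d) transform and gate weights.
    The gate lies in (0, 1) per dimension and modulates information flow.
    """
    transform = np.tanh(W_t @ h)     # candidate activation
    gate = sigmoid(W_g @ h)          # per-dimension gate in (0, 1)
    return gate * transform          # element-wise gating
```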
no code implementations • 18 Jul 2018 • Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai
This paper proposes a forward attention method for the sequence-to-sequence acoustic modeling of speech synthesis.
1 code implementation • 4 Mar 2018 • Shiliang Zhang, Ming Lei, Zhijie Yan, Li-Rong Dai
On a 20,000-hour Mandarin recognition task, the LFR-trained DFSMN can achieve a relative improvement of more than 20% over the LFR-trained BLSTM.
no code implementations • 22 Jan 2018 • Jianshu Zhang, Yixing Zhu, Jun Du, Li-Rong Dai
The RNN decoder aims at generating the caption by detecting radicals and spatial structures through an attention model.
2 code implementations • 5 Jan 2018 • Jianshu Zhang, Jun Du, Li-Rong Dai
Handwritten mathematical expression recognition is a challenging problem due to the complicated two-dimensional structures, ambiguous handwriting input and variant scales of handwritten math symbols.
1 code implementation • 4 Dec 2017 • Jianshu Zhang, Jun Du, Li-Rong Dai
In this study, we present a novel end-to-end approach based on the encoder-decoder framework with the attention mechanism for online handwritten mathematical expression recognition (OHMER).
no code implementations • 3 Nov 2017 • Jianshu Zhang, Yixing Zhu, Jun Du, Li-Rong Dai
Chinese characters comprise a huge set of categories, more than 20,000, and the number is still increasing as novel characters continue to be created.
no code implementations • 21 Mar 2017 • Yong Xu, Jun Du, Zhen Huang, Li-Rong Dai, Chin-Hui Lee
We propose a multi-objective framework to learn both secondary targets not directly related to the intended task of speech enhancement (SE) and the primary target of the clean log-power spectra (LPS) features to be used directly for constructing the enhanced speech signals.
Sound
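The multi-objective idea above, a weighted combination of the primary LPS regression loss with a secondary loss, can be sketched as follows. The function name, the MSE form of the secondary term, and the `alpha` weight are illustrative assumptions:

```python
import numpy as np

def multi_objective_loss(pred_lps, clean_lps, pred_sec, sec_target, alpha=0.1):
    """Weighted sum of the primary LPS loss and a secondary-target loss.

    pred_lps / clean_lps: predicted and clean log-power spectra (primary).
    pred_sec / sec_target: prediction and target for a secondary feature
        (e.g. a feature not directly used to reconstruct the waveform).
    alpha: weight trading off the secondary objective against the primary.
    """
    primary = np.mean((pred_lps - clean_lps) ** 2)
    secondary = np.mean((pred_sec - sec_target) ** 2)
    return primary + alpha * secondary
```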
no code implementations • 14 Mar 2017 • Junbei Zhang, Xiaodan Zhu, Qian Chen, Li-Rong Dai, Si Wei, Hui Jiang
The last several years have seen intensive interest in exploring neural-network-based models for machine comprehension (MC) and question answering (QA).
Ranked #39 on Question Answering on SQuAD1.1 dev
no code implementations • 28 Dec 2015 • Shiliang Zhang, Cong Liu, Hui Jiang, Si Wei, Li-Rong Dai, Yu Hu
In this paper, we propose a novel neural network structure, namely \emph{feedforward sequential memory networks (FSMN)}, to model long-term dependency in time series without using recurrent feedback.
no code implementations • 9 Oct 2015 • ShiLiang Zhang, Hui Jiang, Si Wei, Li-Rong Dai
We introduce a new structure for memory neural networks, called feedforward sequential memory networks (FSMN), which can learn long-term dependency without using recurrent feedback.
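The FSMN memory block is, in essence, a learnable tapped delay line over hidden activations rather than a recurrent connection. A minimal scalar-coefficient sketch with past-only (unidirectional) taps, where the tap count and coefficients are illustrative:

```python
import numpy as np

def fsmn_memory(h, a):
    """FSMN-style memory block: a tapped delay line over hidden states.

    h: (T, d) sequence of hidden activations.
    a: (N + 1,) tap coefficients over the current and N past frames.
    Returns (T, d) memory outputs computed with no recurrent feedback.
    """
    out = np.zeros_like(h)
    for t in range(h.shape[0]):
        for i, ai in enumerate(a):
            if t - i >= 0:
                out[t] += ai * h[t - i]   # weighted sum over a fixed window
    return out
```

Because the memory is a finite weighted sum over past frames, the whole computation stays feed-forward and can be trained without back-propagation through time.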
1 code implementation • 6 May 2015 • Shiliang Zhang, Hui Jiang, MingBin Xu, JunFeng Hou, Li-Rong Dai
In this paper, we propose the new fixed-size ordinally-forgetting encoding (FOFE) method, which can almost uniquely encode any variable-length sequence of words into a fixed-size representation.
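FOFE encodes a word sequence with the recursion z_t = alpha * z_(t-1) + e_t, where e_t is the one-hot vector of the t-th word and 0 < alpha < 1 is the forgetting factor, yielding a fixed-size, (almost) unique code for any sequence length. A minimal sketch:

```python
import numpy as np

def fofe(word_ids, vocab_size, alpha=0.5):
    """Fixed-size ordinally-forgetting encoding of a word sequence.

    z_t = alpha * z_{t-1} + e_t, with e_t the one-hot of word t.
    Returns one (vocab_size,) vector regardless of sequence length.
    """
    z = np.zeros(vocab_size)
    for w in word_ids:
        z = alpha * z    # decay the contribution of earlier words
        z[w] += 1.0      # add the current word's one-hot vector
    return z

# Order matters: the same words in different order encode differently.
a = fofe([0, 1], vocab_size=3)   # [0.5, 1.0, 0.0]
b = fofe([1, 0], vocab_size=3)   # [1.0, 0.5, 0.0]
```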