1 code implementation • COLING 2022 • Yuxuan Wang, Zhilin Lei, Yuqiu Ji, Wanxiang Che
Annotation conversion is an effective way to construct datasets under new annotation guidelines based on existing datasets with little human labour.
1 code implementation • 30 Nov 2023 • Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tian, Yuxuan Wang
Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen.
no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu
Recently, various studies have leveraged Large Language Models (LLMs) to help with decision-making and planning in environments, and have tried to align the LLMs' knowledge with world conditions.
no code implementations • 27 Sep 2023 • Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao
Text-to-image generation has been studied extensively in a variety of works.
no code implementations • 28 Aug 2023 • Bing Han, Junyu Dai, Xuchen Song, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian
We tested InstructME in instrument-editing, remixing, and multi-round editing.
1 code implementation • 22 Aug 2023 • Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu, Pingchuan Tian, Yuping Wang, Yuxuan Wang
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
1 code implementation • 10 Aug 2023 • Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model.
Ranked #2 on Audio Generation on AudioCaps
1 code implementation • 9 Aug 2023 • Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
1 code implementation • 5 Jun 2023 • Yuxuan Wang, Hong Lyu
The information retrieval community has made significant progress in improving the efficiency of Dual Encoder (DE) dense passage retrieval systems, making them suitable for latency-sensitive settings.
no code implementations • 5 Jun 2023 • Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
no code implementations • 4 Jun 2023 • Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng
We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding.
1 code implementation • 30 May 2023 • Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao
Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues.
1 code implementation • 30 May 2023 • Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
no code implementations • 19 May 2023 • Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang
Moreover, on 3 of the 4 languages, the approach outperforms the standard HuBERT while saving up to 1.5k hours (75%) of supervised training data.
no code implementations • 19 May 2023 • Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang
Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.
no code implementations • 18 May 2023 • Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, YuanYuan Huo, Yuxuan Wang
The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes.
no code implementations • 30 Dec 2022 • Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR).
no code implementations • 12 Dec 2022 • Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang
The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity.
no code implementations • 11 Nov 2022 • Yuxuan Wang, Feng Dong, Jinchao Zhu
However, most related works are based on RGB images, which lose a great deal of useful information.
no code implementations • 27 Oct 2022 • Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang
In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs.
no code implementations • 22 Oct 2022 • Xueliang Zhao, Yuxuan Wang, Chongyang Tao, Chenshuo Wang, Dongyan Zhao
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
no code implementations • 21 Sep 2022 • Huanhai Xin, Yuxuan Wang, Xia Chen, Eduardo Prieto-Araujo, Linbin Huang
Based on our analysis, we further study the problem of how to configure GFM converters in the grid and how many GFM converters we will need.
1 code implementation • 27 Aug 2022 • Giorgio Severi, Matthew Jagielski, Gökberk Yar, Yuxuan Wang, Alina Oprea, Cristina Nita-Rotaru
Federated learning is a popular strategy for training models on distributed, sensitive data, while preserving data privacy.
1 code implementation • 24 Aug 2022 • Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou
However, CL on VQA involves not only the expansion of label sets (new Answer sets).
1 code implementation • CVPR 2022 • Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc van Gool, Bernt Schiele, Federico Tombari, Fisher Yu
Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous driving systems.
1 code implementation • 12 Apr 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang
Speech restoration aims to remove distortions in speech signals.
1 code implementation • 1 Apr 2022 • Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou
In this paper, we introduce a new dataset called Kinetic-GEB+.
Ranked #1 on Text to Video Retrieval on Kinetics-GEB+ (text-to-video R@1 metric)
no code implementations • 10 Feb 2022 • Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee
We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.
2 code implementations • 30 Nov 2021 • Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou
In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).
no code implementations • NeurIPS 2021 • Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech.
1 code implementation • 13 Oct 2021 • Guangyi Yang, Yang Zhan, Yuxuan Wang
To fill this gap, we propose a deep adaptive superpixel-based network, namely DSN-IQA, to assess image quality based on multi-scale and superpixel segmentation.
no code implementations • 7 Oct 2021 • Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang
(2) How to clone a person's voice while controlling the style and prosody.
no code implementations • 1 Jul 2021 • Bochen Li, Yuxuan Wang, Zhiyao Duan
Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed improved performance from supervised training using deep learning techniques.
no code implementations • 27 May 2021 • Yu Chen, Yuxuan Wang, Bolin Lai, Zijie Chen, Xu Cao, Nanyang Ye, Zhongyuan Ren, Junbo Zhao, Xiao-Yun Zhou, Peng Qi
In modern medical care, venipuncture is an indispensable procedure for both diagnosis and treatment.
no code implementations • 27 May 2021 • Xu Cao, Zijie Chen, Bolin Lai, Yuxuan Wang, Yu Chen, Zhengqing Cao, Zhilin Yang, Nanyang Ye, Junbo Zhao, Xiao-Yun Zhou, Peng Qi
For the automation, we focus on the positioning part and propose a Dual-In-Dual-Out network based on two-step learning and two-task learning, which can achieve fully automatic regression of the suitable puncture area and angle from near-infrared (NIR) images.
no code implementations • 26 Mar 2021 • Ju-Chiang Wang, Jordan B. L. Smith, Jitong Chen, Xuchen Song, Yuxuan Wang
This paper presents a novel supervised approach to detecting the chorus segments in popular music.
no code implementations • 26 Mar 2021 • Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang
A music mashup combines audio elements from two or more songs to create a new work.
no code implementations • 19 Mar 2021 • Yuxuan Wang, Maokui He, Shutong Niu, Lei Sun, Tian Gao, Xin Fang, Jia Pan, Jun Du, Chin-Hui Lee
This system description presents our submission to the Third DIHARD Speech Diarization Challenge.
no code implementations • 2 Mar 2021 • Keunwoo Choi, Yuxuan Wang
Optionally, LRID-Net is facilitated with modality dropouts to handle a missing modality.
no code implementations • 28 Oct 2020 • Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang
Music classification is a task to classify a music piece into labels such as genres or composers.
3 code implementations • 11 Oct 2020 • Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang
In this article, we create a GiantMIDI-Piano (GP) dataset containing 38,700,838 transcribed notes and 10,855 unique solo piano works composed by 2,786 composers.
3 code implementations • 5 Oct 2020 • Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang
In addition, previous AMT systems are sensitive to the misaligned onset and offset labels of audio recordings.
no code implementations • ACL 2020 • Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yu-Ping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei Li
This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation.
no code implementations • 26 May 2020 • Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang
Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.
no code implementations • 19 May 2020 • Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
no code implementations • 6 May 2020 • Xiang-Yang Li, Guo Pu, Keyu Ming, Pu Li, Jie Wang, Yuxuan Wang
In traditional text style transfer models, text style generally relies on expert knowledge and hand-designed rules, but with the application of deep learning to natural language processing, text style transfer methods based on deep learning have started to be heavily researched.
no code implementations • 28 Apr 2020 • Shan Yang, Yuxuan Wang, Lei Xie
As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model.
no code implementations • 23 Apr 2020 • Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.
2 code implementations • 31 Jan 2020 • Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng
Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment.
no code implementations • 11 Nov 2019 • Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang
In Mandarin text-to-speech (TTS) systems, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.
no code implementations • 11 Nov 2019 • Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma
In this paper, we propose a hybrid text normalization system using multi-head self-attention.
no code implementations • CONLL 2019 • Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang, Yijia Liu, Ting Liu
This paper describes our system (HIT-SCIR) for the CoNLL 2019 shared task: Cross-Framework Meaning Representation Parsing.
Ranked #1 on UCCA Parsing on CoNLL 2019
1 code implementation • IJCNLP 2019 • Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu
In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages.
2 code implementations • ICLR 2019 • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
no code implementations • 30 Aug 2018 • Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan
We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.
no code implementations • 4 Aug 2018 • Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan
GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.
1 code implementation • CONLL 2018 • Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, Ting Liu
This paper describes our system (HIT-SCIR) submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies.
Ranked #3 on Dependency Parsing on Universal Dependencies
2 code implementations • ICML 2018 • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.
11 code implementations • ICML 2018 • Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.
30 code implementations • 16 Dec 2017 • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.
Ranked #2 on Speech Synthesis on North American English
no code implementations • 1 Nov 2017 • Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous
Prosodic modeling is a core problem in speech synthesis.
no code implementations • CONLL 2017 • Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, Ting Liu
Our system includes three pipelined components: tokenization, Part-of-Speech (POS) tagging and dependency parsing.
29 code implementations • 29 Mar 2017 • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #5 on Speech Synthesis on North American English
2 code implementations • 19 Jul 2016 • Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous
Robust and far-field speech recognition is critical to enable true hands-free communication.
no code implementations • NeurIPS 2012 • Yuxuan Wang, DeLiang Wang
While human listeners excel at selectively attending to a conversation at a cocktail party, machine performance is still far inferior by comparison.