29 code implementations • 29 Mar 2017 • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
Ranked #5 on Speech Synthesis on North American English
30 code implementations • 16 Dec 2017 • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.
Ranked #2 on Speech Synthesis on North American English
11 code implementations • ICML 2018 • Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.
1 code implementation • 10 Aug 2023 • Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model.
Ranked #3 on Audio Generation on AudioCaps
3 code implementations • 11 Oct 2020 • Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang
In this article, we create a GiantMIDI-Piano (GP) dataset containing 38, 700, 838 transcribed notes and 10, 855 unique solo piano works composed by 2, 786 composers.
3 code implementations • 5 Oct 2020 • Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang
In addition, previous AMT systems are sensitive to the misaligned onset and offset labels of audio recordings.
Sound Audio and Speech Processing
1 code implementation • CONLL 2018 • Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, Ting Liu
This paper describes our system (HIT-SCIR) submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies.
Ranked #3 on Dependency Parsing on Universal Dependencies
1 code implementation • 9 Aug 2023 • Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.
1 code implementation • 12 Apr 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang
Speech restoration aims to remove distortions in speech signals.
2 code implementations • ICML 2018 • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.
2 code implementations • 31 Jan 2020 • Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng
Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment.
1 code implementation • CVPR 2022 • Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc van Gool, Bernt Schiele, Federico Tombari, Fisher Yu
Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous driving systems.
2 code implementations • 19 Jul 2016 • Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous
Robust and far-field speech recognition is critical to enable true hands-free communication.
1 code implementation • IJCNLP 2019 • Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu
In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages.
1 code implementation • 24 Aug 2022 • Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou
However, CL on VQA involves not only the expansion of label sets (new Answer sets).
2 code implementations • 30 Nov 2021 • Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou
In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).
1 code implementation • 15 Mar 2024 • Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.
Ranked #3 on Video Question Answering on MVBench
1 code implementation • 1 Apr 2022 • Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou
In this paper, we introduce a new dataset called Kinetic-GEB+.
Ranked #1 on Boundary Captioning on Kinetics-GEB+
1 code implementation • 25 Feb 2024 • Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng
Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time.
2 code implementations • ICLR 2019 • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang
This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
1 code implementation • 22 Aug 2023 • Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu, Pingchuan Tian, Yuping Wang, Yuxuan Wang
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
1 code implementation • 30 May 2023 • Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao
Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues.
1 code implementation • 5 Jun 2023 • Yuxuan Wang, Hong Lyu
The information retrieval community has made significant progress in improving the efficiency of Dual Encoder (DE) dense passage retrieval systems, making them suitable for latency-sensitive settings.
1 code implementation • 13 Oct 2021 • Guangyi Yang, Yang Zhan., Yuxuan Wang
In order to fill this gap, we propose a deep adaptive superpixel-based network, namely DSN-IQA, to assess the quality of image based on multi-scale and superpixel segmentation.
1 code implementation • 30 May 2023 • Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
1 code implementation • 8 Jan 2024 • Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos.
1 code implementation • 27 Aug 2022 • Giorgio Severi, Matthew Jagielski, Gökberk Yar, Yuxuan Wang, Alina Oprea, Cristina Nita-Rotaru
Federated learning is a popular strategy for training models on distributed, sensitive data, while preserving data privacy.
1 code implementation • COLING 2022 • Yuxuan Wang, Zhilin Lei, Yuqiu Ji, Wanxiang Che
Annotation conversion is an effective way to construct datasets under new annotation guidelines based on existing datasets with little human labour.
no code implementations • 1 Nov 2017 • Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous
Prosodic modeling is a core problem in speech synthesis.
no code implementations • 4 Aug 2018 • Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan
GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.
no code implementations • 30 Aug 2018 • Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan
We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.
no code implementations • CONLL 2017 • Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, Ting Liu
Our system includes three pipelined components: \textit{tokenization}, \textit{Part-of-Speech} (POS) \textit{tagging} and \textit{dependency parsing}.
no code implementations • NeurIPS 2012 • Yuxuan Wang, DeLiang Wang
While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison.
no code implementations • CONLL 2019 • Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang, Yijia Liu, Ting Liu
This paper describes our system (HIT-SCIR) for CoNLL 2019 shared task: Cross-Framework Meaning Representation Parsing.
Ranked #1 on UCCA Parsing on CoNLL 2019
no code implementations • 11 Nov 2019 • Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma
In this paper, we propose a hybrid text normalization system using multi-head self-attention.
no code implementations • 11 Nov 2019 • Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang
In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.
no code implementations • 23 Apr 2020 • Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.
no code implementations • 28 Apr 2020 • Shan Yang, Yuxuan Wang, Lei Xie
As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model.
no code implementations • 6 May 2020 • Xiang-Yang Li, Guo Pu, Keyu Ming, Pu Li, Jie Wang, Yuxuan Wang
In the traditional text style transfer model, the text style is generally relied on by experts knowledge and hand-designed rules, but with the application of deep learning in the field of natural language processing, the text style transfer method based on deep learning Started to be heavily researched.
no code implementations • 19 May 2020 • Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
no code implementations • 26 May 2020 • Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang
Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.
no code implementations • ACL 2020 • Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yu-Ping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei LI
This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation.
no code implementations • 28 Oct 2020 • Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang
Music classification is a task to classify a music piece into labels such as genres or composers.
no code implementations • 2 Mar 2021 • Keunwoo Choi, Yuxuan Wang
Optionally, LRID-Net is facilitated with modality dropouts to handle a missing modality.
no code implementations • 19 Mar 2021 • Yuxuan Wang, Maokui He, Shutong Niu, Lei Sun, Tian Gao, Xin Fang, Jia Pan, Jun Du, Chin-Hui Lee
This system description describes our submission system to the Third DIHARD Speech Diarization Challenge.
no code implementations • 26 Mar 2021 • Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang
A music mashup combines audio elements from two or more songs to create a new work.
no code implementations • 26 Mar 2021 • Ju-Chiang Wang, Jordan B. L. Smith, Jitong Chen, Xuchen Song, Yuxuan Wang
This paper presents a novel supervised approach to detecting the chorus segments in popular music.
no code implementations • 27 May 2021 • Xu Cao, Zijie Chen, Bolin Lai, Yuxuan Wang, Yu Chen, Zhengqing Cao, Zhilin Yang, Nanyang Ye, Junbo Zhao, Xiao-Yun Zhou, Peng Qi
For the automation, we focus on the positioning part and propose a Dual-In-Dual-Out network based on two-step learning and two-task learning, which can achieve fully automatic regression of the suitable puncture area and angle from near-infrared(NIR) images.
no code implementations • 27 May 2021 • Yu Chen, Yuxuan Wang, Bolin Lai, Zijie Chen, Xu Cao, Nanyang Ye, Zhongyuan Ren, Junbo Zhao, Xiao-Yun Zhou, Peng Qi
In the modern medical care, venipuncture is an indispensable procedure for both diagnosis and treatment.
no code implementations • 1 Jul 2021 • Bochen Li, Yuxuan Wang, Zhiyao Duan
Separating a song into vocal and accompaniment components is an active research topic, and recent years witnessed an increased performance from supervised training using deep learning techniques.
no code implementations • 7 Oct 2021 • Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang
(2) How to clone a person's voice while controlling the style and prosody.
no code implementations • NeurIPS 2021 • Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech.
no code implementations • 10 Feb 2022 • Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee
We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.
no code implementations • 21 Sep 2022 • Huanhai Xin, Yuxuan Wang, Xia Chen, Eduardo Prieto-Araujo, Linbin Huang
Based on our analysis, we further study the problem of how to configure GFM converters in the grid and how many GFM converters we will need.
no code implementations • 22 Oct 2022 • Xueliang Zhao, Yuxuan Wang, Chongyang Tao, Chenshuo Wang, Dongyan Zhao
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
no code implementations • 27 Oct 2022 • Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang
In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs.
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 11 Nov 2022 • Yuxuan Wang, Feng Dong, Jinchao Zhu
However, most related works are based on RGB images, which lose massive useful information.
no code implementations • 12 Dec 2022 • Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang
The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity.
no code implementations • 30 Dec 2022 • Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang
Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR).
Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
no code implementations • 18 May 2023 • Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, YuanYuan Huo, Yuxuan Wang
The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes.
no code implementations • 19 May 2023 • Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang
Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.
no code implementations • 19 May 2023 • Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang
Moreover, on 3 of the 4 languages, comparing to the standard HuBERT, the approach performs better, meanwhile is able to save supervised training data by 1. 5k hours (75%) at most.
no code implementations • 4 Jun 2023 • Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng
We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding.
no code implementations • 5 Jun 2023 • Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
no code implementations • 28 Aug 2023 • Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song
We tested InstructME in instrument-editing, remixing, and multi-round editing.
no code implementations • 27 Sep 2023 • Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao
In this paper, we focus on the innovative dialog-to-image generation task, where the model synthesizes a high-resolution image aligned with the given dialog context as a response.
no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu
Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, and try to align the LLMs' knowledge with the world conditions.
1 code implementation • 30 Nov 2023 • Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tain, Yuxuan Wang
Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen.
no code implementations • 24 Dec 2023 • Jinchao Zhu, Yuxuan Wang, Xiaobing Tu, Siyuan Pan, Pengfei Wan, Gao Huang
The Stable Diffusion Model (SDM) is a popular and efficient text-to-image (t2i) generation and image-to-image (i2i) generation model.
no code implementations • 19 Jan 2024 • Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Zhuo Chen, Lei Xie, Yuping Wang, Yuxuan Wang
Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression which eliminates the dependence on complete source speech.
no code implementations • 4 Feb 2024 • Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Yunzhong Qiu, Li Zhang, Jianmin Wang, Mingsheng Long
To emphasize temporal correlation modeling, this paper proposes TimeSiam as a simple but effective self-supervised pre-training framework for Time series based on Siamese networks.
no code implementations • 29 Feb 2024 • Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong liu, Yunzhong Qiu, Haoran Zhang, Jianmin Wang, Mingsheng Long
Experimentally, TimeXer significantly improves time series forecasting with exogenous variables and achieves consistent state-of-the-art performance in twelve real-world forecasting benchmarks.
no code implementations • 18 Mar 2024 • Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang
The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations.
no code implementations • 24 Mar 2024 • Yuxuan Wang, Xiaoyuan Liu
After that, we ensemble VLMs with SGG models to enhance representation.
no code implementations • 10 Apr 2024 • Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma
We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre.