Search Results for author: Yuxuan Wang

Found 66 papers, 27 papers with code

Simple and Effective Graph-to-Graph Annotation Conversion

1 code implementation COLING 2022 Yuxuan Wang, Zhilin Lei, Yuqiu Ji, Wanxiang Che

Annotation conversion is an effective way to construct datasets under new annotation guidelines based on existing datasets with little human labour.

Audio Prompt Tuning for Universal Sound Separation

1 code implementation30 Nov 2023 Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tain, Yuxuan Wang

Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen.

LLaMA Rider: Spurring Large Language Models to Explore the Open World

no code implementations13 Oct 2023 Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu

Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, and try to align the LLMs' knowledge with the world conditions.

Decision Making

Teaching Text-to-Image Models to Communicate

no code implementations27 Sep 2023 Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao

Various works have been extensively studied in the research of text-to-image generation.

Separate Anything You Describe

1 code implementation9 Aug 2023 Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.

Audio Source Separation Natural Language Queries +1

Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency

1 code implementation5 Jun 2023 Yuxuan Wang, Hong Lyu

The information retrieval community has made significant progress in improving the efficiency of Dual Encoder (DE) dense passage retrieval systems, making them suitable for latency-sensitive settings.

Passage Retrieval Retrieval

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

no code implementations4 Jun 2023 Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng

We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding.

Benchmarking Contrastive Learning +1

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

1 code implementation30 May 2023 Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao

Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues.

Dialogue Generation Dialogue Understanding +2

Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

1 code implementation30 May 2023 Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng

We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.

Contrastive Learning

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

no code implementations19 May 2023 Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Moreover, on 3 of the 4 languages, comparing to the standard HuBERT, the approach performs better, meanwhile is able to save supervised training data by 1. 5k hours (75%) at most.

Self-Supervised Learning speech-recognition +1

Language-universal phonetic encoder for low-resource speech recognition

no code implementations19 May 2023 Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.

speech-recognition Speech Recognition

a unified front-end framework for english text-to-speech synthesis

no code implementations18 May 2023 Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, YuanYuan Huo, Yuxuan Wang

The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes.

Speech Synthesis Text-To-Speech Synthesis

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

no code implementations12 Dec 2022 Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang

The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity.

Data Augmentation Disentanglement

How Many Grid-Forming Converters do We Need? A Perspective From Small Signal Stability and Power Grid Strength

no code implementations21 Sep 2022 Huanhai Xin, Yuxuan Wang, Xia Chen, Eduardo Prieto-Araujo, Linbin Huang

Based on our analysis, we further study the problem of how to configure GFM converters in the grid and how many GFM converters we will need.

Network-Level Adversaries in Federated Learning

1 code implementation27 Aug 2022 Giorgio Severi, Matthew Jagielski, Gökberk Yar, Yuxuan Wang, Alina Oprea, Cristina Nita-Rotaru

Federated learning is a popular strategy for training models on distributed, sensitive data, while preserving data privacy.

Federated Learning

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

no code implementations10 Feb 2022 Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.

Action Detection Activity Detection +2

Neural Dubber: Dubbing for Videos According to Scripts

no code implementations NeurIPS 2021 Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech.

Deep Superpixel-based Network for Blind Image Quality Assessment

1 code implementation13 Oct 2021 Guangyi Yang, Yang Zhan., Yuxuan Wang

In order to fill this gap, we propose a deep adaptive superpixel-based network, namely DSN-IQA, to assess the quality of image based on multi-scale and superpixel segmentation.

Blind Image Quality Assessment

Audiovisual Singing Voice Separation

no code implementations1 Jul 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan

Separating a song into vocal and accompaniment components is an active research topic, and recent years witnessed an increased performance from supervised training using deep learning techniques.

VeniBot: Towards Autonomous Venipuncture with Automatic Puncture Area and Angle Regression from NIR Images

no code implementations27 May 2021 Xu Cao, Zijie Chen, Bolin Lai, Yuxuan Wang, Yu Chen, Zhengqing Cao, Zhilin Yang, Nanyang Ye, Junbo Zhao, Xiao-Yun Zhou, Peng Qi

For the automation, we focus on the positioning part and propose a Dual-In-Dual-Out network based on two-step learning and two-task learning, which can achieve fully automatic regression of the suitable puncture area and angle from near-infrared(NIR) images.

Navigate regression

Modeling the Compatibility of Stem Tracks to Generate Music Mashups

no code implementations26 Mar 2021 Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang

A music mashup combines audio elements from two or more songs to create a new work.

Listen, Read, and Identify: Multimodal Singing Language Identification of Music

no code implementations2 Mar 2021 Keunwoo Choi, Yuxuan Wang

Optionally, LRID-Net is facilitated with modality dropouts to handle a missing modality.

Language Identification

Large-Scale MIDI-based Composer Classification

no code implementations28 Oct 2020 Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang

Music classification is a task to classify a music piece into labels such as genres or composers.

Classification General Classification +1

GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

3 code implementations11 Oct 2020 Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang

In this article, we create a GiantMIDI-Piano (GP) dataset containing 38, 700, 838 transcribed notes and 10, 855 unique solo piano works composed by 2, 786 composers.

Information Retrieval Music Information Retrieval +1

High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times

3 code implementations5 Oct 2020 Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang

In addition, previous AMT systems are sensitive to the misaligned onset and offset labels of audio recordings.

Sound Audio and Speech Processing

Xiaomingbot: A Multilingual Robot News Reporter

no code implementations ACL 2020 Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yu-Ping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei LI

This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation.

News Generation Translation +1

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

no code implementations26 May 2020 Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang

Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.

Speech Enhancement Speech Synthesis

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

no code implementations19 May 2020 Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.

Review of Text Style Transfer Based on Deep Learning

no code implementations6 May 2020 Xiang-Yang Li, Guo Pu, Keyu Ming, Pu Li, Jie Wang, Yuxuan Wang

In the traditional text style transfer model, the text style is generally relied on by experts knowledge and hand-designed rules, but with the application of deep learning in the field of natural language processing, the text style transfer method based on deep learning Started to be heavily researched.

Style Transfer Text Style Transfer

Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise

no code implementations28 Apr 2020 Shan Yang, Yuxuan Wang, Lei Xie

As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model.

Clustering Data Augmentation +5

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

no code implementations23 Apr 2020 Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.

Singing Voice Synthesis

Convolutional Embedding for Edit Distance

2 code implementations31 Jan 2020 Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment.

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

no code implementations11 Nov 2019 Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang

In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.

Polyphone disambiguation Speech Synthesis +1

A hybrid text normalization system using multi-head self-attention for mandarin

no code implementations11 Nov 2019 Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma

In this paper, we propose a hybrid text normalization system using multi-head self-attention.

Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

1 code implementation IJCNLP 2019 Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu

In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages.

Dependency Parsing Language Modelling +2

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations ICLR 2019 Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Speech Synthesis

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations30 Aug 2018 Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Speech Synthesis

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

no code implementations4 Aug 2018 Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.

Speech Synthesis Text-To-Speech Synthesis

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

2 code implementations ICML 2018 RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.

Expressive Speech Synthesis

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

11 code implementations ICML 2018 Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.

Speech Synthesis Style Transfer +1

The HIT-SCIR System for End-to-End Parsing of Universal Dependencies

no code implementations CONLL 2017 Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, Ting Liu

Our system includes three pipelined components: \textit{tokenization}, \textit{Part-of-Speech} (POS) \textit{tagging} and \textit{dependency parsing}.

Dependency Parsing Information Retrieval +4

Cocktail Party Processing via Structured Prediction

no code implementations NeurIPS 2012 Yuxuan Wang, DeLiang Wang

While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison.

General Classification Speech Separation +1

Cannot find the paper you are looking for? You can Submit a new open access paper.