GEB+: A benchmark for generic event boundary captioning, grounding and text-based retrieval

no code implementations 1 Apr 2022 Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou

Cognitive science has shown that humans perceive videos in terms of events separated by state changes of dominant subjects.

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

no code implementations 10 Feb 2022 Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.

Action Detection Activity Detection +1

Neural Dubber: Dubbing for Videos According to Scripts

no code implementations NeurIPS 2021 Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech.

Deep Superpixel-based Network for Blind Image Quality Assessment

1 code implementation 13 Oct 2021 Guangyi Yang, Yang Zhan, Yuxuan Wang

To fill this gap, we propose a deep adaptive superpixel-based network, namely DSN-IQA, to assess image quality based on multi-scale and superpixel segmentation.

Blind Image Quality Assessment

Audiovisual Singing Voice Separation

no code implementations 1 Jul 2021 Bochen Li, Yuxuan Wang, Zhiyao Duan

Separating a song into vocal and accompaniment components is an active research topic, and recent years have witnessed improved performance from supervised training using deep learning techniques.

VeniBot: Towards Autonomous Venipuncture with Automatic Puncture Area and Angle Regression from NIR Images

no code implementations 27 May 2021 Xu Cao, Zijie Chen, Bolin Lai, Yuxuan Wang, Yu Chen, Zhengqing Cao, Zhilin Yang, Nanyang Ye, Junbo Zhao, Xiao-Yun Zhou, Peng Qi

For the automation, we focus on the positioning part and propose a Dual-In-Dual-Out network based on two-step learning and two-task learning, which can achieve fully automatic regression of the suitable puncture area and angle from near-infrared (NIR) images.

Modeling the Compatibility of Stem Tracks to Generate Music Mashups

no code implementations 26 Mar 2021 Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang

A music mashup combines audio elements from two or more songs to create a new work.

Listen, Read, and Identify: Multimodal Singing Language Identification of Music

no code implementations 2 Mar 2021 Keunwoo Choi, Yuxuan Wang

Optionally, LRID-Net can be trained with modality dropouts to handle a missing modality.

Language Identification

Large-Scale MIDI-based Composer Classification

no code implementations 28 Oct 2020 Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang

Music classification is a task to classify a music piece into labels such as genres or composers.

Classification General Classification +1

GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

4 code implementations 11 Oct 2020 Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang

In this article, we create a GiantMIDI-Piano (GP) dataset containing 38,700,838 transcribed notes and 10,855 unique solo piano works composed by 2,786 composers.

Information Retrieval Music Information Retrieval

High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times

4 code implementations 5 Oct 2020 Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang

In addition, previous AMT systems are sensitive to the misaligned onset and offset labels of audio recordings.

Sound Audio and Speech Processing

Xiaomingbot: A Multilingual Robot News Reporter

no code implementations ACL 2020 Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yu-Ping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei Li

This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation.

News Generation Translation

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

no code implementations 26 May 2020 Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang

Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.

Speech Enhancement Speech Synthesis

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

no code implementations 19 May 2020 Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.

Review of Text Style Transfer Based on Deep Learning

no code implementations 6 May 2020 Xiang-Yang Li, Guo Pu, Keyu Ming, Pu Li, Jie Wang, Yuxuan Wang

In traditional text style transfer models, the text style generally relies on expert knowledge and hand-designed rules, but with the application of deep learning in the field of natural language processing, text style transfer methods based on deep learning have started to be heavily researched.

Natural Language Processing Style Transfer +1

Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise

no code implementations 28 Apr 2020 Shan Yang, Yuxuan Wang, Lei Xie

As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model.

Data Augmentation Denoising +3

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

no code implementations 23 Apr 2020 Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.

Convolutional Embedding for Edit Distance

2 code implementations 31 Jan 2020 Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment.

A hybrid text normalization system using multi-head self-attention for mandarin

no code implementations 11 Nov 2019 Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma

In this paper, we propose a hybrid text normalization system using multi-head self-attention.

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

no code implementations 11 Nov 2019 Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang

In a Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.

Polyphone disambiguation Speech Synthesis +1

Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

1 code implementation IJCNLP 2019 Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu

In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages.

Dependency Parsing Language Modelling +2

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations ICLR 2019 Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Speech Synthesis

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations 30 Aug 2018 Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Speech Synthesis

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

no code implementations 4 Aug 2018 Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.

Speech Synthesis Text-To-Speech Synthesis

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

1 code implementation ICML 2018 RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.

Expressive Speech Synthesis

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

10 code implementations ICML 2018 Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.

Speech Synthesis Style Transfer +1

The HIT-SCIR System for End-to-End Parsing of Universal Dependencies

no code implementations CONLL 2017 Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, Ting Liu

Our system includes three pipelined components: tokenization, Part-of-Speech (POS) tagging and dependency parsing.

Dependency Parsing Information Retrieval +3

Cocktail Party Processing via Structured Prediction

no code implementations NeurIPS 2012 Yuxuan Wang, DeLiang Wang

While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison.

General Classification Speech Separation +1

