Search Results for author: Yuxuan Wang

Found 76 papers, 30 papers with code

Tacotron: Towards End-to-End Speech Synthesis

29 code implementations • 29 Mar 2017 • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Ranked #5 on Speech Synthesis on North American English

Audio Synthesis Speech Synthesis +1

50,652

Paper
Code

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

30 code implementations • 16 Dec 2017 • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.

Ranked #2 on Speech Synthesis on North American English

Speech Synthesis

28,968

Paper
Code

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

11 code implementations • ICML 2018 • Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.

Speech Synthesis Style Transfer +1

10,084

Paper
Code

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

1 code implementation • 10 Aug 2023 • Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model.

Ranked #3 on Audio Generation on AudioCaps

Audio Generation In-Context Learning +2

2,032

Paper
Code

GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

3 code implementations • 11 Oct 2020 • Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang

In this article, we create a GiantMIDI-Piano (GP) dataset containing 38, 700, 838 transcribed notes and 10, 855 unique solo piano works composed by 2, 786 composers.

Information Retrieval Music Information Retrieval +1

1,609

Paper
Code

High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times

3 code implementations • 5 Oct 2020 • Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang

In addition, previous AMT systems are sensitive to the misaligned onset and offset labels of audio recordings.

Sound Audio and Speech Processing

1,527

Paper
Code

Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation

1 code implementation • CONLL 2018 • Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, Ting Liu

This paper describes our system (HIT-SCIR) submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies.

Ranked #3 on Dependency Parsing on Universal Dependencies

Dependency Parsing Word Embeddings

1,453

Paper
Code

Separate Anything You Describe

1 code implementation • 9 Aug 2023 • Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries.

Audio Source Separation Natural Language Queries +2

1,425

Paper
Code

VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

1 code implementation • 12 Apr 2022 • Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang

Speech restoration aims to remove distortions in speech signals.

Speech Denoising Speech Enhancement +1

900

Paper
Code

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

2 code implementations • ICML 2018 • RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody.

Expressive Speech Synthesis

368

Paper
Code

Convolutional Embedding for Edit Distance

2 code implementations • 31 Jan 2020 • Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment.

139

Paper
Code

SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation

1 code implementation • CVPR 2022 • Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc van Gool, Bernt Schiele, Federico Tombari, Fisher Yu

Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous driving systems.

Autonomous Driving Domain Adaptation

Paper
Code

Trainable Frontend For Robust and Far-Field Keyword Spotting

2 code implementations • 19 Jul 2016 • Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous

Robust and far-field speech recognition is critical to enable true hands-free communication.

Keyword Spotting speech-recognition +1

Paper
Code

Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

1 code implementation • IJCNLP 2019 • Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu

In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages.

Dependency Parsing Language Modelling +2

Paper
Code

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

1 code implementation • 24 Aug 2022 • Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou

However, CL on VQA involves not only the expansion of label sets (new Answer sets).

Continual Learning Question Answering +1

Paper
Code

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

2 code implementations • 30 Nov 2021 • Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou

In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR).

Question Answering Retrieval +2

Paper
Code

HawkEye: Training Video-Text LLMs for Grounding Text in Videos

1 code implementation • 15 Mar 2024 • Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao

Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos.

Ranked #3 on Video Question Answering on MVBench

Video Grounding Video Question Answering

Paper
Code

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

1 code implementation • 1 Apr 2022 • Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou

In this paper, we introduce a new dataset called Kinetic-GEB+.

Ranked #1 on Boundary Captioning on Kinetics-GEB+

Boundary Captioning Boundary Grounding +2

Paper
Code

LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding

1 code implementation • 25 Feb 2024 • Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Zilong Zheng

Despite progress in video-language modeling, the computational challenge of interpreting long-form videos in response to task-specific linguistic queries persists, largely due to the complexity of high-dimensional video data and the misalignment between language and visual cues over space and time.

Computational Efficiency Language Modelling +3

Paper
Code

Hierarchical Generative Modeling for Controllable Speech Synthesis

2 code implementations • ICLR 2019 • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Attribute Speech Synthesis

Paper
Code

Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models

1 code implementation • 22 Aug 2023 • Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu, Pingchuan Tian, Yuping Wang, Yuxuan Wang

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).

Paper
Code

VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions

1 code implementation • 30 May 2023 • Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, Dongyan Zhao

Video-grounded dialogue understanding is a challenging problem that requires machine to perceive, parse and reason over situated semantics extracted from weakly aligned video and dialogues.

Dialogue Generation Dialogue Understanding +2

Paper
Code

Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency

1 code implementation • 5 Jun 2023 • Yuxuan Wang, Hong Lyu

The information retrieval community has made significant progress in improving the efficiency of Dual Encoder (DE) dense passage retrieval systems, making them suitable for latency-sensitive settings.

Passage Retrieval Retrieval

Paper
Code

Deep Superpixel-based Network for Blind Image Quality Assessment

1 code implementation • 13 Oct 2021 • Guangyi Yang, Yang Zhan., Yuxuan Wang

In order to fill this gap, we propose a deep adaptive superpixel-based network, namely DSN-IQA, to assess the quality of image based on multi-scale and superpixel segmentation.

Blind Image Quality Assessment

Paper
Code

A Closer Look into the Robustness of Neural Dependency Parsers Using Better Adversarial Examples

1 code implementation • Findings (ACL) 2021 • Yuxuan Wang, Wanxiang Che, Ivan Titov, Shay B. Cohen, Zhilin Lei, Ting Liu

Paper
Code

Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

1 code implementation • 30 May 2023 • Yuxuan Wang, Jianghui Wang, Dongyan Zhao, Zilong Zheng

We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.

Contrastive Learning

Paper
Code

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

1 code implementation • 8 Jan 2024 • Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao

However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos.

Question Answering Video Question Answering

Paper
Code

Network-Level Adversaries in Federated Learning

1 code implementation • 27 Aug 2022 • Giorgio Severi, Matthew Jagielski, Gökberk Yar, Yuxuan Wang, Alina Oprea, Cristina Nita-Rotaru

Federated learning is a popular strategy for training models on distributed, sensitive data, while preserving data privacy.

Federated Learning

Paper
Code

Simple and Effective Graph-to-Graph Annotation Conversion

1 code implementation • COLING 2022 • Yuxuan Wang, Zhilin Lei, Yuqiu Ji, Wanxiang Che

Annotation conversion is an effective way to construct datasets under new annotation guidelines based on existing datasets with little human labour.

Paper
Code

Uncovering Latent Style Factors for Expressive Speech Synthesis

no code implementations • 1 Nov 2017 • Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

Prosodic modeling is a core problem in speech synthesis.

Expressive Speech Synthesis

Paper
Add Code

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

no code implementations • 4 Aug 2018 • Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style.

Speech Synthesis Text-To-Speech Synthesis

Paper
Add Code

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis

no code implementations • 30 Aug 2018 • Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

We demonstrate that the proposed framework enables Tacotron to generate intelligible speech using less than half an hour of paired training data.

Speech Synthesis

Paper
Add Code

The HIT-SCIR System for End-to-End Parsing of Universal Dependencies

no code implementations • CONLL 2017 • Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, Ting Liu

Our system includes three pipelined components: \textit{tokenization}, \textit{Part-of-Speech} (POS) \textit{tagging} and \textit{dependency parsing}.

Dependency Parsing Information Retrieval +4

Paper
Add Code

Cocktail Party Processing via Structured Prediction

no code implementations • NeurIPS 2012 • Yuxuan Wang, DeLiang Wang

While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison.

General Classification Speech Separation +1

Paper
Add Code

HIT-SCIR at MRP 2019: A Unified Pipeline for Meaning Representation Parsing via Efficient Training and Effective Encoding

no code implementations • CONLL 2019 • Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang, Yijia Liu, Ting Liu

This paper describes our system (HIT-SCIR) for CoNLL 2019 shared task: Cross-Framework Meaning Representation Parsing.

Ranked #1 on UCCA Parsing on CoNLL 2019

UCCA Parsing Word Embeddings

Paper
Add Code

A hybrid text normalization system using multi-head self-attention for mandarin

no code implementations • 11 Nov 2019 • Junhui Zhang, Junjie Pan, Xiang Yin, Chen Li, Shichao Liu, Yang Zhang, Yuxuan Wang, Zejun Ma

In this paper, we propose a hybrid text normalization system using multi-head self-attention.

Sentence

Paper
Add Code

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

no code implementations • 11 Nov 2019 • Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang

In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.

Polyphone disambiguation Speech Synthesis +1

Paper
Add Code

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

no code implementations • 23 Apr 2020 • Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.

Singing Voice Synthesis

Paper
Add Code

Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise

no code implementations • 28 Apr 2020 • Shan Yang, Yuxuan Wang, Lei Xie

As for the speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which does not need an extra speech enhancement model.

Clustering Data Augmentation +5

Paper
Add Code

Review of Text Style Transfer Based on Deep Learning

no code implementations • 6 May 2020 • Xiang-Yang Li, Guo Pu, Keyu Ming, Pu Li, Jie Wang, Yuxuan Wang

In the traditional text style transfer model, the text style is generally relied on by experts knowledge and hand-designed rules, but with the application of deep learning in the field of natural language processing, the text style transfer method based on deep learning Started to be heavily researched.

Style Transfer Text Style Transfer

Paper
Add Code

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

no code implementations • 19 May 2020 • Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.

Paper
Add Code

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

no code implementations • 26 May 2020 • Dongyang Dai, Li Chen, Yu-Ping Wang, Mu Wang, Rui Xia, Xuchen Song, Zhiyong Wu, Yuxuan Wang

Firstly, the speech synthesis model is pre-trained with both multi-speaker clean data and noisy augmented data; then the pre-trained model is adapted on noisy low-resource new speaker data; finally, by setting the clean speech condition, the model can synthesize the new speaker's clean voice.

Speech Enhancement Speech Synthesis

Paper
Add Code

Xiaomingbot: A Multilingual Robot News Reporter

no code implementations • ACL 2020 • Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yu-Ping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei LI

This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading and avatar animation.

News Generation Translation +1

Paper
Add Code

Large-Scale MIDI-based Composer Classification

no code implementations • 28 Oct 2020 • Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang

Music classification is a task to classify a music piece into labels such as genres or composers.

Classification General Classification +1

Paper
Add Code

Listen, Read, and Identify: Multimodal Singing Language Identification of Music

no code implementations • 2 Mar 2021 • Keunwoo Choi, Yuxuan Wang

Optionally, LRID-Net is facilitated with modality dropouts to handle a missing modality.

Language Identification

Paper
Add Code

USTC-NELSLIP System Description for DIHARD-III Challenge

no code implementations • 19 Mar 2021 • Yuxuan Wang, Maokui He, Shutong Niu, Lei Sun, Tian Gao, Xin Fang, Jia Pan, Jun Du, Chin-Hui Lee

This system description describes our submission system to the Third DIHARD Speech Diarization Challenge.

Action Detection Activity Detection +3

Paper
Add Code

Modeling the Compatibility of Stem Tracks to Generate Music Mashups

no code implementations • 26 Mar 2021 • Jiawen Huang, Ju-Chiang Wang, Jordan B. L. Smith, Xuchen Song, Yuxuan Wang

A music mashup combines audio elements from two or more songs to create a new work.

Paper
Add Code

Supervised Chorus Detection for Popular Music Using Convolutional Neural Network and Multi-task Learning

no code implementations • 26 Mar 2021 • Ju-Chiang Wang, Jordan B. L. Smith, Jitong Chen, Xuchen Song, Yuxuan Wang

This paper presents a novel supervised approach to detecting the chorus segments in popular music.

Multi-Task Learning

Paper
Add Code

VeniBot: Towards Autonomous Venipuncture with Automatic Puncture Area and Angle Regression from NIR Images

no code implementations • 27 May 2021 • Xu Cao, Zijie Chen, Bolin Lai, Yuxuan Wang, Yu Chen, Zhengqing Cao, Zhilin Yang, Nanyang Ye, Junbo Zhao, Xiao-Yun Zhou, Peng Qi

For the automation, we focus on the positioning part and propose a Dual-In-Dual-Out network based on two-step learning and two-task learning, which can achieve fully automatic regression of the suitable puncture area and angle from near-infrared(NIR) images.

Navigate regression

Paper
Add Code

VeniBot: Towards Autonomous Venipuncture with Semi-supervised Vein Segmentation from Ultrasound Images

no code implementations • 27 May 2021 • Yu Chen, Yuxuan Wang, Bolin Lai, Zijie Chen, Xu Cao, Nanyang Ye, Zhongyuan Ren, Junbo Zhao, Xiao-Yun Zhou, Peng Qi

In the modern medical care, venipuncture is an indispensable procedure for both diagnosis and treatment.

Segmentation

Paper
Add Code

Audiovisual Singing Voice Separation

no code implementations • 1 Jul 2021 • Bochen Li, Yuxuan Wang, Zhiyao Duan

Separating a song into vocal and accompaniment components is an active research topic, and recent years witnessed an increased performance from supervised training using deep learning techniques.

Paper
Add Code

Cloning one's voice using very limited data in the wild

no code implementations • 7 Oct 2021 • Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang

(2) How to clone a person's voice while controlling the style and prosody.

Speech Synthesis

Paper
Add Code

Neural Dubber: Dubbing for Videos According to Scripts

no code implementations • NeurIPS 2021 • Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech.

Paper
Add Code

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

no code implementations • 10 Feb 2022 • Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.

Action Detection Activity Detection +2

Paper
Add Code

How Many Grid-Forming Converters do We Need? A Perspective From Small Signal Stability and Power Grid Strength

no code implementations • 21 Sep 2022 • Huanhai Xin, Yuxuan Wang, Xia Chen, Eduardo Prieto-Araujo, Linbin Huang

Based on our analysis, we further study the problem of how to configure GFM converters in the grid and how many GFM converters we will need.

Paper
Add Code

Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation

no code implementations • 22 Oct 2022 • Xueliang Zhao, Yuxuan Wang, Chongyang Tao, Chenshuo Wang, Dongyan Zhao

We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.

Dialogue Generation Multi-agent Reinforcement Learning

Paper
Add Code

Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

no code implementations • 27 Oct 2022 • Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang

In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

Interactive Context-Aware Network for RGB-T Salient Object Detection

no code implementations • 11 Nov 2022 • Yuxuan Wang, Feng Dong, Jinchao Zhu

However, most related works are based on RGB images, which lose massive useful information.

object-detection Object Detection +1

Paper
Add Code

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

no code implementations • 12 Dec 2022 • Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuanzhe Chen, Mingbo Ma, Yuping Wang, Yuxuan Wang

The goal of accent conversion (AC) is to convert the accent of speech into the target accent while preserving the content and speaker identity.

Data Augmentation Disentanglement

Paper
Add Code

Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition

no code implementations • 30 Dec 2022 • Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Paper
Add Code

A unified front-end framework for English text-to-speech synthesis

no code implementations • 18 May 2023 • Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, YuanYuan Huo, Yuxuan Wang

The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes.

Speech Synthesis Text-To-Speech Synthesis

Paper
Add Code

Language-universal phonetic encoder for low-resource speech recognition

no code implementations • 19 May 2023 • Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.

speech-recognition Speech Recognition

Paper
Add Code

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

no code implementations • 19 May 2023 • Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Moreover, on 3 of the 4 languages, comparing to the standard HuBERT, the approach performs better, meanwhile is able to save supervised training data by 1. 5k hours (75%) at most.

Self-Supervised Learning speech-recognition +1

Paper
Add Code

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

no code implementations • 4 Jun 2023 • Jianghui Wang, Yuxuan Wang, Dongyan Zhao, Zilong Zheng

We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding.

Benchmarking Contrastive Learning +1

Paper
Add Code

PolyVoice: Language Models for Speech to Speech Translation

no code implementations • 5 Jun 2023 • Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.

Language Modelling Speech Synthesis +2

Paper
Add Code

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models

no code implementations • 28 Aug 2023 • Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song

We tested InstructME in instrument-editing, remixing, and multi-round editing.

Paper
Add Code

Teaching Text-to-Image Models to Communicate in Dialog

no code implementations • 27 Sep 2023 • Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao

In this paper, we focus on the innovative dialog-to-image generation task, where the model synthesizes a high-resolution image aligned with the given dialog context as a response.

Sentence Text-to-Image Generation

Paper
Add Code

LLaMA Rider: Spurring Large Language Models to Explore the Open World

no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu

Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, and try to align the LLMs' knowledge with the world conditions.

Decision Making

Paper
Add Code

Audio Prompt Tuning for Universal Sound Separation

1 code implementation • 30 Nov 2023 • Yuzhuo Liu, Xubo Liu, Yan Zhao, Yuanyuan Wang, Rui Xia, Pingchuan Tain, Yuxuan Wang

Specifically, APT improves the separation performance of specific sources through training a small number of prompt parameters with limited audio samples, while maintaining the generalization of the USS model by keeping its parameters frozen.

Paper
Code

A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization

no code implementations • 24 Dec 2023 • Jinchao Zhu, Yuxuan Wang, Xiaobing Tu, Siyuan Pan, Pengfei Wan, Gao Huang

The Stable Diffusion Model (SDM) is a popular and efficient text-to-image (t2i) generation and image-to-image (i2i) generation model.

Quantization

Paper
Add Code

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

no code implementations • 19 Jan 2024 • Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Zhuo Chen, Lei Xie, Yuping Wang, Yuxuan Wang

Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression which eliminates the dependence on complete source speech.

Language Modelling Voice Conversion

Paper
Add Code

TimeSiam: A Pre-Training Framework for Siamese Time-Series Modeling

no code implementations • 4 Feb 2024 • Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Yunzhong Qiu, Li Zhang, Jianmin Wang, Mingsheng Long

To emphasize temporal correlation modeling, this paper proposes TimeSiam as a simple but effective self-supervised pre-training framework for Time series based on Siamese networks.

Contrastive Learning Data Augmentation +1

Paper
Add Code

TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables

no code implementations • 29 Feb 2024 • Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong liu, Yunzhong Qiu, Haoran Zhang, Jianmin Wang, Mingsheng Long

Experimentally, TimeXer significantly improves time series forecasting with exogenous variables and achieves consistent state-of-the-art performance in twelve real-world forecasting benchmarks.

Time Series Time Series Forecasting

Paper
Add Code

View-Consistent 3D Editing with Gaussian Splatting

no code implementations • 18 Mar 2024 • Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang

The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations.

Paper
Add Code

Improving Scene Graph Generation with Relation Words' Debiasing in Vision-Language Models

no code implementations • 24 Mar 2024 • Yuxuan Wang, Xiaoyuan Liu

After that, we ensemble VLMs with SGG models to enhance representation.

Graph Generation Relation +1

Paper
Add Code

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

no code implementations • 10 Apr 2024 • Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre.

Attribute

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.