Search Results for author: Zejun Ma

Found 63 papers, 18 papers with code

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

1 code implementation · 16 Jan 2024 · Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video.

3D Reconstruction Super-Resolution +1

Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer

no code implementations · 15 Nov 2023 · Jin Qiu, Lu Huang, Boyu Li, Jun Zhang, Lu Lu, Zejun Ma

Deep biasing for the Transducer can improve the recognition performance of rare words or contextual entities, which is essential in practical applications, especially for streaming Automatic Speech Recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

SALMONN: Towards Generic Hearing Abilities for Large Language Models

1 code implementation · 20 Oct 2023 · Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music.

Audio captioning Automatic Speech Recognition +10

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

2 code implementations · 9 Oct 2023 · Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Audio-visual large language models (LLM) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs.

Question Answering Video Question Answering

Connecting Speech Encoder and Large Language Model for ASR

no code implementations · 25 Sep 2023 · Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Q-Former-based LLMs can generalise well to out-of-domain datasets, where 12% relative WER reductions over the Whisper baseline ASR model were achieved on the Eval2000 test set without using any in-domain training data from Switchboard.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
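
A minimal sketch of a Q-Former-style connector of the kind named in the entry above: a small set of learnable query vectors cross-attends to the speech-encoder output, and the resulting embeddings are projected and fed to the LLM as soft prompts. The class name, dimensions, and single-layer structure are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    """Illustrative Q-Former-style bridge: learnable queries cross-attend to
    speech-encoder features and are projected to the LLM embedding size."""
    def __init__(self, enc_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, speech_feats):                      # (B, T, enc_dim)
        B = speech_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # broadcast queries per utterance
        out, _ = self.cross_attn(q, speech_feats, speech_feats)
        return self.proj(out)                             # (B, num_queries, llm_dim)

# The returned embeddings would be prepended to the LLM's text embeddings
# as soft prompts before decoding the transcript.
```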

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

no code implementations · 14 Jul 2023 · Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage.

In-Context Learning Language Modelling +3

GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech

no code implementations · 27 Jun 2023 · Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu, Chunfeng Wang, Yi Ren, Xiang Yin, Zejun Ma

Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that is never trained in the target language.

Disentanglement Style Generalization

Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects

1 code implementation · 14 Jun 2023 · Xinghua Qu, Hongyang Liu, Zhu Sun, Xiang Yin, Yew Soon Ong, Lu Lu, Zejun Ma

Conversational recommender systems (CRSs) have become crucial emerging research topics in the field of RSs, thanks to their natural advantages of explicitly acquiring user preferences via interactive conversations and revealing the reasons behind recommendations.

Recommendation Systems

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

no code implementations · 9 Jun 2023 · Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, Zejun Ma

In this paper, we improve the frame-level classifier for word timings in the E2E system by introducing label priors into the connectionist temporal classification (CTC) loss, adopted from prior work, and by combining low-level Mel-scale filter banks with the high-level ASR encoder output as the input feature.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
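
A minimal sketch of the label-prior idea summarized above, assuming frame-level CTC logits in PyTorch: frame posteriors are discounted by estimated label priors before the CTC loss, which counteracts peaky, blank-dominated alignments. The prior estimation and the renormalization step are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ctc_loss_with_label_priors(logits, targets, input_lens, target_lens,
                               log_priors, prior_scale=1.0):
    """Illustrative CTC loss with label priors: log-posteriors are discounted by
    log-priors (e.g., estimated as the average frame posterior over the training
    set) to discourage peaky, blank-dominated alignments."""
    log_probs = F.log_softmax(logits, dim=-1)            # (T, B, V) frame posteriors
    log_probs = log_probs - prior_scale * log_priors     # divide by priors in prob space
    log_probs = torch.log_softmax(log_probs, dim=-1)     # renormalise per frame
    return F.ctc_loss(log_probs, targets, input_lens, target_lens,
                      blank=0, zero_infinity=True)
```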

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

no code implementations · 6 Jun 2023 · Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies.

Attribute Inductive Bias +3
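
A minimal sketch of a "latent code language model" in the sense described above: an autoregressive Transformer over discrete prosody codes (e.g., indices from a VQ codebook), trained with next-code cross-entropy. The class name, sizes, and unconditional setup are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ProsodyCodeLM(nn.Module):
    """Illustrative autoregressive LM over discrete prosody codes."""
    def __init__(self, codebook_size=1024, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, codes):                                   # (B, T) int64 code ids
        T = codes.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=codes.device), diagonal=1)
        h = self.backbone(self.embed(codes), mask=causal)       # causal self-attention
        return self.head(h)                                     # next-code logits (B, T, V)
```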

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

no code implementations · 29 May 2023 · Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao

Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data.

Audio Generation Denoising +2

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

no code implementations · 27 May 2023 · Linhao Dong, Zhecheng An, Peihao Wu, Jun Zhang, Lu Lu, Zejun Ma

We also observe that the cross-modal representation extracted by CIF-PT obtains better performance than other neural interfaces on SLU tasks, including the dominant speech representation learned from self-supervised pre-training.

intent-classification Intent Classification +5

Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring

no code implementations · 19 May 2023 · Kaiqi Fu, Shaojun Gao, Shuju Shi, Xiaohai Tian, Wei Li, Zejun Ma

Specifically, we first pre-train the model using a reconstruction loss function, by masking phones and their durations jointly on a large amount of unlabeled speech and text prompts.

Self-Supervised Learning
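
A minimal sketch of the joint phone/duration masking objective described above; `model` is a hypothetical encoder returning phone logits and duration predictions, and the masking rate and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(model, phones, durations, mask_prob=0.15, mask_id=0):
    """Illustrative pre-training loss: mask phones and their durations jointly
    and reconstruct them from the remaining context."""
    mask = torch.rand_like(durations) < mask_prob              # (B, T) bool; durations are float
    phones_in = phones.masked_fill(mask, mask_id)              # replace phone ids with [MASK]
    durs_in = durations.masked_fill(mask, 0.0)                 # zero out masked durations
    phone_logits, dur_pred = model(phones_in, durs_in)         # hypothetical encoder
    phone_loss = F.cross_entropy(phone_logits[mask], phones[mask])
    dur_loss = F.l1_loss(dur_pred[mask], durations[mask])
    return phone_loss + dur_loss
```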

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

no code implementations · 1 May 2023 · Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao

Recently, neural radiance field (NeRF) has become a popular rendering technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.

motion prediction Talking Face Generation

Enhancing Large Language Model with Self-Controlled Memory Framework

1 code implementation · 26 Apr 2023 · Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, Zhoujun Li

Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information.

Book summarization Document Summarization +5

ByteCover3: Accurate Cover Song Identification on Short Queries

no code implementations · 21 Mar 2023 · Xingjian Du, Zijie Wang, Xia Liang, Huidong Liang, Bilei Zhu, Zejun Ma

Deep learning based methods have become a paradigm for cover song identification (CSI) in recent years, where the ByteCover systems have achieved state-of-the-art results on all the mainstream datasets of CSI.

Cover song identification Retrieval

LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion

no code implementations · 2 Mar 2023 · Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma

As a key component of automatic speech recognition (ASR) and the front-end of text-to-speech (TTS), grapheme-to-phoneme (G2P) conversion plays the role of converting letters into their corresponding pronunciations.

speech-recognition Speech Recognition

Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

no code implementations · 21 Feb 2023 · Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li, Zejun Ma, Tan Lee

Recent studies on pronunciation scoring have explored the effect of introducing phone embeddings as reference pronunciation, but mostly in an implicit manner, i.e., addition or concatenation of the reference phone embedding and the actual pronunciation of the target phone as the phone-level pronunciation quality representation.
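
A minimal sketch of an explicit phone-level linguistic-acoustic similarity feature of the kind the title refers to, in contrast to the implicit addition/concatenation described above; the pooling statistics and any downstream scoring head are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def phone_similarity_features(ling_emb, acou_emb):
    """Illustrative explicit similarity: cosine similarity between each reference
    (linguistic) phone embedding and its acoustic realisation, pooled into
    utterance-level statistics for a pronunciation-scoring head."""
    sim = F.cosine_similarity(ling_emb, acou_emb, dim=-1)                 # (B, N_phones)
    return torch.stack([sim.mean(-1), sim.min(-1).values, sim.std(-1)], dim=-1)  # (B, 3)
```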

An ASR-free Fluency Scoring Approach with Self-Supervised Learning

no code implementations · 20 Feb 2023 · Wei Liu, Kaiqi Fu, Xiaohai Tian, Shuju Shi, Wei Li, Zejun Ma, Tan Lee

A typical fluency scoring system generally relies on an automatic speech recognition (ASR) system to obtain time stamps in input speech for either the subsequent calculation of fluency-related features or directly modeling speech fluency with an end-to-end approach.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Virtual Try-On with Pose-Garment Keypoints Guided Inpainting

1 code implementation · ICCV 2023 · Zhi Li, Pengfei Wei, Xiang Yin, Zejun Ma, Alex C. Kot

In our method, human pose and garment keypoints are extracted from source images and constructed as graphs to predict the garment keypoints at the target pose.

Virtual Try-on

Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features

no code implementations · 12 Dec 2022 · Junhui Zhang, Junjie Pan, Xiang Yin, Zejun Ma

Speech-to-speech translation directly translates a speech utterance in one language into another, and has great potential in tasks such as simultaneous interpretation.

Speech-to-Speech Translation Translation

BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance

1 code implementation · 13 Nov 2022 · Haotong Qin, Xudong Ma, Yifu Ding, Xiaoyang Li, Yang Zhang, Zejun Ma, Jiakai Wang, Jie Luo, Xianglong Liu

We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage saving on edge hardware.

Binarization Keyword Spotting

Graph Contrastive Learning with Implicit Augmentations

1 code implementation · 7 Nov 2022 · Huidong Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Ke Chen, Junbin Gao

Existing graph contrastive learning methods rely on augmentation techniques based on random perturbations (e.g., randomly adding or dropping edges and nodes).

Contrastive Learning Graph Classification +1
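
For reference, a minimal sketch of the kind of explicit random perturbation the entry above contrasts with (random edge dropping); the function name and drop rate are illustrative.

```python
import torch

def drop_edges(edge_index, drop_prob=0.2):
    """Random edge-dropping augmentation: keep each edge of a (2, E) COO
    connectivity tensor independently with probability 1 - drop_prob."""
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= drop_prob
    return edge_index[:, keep]

# In a typical graph contrastive setup, two such perturbed views of the same
# graph are encoded and pulled together by an InfoNCE-style loss.
```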

Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation

no code implementations · 2 Nov 2022 · Rao Ma, Xiaobo Wu, Jin Qiu, Yanan Qin, HaiHua Xu, Peihao Wu, Zejun Ma

The proposed method achieves significantly better performance on the target test sets while incurring minimal performance degradation on the general test set, compared with both shallow and ILME-based LM fusion methods.

Domain Adaptation Language Modelling

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

no code implementations · 28 Oct 2022 · Yist Y. Lin, Tao Han, HaiHua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance can be compromised if train-test utterance lengths are mismatched.

Action Detection Activity Detection +4

Importance Prioritized Policy Distillation

1 code implementation · KDD 2022 · Xinghua Qu, Yew-Soon Ong, Abhishek Gupta, Pengfei Wei, Zhu Sun, Zejun Ma

Given such an issue, we denote the frame importance as its contribution to the expected reward on a particular frame, and hypothesize that adapting such frame importance could benefit the performance of the distilled student policy.

Atari Games Decision Making +1
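
A minimal sketch of frame-importance-weighted policy distillation in the spirit of the entry above, assuming per-frame teacher/student action logits and a hypothetical `frame_importance` estimate; the weighting scheme is illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prioritized_distillation_loss(student_logits, teacher_logits,
                                  frame_importance, temp=1.0):
    """Illustrative frame-weighted distillation: per-frame KL between teacher and
    student action distributions, weighted by each frame's estimated importance
    (e.g., its contribution to the expected reward)."""
    t = F.log_softmax(teacher_logits / temp, dim=-1)
    s = F.log_softmax(student_logits / temp, dim=-1)
    kl = F.kl_div(s, t, reduction="none", log_target=True).sum(-1)   # (B,) per frame
    w = frame_importance / (frame_importance.sum() + 1e-8)           # normalised weights
    return (w * kl).sum()
```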

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation

no code implementations · 10 Jun 2022 · Junhui Zhang, Wudi Bao, Junjie Pan, Xiang Yin, Zejun Ma

In this paper, we propose a novel Chinese dialect TTS frontend with a translation module, which converts Mandarin text into dialectic expressions to improve the intelligibility and naturalness of synthesized speech.

Machine Translation Translation

Improving Contextual Representation with Gloss Regularized Pre-training

no code implementations · Findings (NAACL) 2022 · Yu Lin, Zhecheng An, Peihao Wu, Zejun Ma

To tackle this issue, we propose an auxiliary gloss regularizer module for BERT pre-training (GR-BERT) to enhance word semantic similarity.

Semantic Similarity Semantic Textual Similarity +3

Language Adaptive Cross-lingual Speech Representation Learning with Sparse Sharing Sub-networks

no code implementations · 9 Mar 2022 · Yizhou Lu, Mingkun Huang, Xinghua Qu, Pengfei Wei, Zejun Ma

It makes room for language-specific modeling by pruning out unimportant parameters for each language, without requiring any manually designed language-specific components.

Representation Learning speech-recognition +1

S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification

1 code implementation · 21 Feb 2022 · Hang Zhao, Chen Zhang, Bilei Zhu, Zejun Ma, Kejun Zhang

To our knowledge, S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification.

Classification Data Augmentation +5

BiFSMN: Binary Neural Network for Keyword Spotting

1 code implementation · 14 Feb 2022 · Haotong Qin, Xudong Ma, Yifu Ding, Xiaoyang Li, Yang Zhang, Yao Tian, Zejun Ma, Jie Luo, Xianglong Liu

Then, to allow the instant and adaptive accuracy-efficiency trade-offs at runtime, we also propose a Thinnable Binarization Architecture to further liberate the acceleration potential of the binarized network from the topology perspective.

Binarization Keyword Spotting
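
A minimal sketch of 1-bit weight binarization with a straight-through estimator, the basic operation behind binary KWS networks such as the one above; the XNOR-style scaling factor is an illustrative choice, not necessarily the paper's.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Illustrative sign binarization with a straight-through estimator."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return w.sign() * w.abs().mean()           # 1-bit weights plus a scaling factor

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # pass gradients only inside [-1, 1]

binarize = BinarizeSTE.apply                        # usage: w_bin = binarize(layer.weight)
```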

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

1 code implementation · 2 Feb 2022 · Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time.

Audio Classification Event Detection +3

Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection

1 code implementation · 30 Jan 2022 · Minglun Han, Linhao Dong, Zhenlin Liang, Meng Cai, Shiyu Zhou, Zejun Ma, Bo Xu

Nowadays, most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge.

speech-recognition Speech Recognition

Towards Realistic Visual Dubbing with Heterogeneous Sources

no code implementations · 17 Jan 2022 · Tianyi Xie, Liucheng Liao, Cheng Bi, Benlai Tang, Xiang Yin, Jianfei Yang, Mingjie Wang, Jiali Yao, Yang Zhang, Zejun Ma

The task of few-shot visual dubbing focuses on synchronizing the lip movements with arbitrary speech input for any talking head video.

Disentanglement Talking Head Generation

Towards Using Clothes Style Transfer for Scenario-aware Person Video Generation

1 code implementation · 14 Oct 2021 · Jingning Xu, Benlai Tang, Mingjie Wang, Siyuan Bian, Wenyi Guo, Xiang Yin, Zejun Ma

To tackle this problem, most recent AdaIN-based architectures have been proposed to extract clothes and scenario features for generation.

Style Transfer Video Generation

Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding

no code implementations · 10 Oct 2021 · Chao Wang, Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Yibiao Yu, Zejun Ma

Experiments show that, compared with the baseline models, our proposed model can significantly improve the naturalness of converted singing voices and the similarity with the target singer.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Improving Pseudo-label Training For End-to-end Speech Recognition Using Gradient Mask

no code implementations · 8 Oct 2021 · Shaoshi Ling, Chen Shen, Meng Cai, Zejun Ma

In the recent trend of semi-supervised speech recognition, both self-supervised representation learning and pseudo-labeling have shown promising results.

Pseudo Label Representation Learning +2

Synthesising Audio Adversarial Examples for Automatic Speech Recognition

no code implementations · 29 Sep 2021 · Xinghua Qu, Pengfei Wei, Mingyong Gao, Zhu Sun, Yew-Soon Ong, Zejun Ma

Adversarial examples in automatic speech recognition (ASR) sound natural to humans yet are capable of fooling well-trained ASR models into transcribing incorrectly.

Audio Synthesis Automatic Speech Recognition +2

HMM-Free Encoder Pre-Training for Streaming RNN Transducer

no code implementations · 2 Apr 2021 · Lu Huang, Jingyu Sun, Yufeng Tang, JunFeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma

This work describes an encoder pre-training procedure using frame-wise labels to improve the training of the streaming recurrent neural network transducer (RNN-T) model.

Speech Recognition

An Improved Transfer Model: Randomized Transferable Machine

no code implementations · 27 Nov 2020 · Pengfei Wei, Xinghua Qu, Yew Soon Ong, Zejun Ma

Existing studies usually assume that the learned new feature representation is domain-invariant, and thus train a transfer model M on the source domain.

Transfer Learning

Dynamic latency speech recognition with asynchronous revision

no code implementations · 3 Nov 2020 · Mingkun Huang, Meng Cai, Jun Zhang, Yang Zhang, Yongbin You, Yi He, Zejun Ma

In this work we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models.

speech-recognition Speech Recognition

Improving RNN transducer with normalized jointer network

no code implementations · 3 Nov 2020 · Mingkun Huang, Jun Zhang, Meng Cai, Yang Zhang, Jiali Yao, Yongbin You, Yi He, Zejun Ma

In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new normalized jointer network to overcome it.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

PPG-based singing voice conversion with adversarial representation learning

no code implementations · 28 Oct 2020 · Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen, Zejun Ma

Singing voice conversion (SVC) aims to convert the voice of one singer to that of other singers while keeping the singing content and melody.

Representation Learning Voice Conversion +1

ByteCover: Cover Song Identification via Multi-Loss Training

1 code implementation · 27 Oct 2020 · Xingjian Du, Zhesong Yu, Bilei Zhu, Xiaoou Chen, Zejun Ma

In this paper, we present ByteCover, a new feature learning method for cover song identification (CSI).

Cover song identification

Rule-embedded network for audio-visual voice activity detection in live musical video streams

1 code implementation · 27 Oct 2020 · Yuanbo Hou, Yi Deng, Bilei Zhu, Zejun Ma, Dick Botteldooren

Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing.

Sound Multimedia Audio and Speech Processing

Contrastive Unsupervised Learning for Audio Fingerprinting

no code implementations · 26 Oct 2020 · Zhesong Yu, Xingjian Du, Bilei Zhu, Zejun Ma

The rise of video-sharing platforms has attracted more and more people to shoot videos and upload them to the Internet.

Contrastive Learning

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

no code implementations · 19 May 2020 · Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

no code implementations · 23 Apr 2020 · Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration allocated Tacotron-like acoustic models and WaveRNN neural vocoders.

Singing Voice Synthesis

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

no code implementations · 11 Nov 2019 · Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang

In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech.

Polyphone disambiguation Speech Synthesis +1

Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model

no code implementations · 17 May 2017 · Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei

The system that combines frame retaining with frame stacking reduces the time consumption of both training and decoding.

General Classification
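
A minimal sketch of frame stacking with subsampling, one common way such schemes shorten the input sequence; the exact stacking/retaining scheme in the paper may differ, and the rates here are illustrative.

```python
import torch

def stack_and_subsample(feats, stack=3, retain_every=2):
    """Illustrative frame stacking: concatenate `stack` consecutive acoustic
    frames into one super-frame, then keep every `retain_every`-th super-frame,
    reducing the number of steps the recurrent acoustic model processes."""
    B, T, D = feats.shape
    T = (T // stack) * stack                                  # drop the ragged tail
    stacked = feats[:, :T].reshape(B, T // stack, D * stack)  # (B, T/stack, D*stack)
    return stacked[:, ::retain_every]
```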

Deep LSTM for Large Vocabulary Continuous Speech Recognition

no code implementations · 21 Mar 2017 · Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei, Peihao Wu, Wenchang Situ, Shuai Li, Yang Zhang

In this competitive framework, LSTM models with more than 7 layers are successfully trained on Mandarin Shenma voice search data, and they outperform deep LSTM models trained with the conventional approach.

speech-recognition Speech Recognition +1

Exponential Moving Average Model in Parallel Speech Recognition Training

no code implementations · 3 Mar 2017 · Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei

As training data grows rapidly, large-scale parallel training on multi-GPU clusters is now widely applied to neural network learning. We present a new approach that applies the exponential moving average method to large-scale parallel training of neural network models.

speech-recognition Speech Recognition
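
A minimal sketch of the exponential moving average idea applied to model parameters; the decay value and the single-process form are illustrative (in parallel training, the update would follow each synchronized optimizer step).

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Illustrative EMA of parameters: the EMA copy tracks a smoothed trajectory
    of the live weights and is typically the model that gets evaluated."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# usage: ema_model = copy.deepcopy(model); call ema_update(ema_model, model)
# after every optimizer step.
```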
