HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

23 Oct 2022  ·  Chunhui Wang, Chang Zeng, Jun Chen, Xing He ·

Entertainment-oriented singing voice synthesis (SVS) requires a vocoder to generate high-fidelity (e.g. 48kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize the 48kHz high-quality singing voices in real-time. Specifically, it consists of an Extended WaveNet served as a generator, a multi-period discriminator proposed in HiFiGAN, and a multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor to generate the constraint for the synthesized waveform. Additionally, an auxiliary spectrogram-phase loss is utilized to approximate the real distribution further. The experimental results show that our proposed HiFi-WaveGAN obtains 4.23 in the mean opinion score (MOS) metric for the 48kHz SVS task, significantly outperforming other neural vocoders.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods