HiFi-GAN is a generative adversarial network for speech synthesis. It consists of one generator and two discriminators: a multi-scale discriminator and a multi-period discriminator. The generator and discriminators are trained adversarially, along with two additional losses (a mel-spectrogram loss and a feature-matching loss) that improve training stability and model performance.
The generator is a fully convolutional neural network. It uses a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every transposed convolution is followed by a multi-receptive field fusion (MRF) module.
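The upsampling described above can be checked with simple length arithmetic. The sketch below is illustrative only: the upsampling rates and kernel sizes are assumed values (chosen so that their product is 256 samples per mel frame), and the helper names are hypothetical, not taken from the source. It uses the standard output-length formula for a 1-D transposed convolution with dilation 1 and no output padding.

```python
def transposed_conv_length(length, kernel_size, stride, padding):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel_size


def generator_output_length(num_frames, upsample_rates, kernel_sizes):
    """Follow a mel-spectrogram's time axis through a stack of upsampling layers."""
    length = num_frames
    for stride, kernel_size in zip(upsample_rates, kernel_sizes):
        # With this padding (and kernel_size - stride even), each layer
        # multiplies the length by exactly `stride`.
        padding = (kernel_size - stride) // 2
        length = transposed_conv_length(length, kernel_size, stride, padding)
    return length


# Assumed configuration for illustration: 8 * 8 * 2 * 2 = 256 samples per frame.
UPSAMPLE_RATES = (8, 8, 2, 2)
KERNEL_SIZES = (16, 16, 4, 4)

# 32 mel frames are upsampled to 32 * 256 waveform samples.
print(generator_output_length(32, UPSAMPLE_RATES, KERNEL_SIZES))  # 8192
```

Under these assumptions, the product of the strides must equal the hop size used to compute the mel-spectrogram, so that the generated sequence lines up with the raw waveform.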
HiFi-GAN uses two discriminators. The multi-period discriminator (MPD) consists of several sub-discriminators, each handling a portion of the periodic signals in the input audio. To capture consecutive patterns and long-term dependencies, it also uses the multi-scale discriminator (MSD) proposed in MelGAN, which evaluates audio samples at different scales.
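As a rough illustration of how the two discriminators might view the same audio, the sketch below folds a 1-D signal into rows of a fixed period (so each column holds samples spaced one period apart) and, separately, downsamples it by averaging. The function names, the reflect padding, and the non-overlapping mean pooling are assumptions made for this example; they are simplified stand-ins, not the discriminators themselves.

```python
def reshape_by_period(samples, period):
    """Fold 1-D audio into rows of length `period` (an MPD-style 2-D view).

    Assumes len(samples) > period so the reflect padding has enough samples.
    """
    samples = list(samples)
    remainder = len(samples) % period
    if remainder:
        pad = period - remainder
        # Reflect padding: mirror the samples just before the last one.
        samples += samples[-2:-2 - pad:-1]
    return [samples[i:i + period] for i in range(0, len(samples), period)]


def average_pool(samples, factor):
    """Non-overlapping mean pooling (a simplified MSD-style downsampling)."""
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


audio = list(range(10))

# Period view: column j holds samples j, j + 3, j + 6, ...
grid = reshape_by_period(audio, 3)
print(grid)                       # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 8, 7]]
print([row[0] for row in grid])   # [0, 3, 6, 9]

# Scale view: the same audio at half the temporal resolution.
print(average_pool(audio, 2))     # [0.5, 2.5, 4.5, 6.5, 8.5]
```

In this sketch, each period would feed one MPD sub-discriminator, while each pooling factor would feed one MSD sub-discriminator.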
Source: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Task | Papers | Share |
---|---|---|
Speech Synthesis | 16 | 16.49% |
Text to Speech | 13 | 13.40% |
Text-To-Speech Synthesis | 7 | 7.22% |
Voice Conversion | 6 | 6.19% |
Audio Generation | 3 | 3.09% |
Decoder | 3 | 3.09% |
Self-Supervised Learning | 2 | 2.06% |
Automatic Speech Recognition (ASR) | 2 | 2.06% |
Speech Recognition | 2 | 2.06% |