Griffin-Lim Algorithm

The Griffin-Lim Algorithm (GLA) is a phase reconstruction method based on the redundancy of the short-time Fourier transform (STFT). It promotes the consistency of a spectrogram by alternating between two projections, where a spectrogram is said to be consistent when the inter-bin dependencies arising from the redundancy of the STFT are retained. GLA relies only on this consistency and does not take any prior knowledge about the target signal into account.

The algorithm aims to recover a complex-valued spectrogram that is consistent and preserves the given amplitude $\mathbf{A}$ by the following alternating projection procedure:

$$ \mathbf{X}^{[m+1]} = P_{\mathcal{C}}\left(P_{\mathcal{A}}\left(\mathbf{X}^{[m]}\right)\right) $$

where $\mathbf{X}$ is a complex-valued spectrogram updated through the iteration, $P_{\mathcal{S}}$ is the metric projection onto a set $\mathcal{S}$, and $m$ is the iteration index. Here, $\mathcal{C}$ is the set of consistent spectrograms, and $\mathcal{A}$ is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets $\mathcal{C}$ and $\mathcal{A}$ are given by:

$$ P_{\mathcal{C}}(\mathbf{X}) = \mathcal{G}\mathcal{G}^{\dagger}\mathbf{X} $$

$$ P_{\mathcal{A}}(\mathbf{X}) = \mathbf{A} \odot \mathbf{X} \oslash |\mathbf{X}| $$

where $\mathcal{G}$ represents the STFT, $\mathcal{G}^{\dagger}$ is the pseudo-inverse of the STFT (iSTFT), $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA can be derived as an algorithm for the following optimization problem:

$$ \min_{\mathbf{X}} || \mathbf{X} - P_{\mathcal{C}}\left(\mathbf{X}\right) ||^{2}_{\text{Fro}} \text{ s.t. } \mathbf{X} \in \mathcal{A} $$

where $\| \cdot \|_{\text{Fro}}$ is the Frobenius norm. This problem minimizes the energy of the inconsistent components under the constraint that the amplitude equals the given one. Although GLA has been widely used because of its simplicity, it often requires many iterations to converge and can result in low reconstruction quality. This is because the cost function only enforces consistency, and the characteristics of the target signal are not taken into account.
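The alternating projection above maps directly to code: $P_{\mathcal{A}}$ replaces the magnitude while keeping the current phase, and $P_{\mathcal{C}}$ is an iSTFT followed by an STFT. Below is a minimal sketch using NumPy and `scipy.signal.stft`/`istft`; the function name, random phase initialization, and all parameter choices are illustrative assumptions, not part of the original description:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(A, n_iter=50, nperseg=256, seed=0):
    """Reconstruct a time-domain signal from a magnitude spectrogram A.

    A is assumed to have the shape produced by scipy.signal.stft with the
    same nperseg. X plays the role of the iterate X^[m] in the update
    X^[m+1] = P_C(P_A(X^[m])).
    """
    rng = np.random.default_rng(seed)
    # Start from the given magnitude with a random phase guess.
    X = A * np.exp(2j * np.pi * rng.random(A.shape))
    for _ in range(n_iter):
        # P_A: restore the given magnitude, keeping only the current phase.
        # np.angle(0) == 0, which implements "division by zero -> zero".
        Y = A * np.exp(1j * np.angle(X))
        # P_C: project onto the consistent set via iSTFT followed by STFT.
        _, x = istft(Y, nperseg=nperseg)
        _, _, X = stft(x, nperseg=nperseg)
    # Final synthesis: apply P_A once more, then invert.
    _, x = istft(A * np.exp(1j * np.angle(X)), nperseg=nperseg)
    return x
```

In practice the phase initialization and the stopping criterion matter: a zero-phase start is deterministic but can converge to poorer local minima, which is why a random phase (or several restarts) is commonly used.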

Latest Papers

Learning Speaker Embedding from Text-to-Speech
Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

Grapheme or phoneme? An Analysis of Tacotron's Embedded Representations
Antoine Perquin, Erica Cooper, Junichi Yamagishi

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon

Controllable neural text-to-speech synthesis using intuitive prosodic features
Tuomo Raitio, Ramya Rasipuram, Dan Castellani

Corrective feedback, emphatic speech synthesis, visual-speech exaggeration, pronunciation learning
Yaohua Bu, Weijun Li, Tianyi Ma, Shengqi Chen, Jia Jia, Kun Li, Xiaobo Lu

Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
Dipjyoti Paul, Muhammed PV Shifas, Yannis Pantazis, Yannis Stylianou

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

SpeedySpeech: Efficient Neural Speech Synthesis
Jan Vainer, Ondřej Dušek

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Yusuke Yasuda, Xin Wang, Junichi Yamagishi

Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency
Mohammad Asif Khan, Fabien Cardinaux, Stefan Uhlich, Marc Ferras, Asja Fischer

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Aleksandr Laptev, Roman Korostik, Aleksey Svischev, Andrei Andrusenko, Ivan Medennikov, Sergey Rybin

End-To-End Speech Synthesis Applied to Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, João Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram
Leyuan Sheng, Dong-Yan Huang, Evgeniy N. Pavlovskiy

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis
Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang

Speech Recognition with Augmented Synthesized Speech
Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

A New GAN-based End-to-End TTS Training Algorithm
Haohan Guo, Frank K. Soong, Lei He, Lei Xie

Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data
Roee Levy Leshem, Raja Giryes

Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis
Yanyao Bian, Changbin Chen, Yongguo Kang, Zhenglin Pan

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language
Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis
Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis
Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

Voice Imitating Text-to-Speech Neural Networks
Younggun Lee, Taesu Kim, Soo-Young Lee

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

Adversarial Audio Synthesis
Chris Donahue, Julian McAuley, Miller Puckette

Emotional End-to-End Neural Speech Synthesizer
Younggun Lee, Azam Rabiee, Soo-Young Lee

Uncovering Latent Style Factors for Expressive Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller

Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Sercan Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou

Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous