Griffin-Lim Algorithm

The Griffin-Lim Algorithm (GLA) is a phase reconstruction method based on the redundancy of the short-time Fourier transform (STFT). It promotes the consistency of a spectrogram by iterating two projections, where a spectrogram is said to be consistent when the inter-bin dependencies owing to the redundancy of the STFT are retained. GLA relies only on this consistency and does not take any prior knowledge about the target signal into account.

The algorithm recovers a complex-valued spectrogram that is consistent and maintains the given amplitude $\mathbf{A}$ by the following alternating projection procedure:

$$ \mathbf{X}^{[m+1]} = P_{\mathcal{C}}\left(P_{\mathcal{A}}\left(\mathbf{X}^{[m]}\right)\right) $$

where $\mathbf{X}$ is a complex-valued spectrogram updated through the iteration, $P_{\mathcal{S}}$ is the metric projection onto a set $\mathcal{S}$, and $m$ is the iteration index. Here, $\mathcal{C}$ is the set of consistent spectrograms, and $\mathcal{A}$ is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets $\mathcal{C}$ and $\mathcal{A}$ are given by:

$$ P_{\mathcal{C}}(\mathbf{X}) = \mathcal{G}\mathcal{G}^{\dagger}\mathbf{X} $$

$$ P_{\mathcal{A}}(\mathbf{X}) = \mathbf{A} \odot \mathbf{X} \oslash |\mathbf{X}| $$

where $\mathcal{G}$ represents the STFT, $\mathcal{G}^{\dagger}$ is the pseudo-inverse of the STFT (iSTFT), $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA is derived as an algorithm for the following optimization problem:

$$ \min_{\mathbf{X}} || \mathbf{X} - P_{\mathcal{C}}\left(\mathbf{X}\right) ||^{2}_{\text{Fro}} \text{ s.t. } \mathbf{X} \in \mathcal{A} $$

where $||\cdot||_{\text{Fro}}$ is the Frobenius norm. This objective minimizes the energy of the inconsistent components under the constraint that the amplitude equals the given one. Although GLA has been widely utilized because of its simplicity, it often requires many iterations to converge to a certain spectrogram and results in low reconstruction quality. This is because the cost function only requires consistency, and the characteristics of the target signal are not taken into account.
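The two projections and the alternating iteration above can be sketched in a short NumPy implementation. This is a minimal illustration, not a reference implementation: the Hann-window STFT, the hop size, and the function names (`stft`, `istft`, `griffin_lim`) are assumptions made for this example, and the iSTFT approximates the pseudo-inverse $\mathcal{G}^{\dagger}$ by windowed overlap-add with window-sum normalization.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-window STFT, returning an array of shape (freq, frames)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T

def istft(X, n_fft=512, hop=128):
    """Least-squares inverse STFT via windowed overlap-add."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(X.T, n=n_fft, axis=1)
    length = (X.shape[1] - 1) * hop + n_fft
    x = np.zeros(length)
    norm = np.zeros(length)
    for t, frame in enumerate(frames):
        x[t * hop:t * hop + n_fft] += frame * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(A, n_iters=100, n_fft=512, hop=128, seed=0):
    """Reconstruct a waveform from a magnitude spectrogram A."""
    rng = np.random.default_rng(seed)
    # Initialize with the given magnitude and a random phase.
    X = A * np.exp(1j * rng.uniform(0, 2 * np.pi, A.shape))
    for _ in range(n_iters):
        # P_A: restore the given amplitude; division by zero is replaced by zero.
        X = A * np.divide(X, np.abs(X), out=np.zeros_like(X), where=np.abs(X) > 0)
        # P_C: project onto the set of consistent spectrograms (iSTFT, then STFT).
        X = stft(istft(X, n_fft, hop), n_fft, hop)
    return istft(X, n_fft, hop)
```

Each iteration applies $P_{\mathcal{A}}$ followed by $P_{\mathcal{C}}$, so the returned waveform's spectrogram approximately matches the target magnitude while lying near the range of the STFT.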

Latest Papers

Learning Speaker Embedding from Text-to-Speech
Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak
2020-10-21
Grapheme or phoneme? An Analysis of Tacotron's Embedded Representations
Antoine Perquin, Erica Cooper, Junichi Yamagishi
2020-10-21
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu
2020-10-08
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search
Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon
2020-10-01
Controllable neural text-to-speech synthesis using intuitive prosodic features
Tuomo Raitio, Ramya Rasipuram, Dan Castellani
2020-09-14
Corrective feedback, emphatic speech synthesis, visual-speech exaggeration, pronunciation learning
Yaohua Bu, Weijun Li, Tianyi Ma, Shengqi Chen, Jia Jia, Kun Li, Xiaobo Lu
2020-09-12
Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
Dipjyoti Paul, Muhammed PV Shifas, Yannis Pantazis, Yannis Stylianou
2020-08-13
Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
2020-08-11
SpeedySpeech: Efficient Neural Speech Synthesis
Jan Vainer, Ondřej Dušek
2020-08-09
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek
2020-08-03
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Yusuke Yasuda, Xin Wang, Junichi Yamagishi
2020-05-20
Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency
Mohammad Asif Khan, Fabien Cardinaux, Stefan Uhlich, Marc Ferras, Asja Fischer
2020-05-15
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Aleksandr Laptev, Roman Korostik, Aleksey Svischev, Andrei Andrusenko, Ivan Medennikov, Sergey Rybin
2020-05-14
End-To-End Speech Synthesis Applied to Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, João Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio
2020-05-11
WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li
2020-02-02
High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram
Leyuan Sheng, Dong-Yan Huang, Evgeniy N. Pavlovskiy
2019-12-03
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis
Junjie Pan, Xiang Yin, Zhiling Zhang, Shichao Liu, Yang Zhang, Zejun Ma, Yuxuan Wang
2019-11-11
Speech Recognition with Augmented Synthesized Speech
Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, Zelin Wu
2019-09-25
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran
2019-07-09
A New GAN-based End-to-End TTS Training Algorithm
Haohan Guo, Frank K. Soong, Lei He, Lei Xie
2019-04-09
Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data
Roee Levy Leshem, Raja Giryes
2019-04-06
Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis
Yanyao Bian, Changbin Chen, Yongguo Kang, Zhenglin Pan
2019-04-04
Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi
2019-03-29
Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language
Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi
2018-10-29
Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis
Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, RJ Skerry-Ryan
2018-08-30
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis
Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan
2018-08-04
Voice Imitating Text-to-Speech Neural Networks
Younggun Lee, Taesu Kim, Soo-Young Lee
2018-06-04
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous
2018-03-24
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous
2018-03-23
Adversarial Audio Synthesis
Chris Donahue, Julian McAuley, Miller Puckette
2018-02-12
Emotional End-to-End Neural Speech Synthesizer
Younggun Lee, Azam Rabiee, Soo-Young Lee
2017-11-15
Uncovering Latent Style Factors for Expressive Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous
2017-11-01
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller
2017-10-20
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Sercan Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
2017-05-24
Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous
2017-03-29
