CBHG is a building block used in the Tacotron text-to-speech model; the name reflects its three stages: a bank of 1-D Convolutional filters, Highway networks, and a bidirectional Gated recurrent unit (BiGRU).
The module is used to extract representations from sequences. The input sequence is first convolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C_{k}$ filters of width $k$ (i.e. $k = 1, 2, \dots, K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to $K$-grams). The convolution outputs are stacked together and max pooled along time to increase local invariance; a stride of 1 is used so that the original time resolution is preserved.

The pooled sequence is then passed through a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections. Batch normalization is used for all convolutional layers. The convolution outputs are fed into a multi-layer highway network to extract high-level features. Finally, a bidirectional GRU is stacked on top to extract sequential features from both forward and backward context.
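Since the paragraphs above walk through the whole pipeline, a compact code sketch may help. Below is a minimal PyTorch sketch of CBHG, assuming the encoder hyperparameters reported in the Tacotron paper (a bank of $K = 16$ filter sets, two width-3 convolutional projections, a 4-layer highway network, and a 128-cell bidirectional GRU); the class names `CBHG` and `Highway` and all argument names here are illustrative, not taken from any official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Highway(nn.Module):
    """One highway layer: y = t * relu(H(x)) + (1 - t) * x."""

    def __init__(self, size):
        super().__init__()
        self.H = nn.Linear(size, size)
        self.T = nn.Linear(size, size)
        # Negative gate bias so early training favors carrying the input
        # through unchanged (standard highway-network initialization).
        nn.init.constant_(self.T.bias, -1.0)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * F.relu(self.H(x)) + (1.0 - t) * x


class CBHG(nn.Module):
    """Conv bank -> max pool -> conv projections -> highway net -> BiGRU."""

    def __init__(self, in_dim=128, K=16, proj_dim=128,
                 highway_layers=4, gru_units=128):
        super().__init__()
        # Convolution bank: the k-th set has filters of width k.
        self.bank = nn.ModuleList(
            [nn.Conv1d(in_dim, in_dim, kernel_size=k, padding=k // 2)
             for k in range(1, K + 1)])
        self.bank_bn = nn.ModuleList(
            [nn.BatchNorm1d(in_dim) for _ in range(K)])
        # Max pooling along time with stride 1 keeps the time resolution.
        self.pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
        # Two fixed-width (3) projections; the second maps back to in_dim
        # so the residual connection with the input is well defined.
        self.proj1 = nn.Conv1d(K * in_dim, proj_dim, 3, padding=1)
        self.proj1_bn = nn.BatchNorm1d(proj_dim)
        self.proj2 = nn.Conv1d(proj_dim, in_dim, 3, padding=1)
        self.proj2_bn = nn.BatchNorm1d(in_dim)
        # Assumes in_dim matches the highway width, as in the paper (128).
        self.highways = nn.ModuleList(
            [Highway(in_dim) for _ in range(highway_layers)])
        self.gru = nn.GRU(in_dim, gru_units,
                          batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, time, in_dim)
        t_len = x.size(1)
        y = x.transpose(1, 2)  # conv layers expect (batch, channels, time)
        # Apply every filter set, trim the extra step that even filter
        # widths produce, and stack results along the channel axis.
        y = torch.cat(
            [F.relu(bn(conv(y))[:, :, :t_len])
             for conv, bn in zip(self.bank, self.bank_bn)],
            dim=1)                               # (batch, K*in_dim, time)
        y = self.pool(y)[:, :, :t_len]
        y = F.relu(self.proj1_bn(self.proj1(y)))
        y = self.proj2_bn(self.proj2(y))         # linear activation
        y = y.transpose(1, 2) + x                # residual connection
        for hw in self.highways:
            y = hw(y)
        out, _ = self.gru(y)                     # (batch, time, 2*gru_units)
        return out


if __name__ == "__main__":
    cbhg = CBHG()
    x = torch.randn(2, 50, 128)   # (batch, time, channels)
    print(cbhg(x).shape)          # torch.Size([2, 50, 256])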
Source: Tacotron: Towards End-to-End Speech Synthesis
| Task | Papers | Share |
|---|---|---|
| Speech Synthesis | 43 | 38.39% |
| Text-To-Speech Synthesis | 15 | 13.39% |
| Decoder | 10 | 8.93% |
| Sentence | 6 | 5.36% |
| Voice Cloning | 5 | 4.46% |
| Voice Conversion | 4 | 3.57% |
| Speech Recognition | 4 | 3.57% |
| Expressive Speech Synthesis | 3 | 2.68% |
| Self-Supervised Learning | 2 | 1.79% |