A Highway Layer augments a standard neural network layer with an "information highway": a gated shortcut that lets input flow directly to deeper layers. It is characterised by the use of gating units that regulate how much information is transformed and how much is carried through unchanged.
A plain feedforward neural network typically consists of $L$ layers, where the $l$th layer ($l \in \{1, 2, \dots, L\}$) applies a nonlinear transform $H$ (parameterized by $\mathbf{W_{H,l}}$) to its input $\mathbf{x_{l}}$ to produce its output $\mathbf{y_{l}}$. Thus, $\mathbf{x_{1}}$ is the input to the network and $\mathbf{y_{L}}$ is the network's output. Omitting the layer index and biases for clarity,
$$ \mathbf{y} = H\left(\mathbf{x},\mathbf{W_{H}}\right) $$
$H$ is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.
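As a concrete illustration, a plain layer of this form can be sketched in a few lines of NumPy. The choice of $\tanh$ as the activation is an assumption for the example; as noted above, $H$ may take other forms.

```python
import numpy as np

def plain_layer(x, W_H, b_H):
    # One plain feedforward layer: an affine transform followed by a
    # nonlinearity (tanh here; the activation choice is an assumption).
    return np.tanh(W_H @ x + b_H)

# Applying the layer to a random input; stacking L such layers
# would feed each output y_l in as the next input x_{l+1}.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W_H = rng.standard_normal((8, 8)) * 0.1
b_H = np.zeros(8)
y = plain_layer(x, W_H, b_H)
```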
For a highway network, we additionally define two nonlinear transforms $T\left(\mathbf{x},\mathbf{W_{T}}\right)$ and $C\left(\mathbf{x},\mathbf{W_{C}}\right)$ such that:
$$ \mathbf{y} = H\left(\mathbf{x},\mathbf{W_{H}}\right) \cdot T\left(\mathbf{x},\mathbf{W_{T}}\right) + \mathbf{x} \cdot C\left(\mathbf{x},\mathbf{W_{C}}\right) $$
We refer to $T$ as the transform gate and $C$ as the carry gate, since they express how much of the output is produced by transforming the input versus carrying it through unchanged. In the original paper, the authors set $C = 1 - T$, giving:
$$ \mathbf{y} = H\left(\mathbf{x},\mathbf{W_{H}}\right) \cdot T\left(\mathbf{x},\mathbf{W_{T}}\right) + \mathbf{x} \cdot \left(1-T\left(\mathbf{x},\mathbf{W_{T}}\right)\right) $$
The authors set:
$$ T\left(\mathbf{x}\right) = \sigma\left(\mathbf{W_{T}}^{T}\mathbf{x} + \mathbf{b_{T}}\right) $$
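Putting the pieces together, a single highway layer can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: $H$ is taken to be affine-plus-$\tanh$ (an assumption; the formulation leaves $H$ general), and $T$ is the sigmoid gate from the equation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    # H: affine transform + tanh (tanh is an assumption for this sketch).
    H = np.tanh(W_H @ x + b_H)
    # T: affine transform + sigmoid, the transform gate defined above.
    T = sigmoid(W_T @ x + b_T)
    # Elementwise gating with carry gate C = 1 - T:
    # y = H(x) * T(x) + x * (1 - T(x))
    return H * T + x * (1.0 - T)

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W_H = rng.standard_normal((d, d)) * 0.1
W_T = rng.standard_normal((d, d)) * 0.1
b_H = np.zeros(d)
# A strongly negative transform-gate bias pushes T toward 0,
# so the carry path dominates and the layer passes x through nearly unchanged.
b_T = np.full(d, -10.0)
y = highway_layer(x, W_H, b_H, W_T, b_T)
```

The original paper initializes the transform-gate bias $\mathbf{b_{T}}$ to a negative value so that the network is initially biased toward carry behaviour, which is what the negative bias in the example demonstrates.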
Source: Highway Networks
| Task | Papers | Share |
|---|---|---|
| Speech Synthesis | 44 | 23.91% |
| Text-To-Speech Synthesis | 15 | 8.15% |
| Decoder | 12 | 6.52% |
| Speech Recognition | 10 | 5.43% |
| Language Modelling | 8 | 4.35% |
| Sentence | 6 | 3.26% |
| Voice Cloning | 5 | 2.72% |
| Voice Conversion | 4 | 2.17% |
| Translation | 3 | 1.63% |