ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. Each layer combines a residual connection for the input signal $\mathbf{x}$ with a single trainable parameter $\alpha$ that modulates the layer's non-trivial transformation $F(\mathbf{x})$:
$$ \mathbf{x}_{i+1}=\mathbf{x}_{i}+\alpha_{i} F\left(\mathbf{x}_{i}\right) $$
where $\alpha_i = 0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but they dynamically evolve to suitable values during the early stages of training. The architecture is illustrated in the Figure.
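The update rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dense-plus-tanh transformation `F` and the function name `rezero_block` are assumptions made for the example; only the form $\mathbf{x}_{i+1} = \mathbf{x}_i + \alpha_i F(\mathbf{x}_i)$ comes from the source.

```python
import numpy as np

def rezero_block(x, F, alpha):
    """One ReZero residual block: x_{i+1} = x_i + alpha_i * F(x_i)."""
    return x + alpha * F(x)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))

def F(x):
    # A hypothetical non-trivial transformation, stands in for any layer.
    return np.tanh(x @ W)

x = rng.standard_normal((2, 4))

# At initialization alpha = 0, so the block is exactly the identity
# and gradients flow through the residual path unimpeded.
y0 = rezero_block(x, F, alpha=0.0)
assert np.allclose(y0, x)

# As training moves alpha away from 0, the layer starts to contribute.
y1 = rezero_block(x, F, alpha=0.1)
```

Because the block is the identity at $\alpha = 0$, stacking arbitrarily many such blocks still propagates the input signal unchanged at initialization, which is what allows very deep networks to start training stably.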
Source: *ReZero is All You Need: Fast Convergence at Large Depth*
| Task | Papers | Share |
|---|---|---|
| Time Series Analysis | 1 | 25.00% |
| Time Series Forecasting | 1 | 25.00% |
| Clustering | 1 | 25.00% |
| Language Modelling | 1 | 25.00% |