ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. Each layer combines a residual connection for the input signal $\mathbf{x}$ with a single trainable parameter $\alpha$ that modulates the layer's non-trivial transformation $F(\mathbf{x})$:
$$ \mathbf{x}_{i+1}=\mathbf{x}_{i}+\alpha_{i} F\left(\mathbf{x}_{i}\right) $$
where $\alpha_i = 0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but they dynamically evolve to suitable values during the early stages of training. The architecture is illustrated in the Figure.
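The update rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dense-plus-tanh transformation `F` and the function name `rezero_block` are assumptions made for the example; only the form $\mathbf{x}_{i+1} = \mathbf{x}_i + \alpha_i F(\mathbf{x}_i)$ comes from the source.

```python
import numpy as np

def rezero_block(x, F, alpha):
    """One ReZero residual block: x_{i+1} = x_i + alpha_i * F(x_i)."""
    return x + alpha * F(x)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))

def F(x):
    # A hypothetical non-trivial transformation, stands in for any layer.
    return np.tanh(x @ W)

x = rng.standard_normal((2, 4))

# At initialization alpha = 0, so the block is exactly the identity
# and gradients flow through the residual path unimpeded.
y0 = rezero_block(x, F, alpha=0.0)
assert np.allclose(y0, x)

# As training moves alpha away from 0, the layer starts to contribute.
y1 = rezero_block(x, F, alpha=0.1)
```

Because the block is the identity at $\alpha = 0$, stacking arbitrarily many such blocks still propagates the input signal unchanged at initialization, which is what allows very deep networks to start training stably.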
Source: *ReZero is All You Need: Fast Convergence at Large Depth*
| Task | Papers | Share |
|---|---|---|
| Time Series Analysis | 1 | 25.00% |
| Time Series Forecasting | 1 | 25.00% |
| Clustering | 1 | 25.00% |
| Language Modelling | 1 | 25.00% |