
# ReZero

Introduced by Bachlechner et al. in *ReZero is All You Need: Fast Convergence at Large Depth*

ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. Each layer $i$ adds a residual connection around its input signal $\mathbf{x}_{i}$ and a single trainable parameter $\alpha_{i}$ that scales the layer's non-trivial transformation $F(\mathbf{x}_{i})$:

$$\mathbf{x}_{i+1}=\mathbf{x}_{i}+\alpha_{i} F\left(\mathbf{x}_{i}\right)$$

where $\alpha_{i}=0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, because they are multiplied by $\alpha_{i}=0$, but they dynamically evolve to suitable values during the initial stages of training as each $\alpha_{i}$ moves away from zero. The architecture is illustrated in the Figure.
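
The update rule and its behavior at initialization can be checked on a toy example. The sketch below is not the authors' implementation: it assumes scalar inputs, a hypothetical transformation $F_i(x)=\tanh(w_i x)$, arbitrary weights, and finite-difference gradients, all chosen purely for illustration.

```python
import math


def rezero_forward(x, weights, alphas):
    """Apply a stack of scalar ReZero layers: x <- x + alpha_i * F_i(x)."""
    for w, alpha in zip(weights, alphas):
        # F_i(x) = tanh(w_i * x) is a toy stand-in for a real layer (assumption).
        x = x + alpha * math.tanh(w * x)
    return x


def finite_diff(f, value, eps=1e-6):
    """Central-difference estimate of df/dvalue."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


depth = 64
weights = [0.5 + 0.01 * i for i in range(depth)]  # arbitrary toy weights
alphas = [0.0] * depth                            # ReZero initialization: alpha_i = 0

x0 = 1.5
out = rezero_forward(x0, weights, alphas)

# Gradient of the output with respect to the first layer's weight w_0.
grad_w0 = finite_diff(
    lambda w: rezero_forward(x0, [w] + weights[1:], alphas), weights[0]
)

# Gradient of the output with respect to the first layer's alpha_0.
grad_a0 = finite_diff(
    lambda a: rezero_forward(x0, weights, [a] + alphas[1:]), alphas[0]
)

print(out, grad_w0, grad_a0)
```

With every $\alpha_i=0$ the 64-layer stack returns its input unchanged, and the gradient with respect to the weight inside $F$ is zero. The gradient with respect to $\alpha_0$ equals $F_0(x_0)$, which is generally nonzero, so the $\alpha_i$ can move away from zero and the parameters of $F$ then start to receive gradient.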
