NormFormer is a type of Pre-LN transformer that adds three normalization operations to each layer: a LayerNorm after self-attention, head-wise scaling of the self-attention outputs, and a LayerNorm after the first fully connected layer. These modifications introduce a small number of additional learnable parameters, which give each layer a cost-effective way to change the magnitude of its features, and therefore the magnitude of the gradients flowing to subsequent components.
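The three additions can be sketched in a minimal single-layer forward pass. This is an illustrative NumPy sketch, not the reference implementation: the parameter names (`wq`, `w1`, `head_scale`, etc.), the use of ReLU, and the omission of LayerNorm gain/bias terms are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-feature normalization (learnable gain/bias omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def normformer_layer(x, params, n_heads):
    """One NormFormer layer (hypothetical minimal forward pass).

    Standard Pre-LN transformer layer plus the three NormFormer additions:
      1. head-wise scaling of the self-attention heads,
      2. LayerNorm after self-attention,
      3. LayerNorm after the first feed-forward projection.
    """
    T, d = x.shape
    dh = d // n_heads
    # --- self-attention sub-layer (Pre-LN) ---
    h = layer_norm(x)
    q, k, v = h @ params["wq"], h @ params["wk"], h @ params["wv"]
    heads = []
    for i in range(n_heads):
        sl = slice(i * dh, (i + 1) * dh)
        attn = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(dh)) @ v[:, sl]
        # (1) head-wise scaling: one learned scalar per head
        heads.append(params["head_scale"][i] * attn)
    attn_out = np.concatenate(heads, axis=-1) @ params["wo"]
    # (2) extra LayerNorm on the attention output, before the residual add
    x = x + layer_norm(attn_out)
    # --- feed-forward sub-layer (Pre-LN) ---
    h = layer_norm(x)
    h = np.maximum(0.0, h @ params["w1"])  # FC1 + ReLU (activation is an assumption)
    h = layer_norm(h)                      # (3) extra LayerNorm after FC1
    return x + h @ params["w2"]
```

Relative to a plain Pre-LN layer, the only new parameters are `head_scale` and the gain/bias of the two extra LayerNorms (dropped above), which is why the overhead is small.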

Source: NormFormer: Improved Transformer Pretraining with Extra Normalization


Task | Papers | Share
--- | --- | ---
Language Modelling | 1 | 100.00%