NormFormer is a type of Pre-LN transformer that adds three normalization operations to each layer: a Layer Norm after self-attention, head-wise scaling of the self-attention outputs, and a Layer Norm after the first fully connected layer. The modifications introduce a small number of additional learnable parameters, which give each layer a cost-effective way to change the magnitude of its features, and therefore the magnitude of the gradients passed to subsequent components.

Source: NormFormer: Improved Transformer Pretraining with Extra Normalization
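The three additions are easy to see in code. Below is a minimal PyTorch sketch of a single NormFormer block on top of a standard Pre-LN causal self-attention layer; the module and parameter names (NormFormerBlock, head_scale, etc.) are illustrative, and details such as placing the post-attention Layer Norm after the output projection follow one plausible reading of the paper, not the official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NormFormerBlock(nn.Module):
    """Pre-LN transformer block with the three NormFormer additions (sketch)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Standard Pre-LN norms at the input of each sub-layer.
        self.ln_attn_in = nn.LayerNorm(d_model)
        self.ln_ffn_in = nn.LayerNorm(d_model)
        # NormFormer addition 1: Layer Norm after self-attention.
        self.ln_attn_out = nn.LayerNorm(d_model)
        # NormFormer addition 2: one learned scale per attention head.
        self.head_scale = nn.Parameter(torch.ones(n_heads))
        # NormFormer addition 3: Layer Norm after the first FC layer.
        self.ln_ffn_mid = nn.LayerNorm(d_ff)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # --- self-attention sub-layer ---
        b, t, d = x.shape
        h = self.ln_attn_in(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Head-wise scaling: each head's output is multiplied by its own scalar.
        attn = attn * self.head_scale.view(1, self.n_heads, 1, 1)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # Extra Layer Norm on the attention output before the residual add.
        x = x + self.ln_attn_out(self.out_proj(attn))
        # --- feed-forward sub-layer ---
        h = self.ln_ffn_in(x)
        # Extra Layer Norm after the first fully connected layer.
        h = self.ln_ffn_mid(F.gelu(self.fc1(h)))
        return x + self.fc2(h)

Because head_scale is initialized to ones and the extra Layer Norms start near identity-plus-normalization, the block begins training close to a plain Pre-LN layer while letting each layer rescale its features cheaply.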

Tasks

Task                 Papers   Share
Language Modelling   1        100.00%

Categories

Transformers