T-Fixup is an initialization method for Transformers that aims to remove the need for both layer normalization and learning rate warmup. The initialization procedure is as follows:

  • Apply Xavier initialization to all parameters except the input embeddings. For the input embeddings, use Gaussian initialization $\mathcal{N}\left(0, d^{-\frac{1}{2}}\right)$, where $d$ is the embedding dimension.
  • Scale the $\mathbf{v}_{d}$ and $\mathbf{w}_{d}$ matrices in each decoder attention block, the weight matrices in each decoder MLP block, and the input embeddings $\mathbf{x}$ and $\mathbf{y}$ in the encoder and decoder by $(9 N)^{-\frac{1}{4}}$.
  • Scale the $\mathbf{v}_{e}$ and $\mathbf{w}_{e}$ matrices in each encoder attention block and the weight matrices in each encoder MLP block by $0.67 N^{-\frac{1}{4}}$.
Source: Improving Transformer Optimization Through Better Initialization
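
The procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation, and it rests on several assumptions that the text does not state: $N$ is taken to be the number of layers, $d^{-\frac{1}{2}}$ is used as the standard deviation of the Gaussian, and $\mathbf{v}$ and $\mathbf{w}$ are read as the value and output projections of an attention block. The function names (`tfixup_scales`, `xavier_uniform`, `init_embedding`, `init_block`) and the parameter dictionary layout are hypothetical.

```python
import numpy as np


def tfixup_scales(num_layers):
    """Return (decoder_scale, encoder_scale) for N = num_layers (assumed meaning of N)."""
    decoder_scale = (9 * num_layers) ** -0.25
    encoder_scale = 0.67 * num_layers ** -0.25
    return decoder_scale, encoder_scale


def xavier_uniform(rng, fan_in, fan_out):
    """Xavier (Glorot) uniform initialization for a fan_in x fan_out matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_embedding(rng, vocab_size, d, num_layers):
    """Gaussian embeddings followed by the (9N)^(-1/4) scaling.

    Assumption: d^(-1/2) is used as the standard deviation.
    """
    decoder_scale, _ = tfixup_scales(num_layers)
    return rng.normal(0.0, d ** -0.5, size=(vocab_size, d)) * decoder_scale


def init_block(rng, d, d_ff, num_layers, side):
    """Toy parameter dict for one block; side is "encoder" or "decoder"."""
    decoder_scale, encoder_scale = tfixup_scales(num_layers)
    scale = decoder_scale if side == "decoder" else encoder_scale
    return {
        # Query and key projections: Xavier only, no extra scaling.
        "q": xavier_uniform(rng, d, d),
        "k": xavier_uniform(rng, d, d),
        # v and w (assumed: value and output projections) are scaled.
        "v": xavier_uniform(rng, d, d) * scale,
        "w": xavier_uniform(rng, d, d) * scale,
        # MLP weight matrices are scaled as well.
        "mlp_in": xavier_uniform(rng, d, d_ff) * scale,
        "mlp_out": xavier_uniform(rng, d_ff, d) * scale,
    }
```

Calling `init_block(np.random.default_rng(0), 512, 2048, 6, "encoder")` would give one encoder block's matrices under these assumptions. Both scale factors shrink as $N^{-\frac{1}{4}}$, so deeper models start with smaller value, output, and MLP weights.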




Task                    Papers  Share
Reading Comprehension   1       20.00%
Semantic Parsing        1       20.00%
Text-to-SQL             1       20.00%
Language Modelling      1       20.00%
Machine Translation     1       20.00%

