LayerScale Explained | Papers With Code

**LayerScale** is a method for [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) architectures that improves training dynamics. It multiplies the output of each residual block by a learnable diagonal matrix whose entries are initialized close to (but not at) 0. This simple addition makes it possible to train deeper, high-capacity image transformers that benefit from depth.

Specifically, LayerScale multiplies the vector produced by each residual block by a per-channel weight rather than by a single scalar; see Figure (d). The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block. In other words:

$$
x\_{l}^{\prime} =x\_{l}+\operatorname{diag}\left(\lambda\_{l, 1}, \ldots, \lambda\_{l, d}\right) \times \operatorname{SA}\left(\eta\left(x\_{l}\right)\right) 
$$

$$
x\_{l+1} =x\_{l}^{\prime}+\operatorname{diag}\left(\lambda\_{l, 1}^{\prime}, \ldots, \lambda\_{l, d}^{\prime}\right) \times \operatorname{FFN}\left(\eta\left(x\_{l}^{\prime}\right)\right)
$$

where the parameters $\lambda\_{l, i}$ and $\lambda\_{l, i}^{\prime}$ are learnable weights, $\eta$ denotes layer normalization, and SA and FFN are the self-attention and feed-forward branches. The diagonal values are all initialized to a fixed small value $\varepsilon$: we set $\varepsilon=0.1$ until depth 18, $\varepsilon=10^{-5}$ for depth 24, and $\varepsilon=10^{-6}$ for deeper networks.
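
The two update equations can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `init_epsilon`, `LayerScale`, `residual_block`, `layer_norm` and `toy_branch` are hypothetical names, `toy_branch` is only a stand-in for the SA and FFN branches, the weights are plain floats rather than trainable parameters, and the threshold used for depths between 18 and 24 is an assumption, since the text only gives values for "until depth 18", "depth 24" and "deeper networks".

```python
import math
from typing import Callable, List

Vector = List[float]


def init_epsilon(depth: int) -> float:
    """Initial diagonal value for a given depth, using the values quoted above."""
    if depth <= 18:
        return 0.1
    if depth <= 24:  # assumption: depths 19 to 24 share the depth-24 value
        return 1e-5
    return 1e-6


class LayerScale:
    """Multiply a d-dimensional vector by diag(lambda_1, ..., lambda_d)."""

    def __init__(self, dim: int, eps: float) -> None:
        # In a deep learning framework these would be learnable parameters.
        self.lam: Vector = [eps] * dim

    def __call__(self, branch_out: Vector) -> Vector:
        if len(branch_out) != len(self.lam):
            raise ValueError("channel dimension mismatch")
        return [l * v for l, v in zip(self.lam, branch_out)]


def layer_norm(x: Vector, tol: float = 1e-6) -> Vector:
    """Normalize x to zero mean and unit variance (no affine parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + tol) for v in x]


def toy_branch(x: Vector) -> Vector:
    """Hypothetical stand-in for a self-attention or feed-forward branch."""
    return [2.0 * v + 1.0 for v in x]


def residual_block(
    x: Vector,
    branch: Callable[[Vector], Vector],
    scale: LayerScale,
    norm: Callable[[Vector], Vector] = layer_norm,
) -> Vector:
    """Compute x + diag(lambda) * branch(norm(x))."""
    scaled = scale(branch(norm(x)))
    return [xi + si for xi, si in zip(x, scaled)]


x = [1.0, -2.0, 0.5, 3.0]
eps = init_epsilon(depth=36)
sa_scale = LayerScale(len(x), eps)    # lambda_{l, i}
ffn_scale = LayerScale(len(x), eps)   # lambda'_{l, i}

x_prime = residual_block(x, toy_branch, sa_scale)       # first equation
x_next = residual_block(x_prime, toy_branch, ffn_scale)  # second equation

# At initialization the block stays very close to the identity function.
max_dev = max(abs(a - b) for a, b in zip(x_next, x))
print(max_dev)
```

Because each branch has its own vector of `d` weights, the SA and FFN contributions can be rescaled independently, and channel by channel, as training progresses.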

This formula is akin to other [normalization](https://paperswithcode.com/methods/category/normalization) strategies such as [ActNorm](https://paperswithcode.com/method/activation-normalization) or [LayerNorm](https://paperswithcode.com/method/layer-normalization), but applied to the output of the residual block. Yet LayerScale seeks a different effect: [ActNorm](https://paperswithcode.com/method/activation-normalization) is a data-dependent initialization that calibrates activations so that they have zero mean and unit variance, like [BatchNorm](https://paperswithcode.com/method/batch-normalization). In contrast, in LayerScale, we initialize the diagonal with small values so that the initial contribution of the residual branches to the function implemented by the transformer is small. In that respect, the motivation is closer to that of [ReZero](https://paperswithcode.com/method/rezero), [SkipInit](https://paperswithcode.com/method/skipinit), [Fixup](https://paperswithcode.com/method/fixup-initialization) and [T-Fixup](https://paperswithcode.com/method/t-fixup): to train closer to the identity function and let the network integrate the additional parameters progressively during training. LayerScale offers more diversity in the optimization than adjusting the whole layer by a single learnable scalar, as in [ReZero](https://paperswithcode.com/method/rezero)/[SkipInit](https://paperswithcode.com/method/skipinit), [Fixup](https://paperswithcode.com/method/fixup-initialization) and [T-Fixup](https://paperswithcode.com/method/t-fixup).
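
One way to read the "more diversity" remark is through the gradients each gating scheme receives. The sketch below is an illustration under assumptions, not taken from the paper: `scalar_gate_grad` and `diagonal_gate_grads` are hypothetical helper names, and the input values are made up. For an output `y = x + alpha * f` with a single scalar `alpha`, the chain rule sums the gradient over all channels, whereas for `y_i = x_i + lambda_i * f_i` each `lambda_i` receives its own per-channel signal.

```python
from typing import List

Vector = List[float]


def scalar_gate_grad(upstream: Vector, branch_out: Vector) -> float:
    """dL/d(alpha) for y = x + alpha * f: one value summed over all channels."""
    return sum(g * f for g, f in zip(upstream, branch_out))


def diagonal_gate_grads(upstream: Vector, branch_out: Vector) -> Vector:
    """dL/d(lambda_i) for y_i = x_i + lambda_i * f_i: one value per channel."""
    return [g * f for g, f in zip(upstream, branch_out)]


# Made-up example values: upstream gradient dL/dy and branch output f.
upstream = [1.0, -1.0, 0.5]
branch_out = [2.0, 2.0, 4.0]

per_channel = diagonal_gate_grads(upstream, branch_out)
single = scalar_gate_grad(upstream, branch_out)

print(per_channel)
print(single)
```

In this example the first two channels pull in opposite directions, so a single scalar only sees their net effect, while the diagonal can raise one channel and lower the other.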

Attached collections: Regularization, Normalization

| Task | Papers | Share |
| --- | --- | --- |
| Image Classification | 6 | 25.00% |
| Semantic Segmentation | 3 | 12.50% |
| Fine-Grained Image Classification | 2 | 8.33% |
| Classification | 2 | 8.33% |
| Object Detection | 2 | 8.33% |
| Self-Supervised Learning | 1 | 4.17% |
| Domain Generalization | 1 | 4.17% |
| Real-Time Object Detection | 1 | 4.17% |
| Image Segmentation | 1 | 4.17% |
