LayerScale is a method for vision transformer architectures that improves training dynamics. It adds a learnable diagonal matrix on the output of each residual block, initialized close to (but not at) 0. Adding this simple layer after each residual block makes it possible to train deeper, high-capacity image transformers that benefit from depth.
Specifically, LayerScale is a per-channel multiplication of the vector produced by each residual block, as opposed to multiplication by a single scalar; see Figure (d). The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block. In other words:
$$ x_{l}^{\prime} =x_{l}+\operatorname{diag}\left(\lambda_{l, 1}, \ldots, \lambda_{l, d}\right) \times \operatorname{SA}\left(\eta\left(x_{l}\right)\right) $$
$$ x_{l+1} =x_{l}^{\prime}+\operatorname{diag}\left(\lambda_{l, 1}^{\prime}, \ldots, \lambda_{l, d}^{\prime}\right) \times \operatorname{FFN}\left(\eta\left(x_{l}^{\prime}\right)\right) $$
where the parameters $\lambda_{l, i}$ and $\lambda_{l, i}^{\prime}$ are learnable weights, $\eta$ denotes LayerNorm, SA is the self-attention block and FFN is the feed-forward network. The diagonal values are all initialized to a fixed small value $\varepsilon$: we set $\varepsilon=0.1$ for depths up to 18, $\varepsilon=10^{-5}$ for depth 24, and $\varepsilon=10^{-6}$ for deeper networks.
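The two residual updates can be sketched in a few lines of plain Python. This is a minimal illustration for a single token of $d$ channels, not the authors' implementation: the names `layerscale_init_eps`, `layerscale_residual` and `toy_branch` are hypothetical, `branch` stands in for $\operatorname{SA}(\eta(\cdot))$ or $\operatorname{FFN}(\eta(\cdot))$, and the initial values are read as 0.1, $10^{-5}$ and $10^{-6}$. The text does not say what to use for depths between 19 and 23, so mapping them to $10^{-5}$ is an assumption.

```python
from typing import Callable, List, Sequence


def layerscale_init_eps(depth: int) -> float:
    """Initial value for every diagonal entry, keyed on network depth.

    Assumption: depths 19..23 are not covered by the text; they are
    treated like depth 24 here.
    """
    if depth <= 18:
        return 0.1
    if depth <= 24:
        return 1e-5
    return 1e-6


def layerscale_residual(
    x: Sequence[float],
    branch: Callable[[Sequence[float]], Sequence[float]],
    lam: Sequence[float],
) -> List[float]:
    """Return x + diag(lam) * branch(x) for one token with d channels."""
    out = branch(x)
    if not (len(x) == len(out) == len(lam)):
        raise ValueError("x, branch(x) and lam must all have d entries")
    # Multiplying by a diagonal matrix is an element-wise product.
    return [xi + li * oi for xi, li, oi in zip(x, lam, out)]


# Toy usage: a stand-in branch that simply doubles its input.
def toy_branch(v: Sequence[float]) -> List[float]:
    return [2.0 * vi for vi in v]


d = 4
lam = [layerscale_init_eps(depth=24)] * d  # one learnable weight per channel
x = [1.0, -2.0, 0.5, 3.0]
y = layerscale_residual(x, toy_branch, lam)
```

With such a small initial value the output `y` differs from `x` only by a factor of about $1 + 2 \times 10^{-5}$ per channel, so the block starts out very close to the identity. In a real model the entries of `lam` would be trained along with the other weights.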
This formula is akin to other normalization strategies such as ActNorm or LayerNorm, but applied to the output of the residual block. Yet LayerScale seeks a different effect: ActNorm is a data-dependent initialization that calibrates activations so that they have zero mean and unit variance, like BatchNorm. In contrast, in LayerScale, we initialize the diagonal with small values so that the initial contribution of the residual branches to the function implemented by the transformer is small. In that respect, the motivation is closer to that of ReZero, SkipInit, Fixup and T-Fixup: to train closer to the identity function and let the network integrate the additional parameters progressively during training. LayerScale offers more diversity in the optimization than adjusting the whole layer by a single learnable scalar, as in ReZero/SkipInit, Fixup and T-Fixup.
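One way to make the "more diversity" remark concrete is to compare gradients. The following is a worked illustration, not a derivation from the source. It assumes a single token, a loss $\mathcal{L}$, and the shorthand $u = \operatorname{SA}\left(\eta\left(x_{l}\right)\right)$; the scalar $\alpha_{l}$ stands for a ReZero/SkipInit-style single weight. With the per-channel update $x_{l}^{\prime} = x_{l} + \operatorname{diag}\left(\lambda_{l, 1}, \ldots, \lambda_{l, d}\right) \times u$, each weight receives its own gradient:

$$ \frac{\partial \mathcal{L}}{\partial \lambda_{l, i}} = \left(\frac{\partial \mathcal{L}}{\partial x_{l}^{\prime}}\right)_{i} u_{i}, \qquad i = 1, \ldots, d $$

With a single scalar, $x_{l}^{\prime} = x_{l} + \alpha_{l} u$, the same per-channel terms are summed into one number:

$$ \frac{\partial \mathcal{L}}{\partial \alpha_{l}} = \sum_{i=1}^{d}\left(\frac{\partial \mathcal{L}}{\partial x_{l}^{\prime}}\right)_{i} u_{i} $$

The scalar variant can therefore only scale the whole branch up or down, whereas LayerScale can adjust each output channel separately. For several tokens, both expressions are additionally summed over tokens.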
Source: Going deeper with Image Transformers

| Task | Papers | Share |
|---|---|---|
| Image Classification | 4 | 28.57% |
| Object Detection | 2 | 14.29% |
| Semantic Segmentation | 2 | 14.29% |
| Domain Generalization | 1 | 7.14% |
| Speaker Verification | 1 | 7.14% |
| Fine-Grained Image Classification | 1 | 7.14% |
| General Classification | 1 | 7.14% |
| Machine Translation | 1 | 7.14% |
| Self-Supervised Image Classification | 1 | 7.14% |