• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as:
y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
Whereas the parallel formulation can be written as:
y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.
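The two formulations above can be sketched in a few lines. This is a minimal, illustrative stand-in (not the PaLM implementation): `attention` and `mlp` are replaced by simple deterministic linear maps so the block structure is runnable in isolation, and all function names are assumptions for the sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x):
    # Stand-in for self-attention: a fixed linear projection.
    W = np.random.default_rng(0).standard_normal((x.shape[-1], x.shape[-1])) * 0.02
    return x @ W

def mlp(x):
    # Stand-in two-layer MLP with ReLU.
    d = x.shape[-1]
    rng = np.random.default_rng(1)
    W1 = rng.standard_normal((d, 4 * d)) * 0.02
    W2 = rng.standard_normal((4 * d, d)) * 0.02
    return np.maximum(x @ W1, 0.0) @ W2

def serial_block(x):
    # Standard ("serialized") formulation:
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    return x + mlp(layer_norm(x + attention(layer_norm(x))))

def parallel_block(x):
    # Parallel formulation: one shared LayerNorm feeds both branches,
    # so the MLP and Attention input matmuls can be fused.
    h = layer_norm(x)
    return x + mlp(h) + attention(h)

x = np.random.default_rng(42).standard_normal((2, 8, 16))
print(parallel_block(x).shape)  # → (2, 8, 16)
```

Note that in the parallel block the two branches consume the same normalized activations `h`, which is what makes the input projections fusable into a single larger matrix multiplication.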
Source: PaLM: Scaling Language Modeling with Pathways
Task | Papers | Share
---|---|---
Auto Debugging | 1 | 5.00%
Code Generation | 1 | 5.00%
Common Sense Reasoning | 1 | 5.00%
Coreference Resolution | 1 | 5.00%
Cross-Lingual Question Answering | 1 | 5.00%
Few-Shot Learning | 1 | 5.00%
Hindu Knowledge | 1 | 5.00%
Known Unknowns | 1 | 5.00%
Language Modelling | 1 | 5.00%