GLU Variants Improve Transformer
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
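As a concrete illustration (not the paper's reference code), the sketch below shows a gated feed-forward sublayer in PyTorch: two linear projections of the input are multiplied component-wise, with a nonlinearity applied to one branch. The class name, dimensions, and bias-free layout are illustrative assumptions; swapping the gate's nonlinearity gives the GLU, ReGLU, GeGLU, and SwiGLU variants discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward sublayer using a gated linear unit.

    Computes activation(x W) * (x V), then a final projection W2.
    Names and the bias-free layout are illustrative, not the paper's code.
    """

    def __init__(self, d_model: int, d_ff: int, activation=torch.sigmoid):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # linear branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.activation(self.w(x)) * self.v(x))


# Variants differ only in the nonlinearity applied to the gated branch:
ffn_glu    = GLUFeedForward(512, 2048, torch.sigmoid)  # original GLU gate
ffn_reglu  = GLUFeedForward(512, 2048, F.relu)         # ReGLU
ffn_geglu  = GLUFeedForward(512, 2048, F.gelu)         # GeGLU
ffn_swiglu = GLUFeedForward(512, 2048, F.silu)         # SwiGLU (Swish, beta=1)

x = torch.randn(4, 10, 512)    # (batch, sequence, d_model)
print(ffn_swiglu(x).shape)     # torch.Size([4, 10, 512])
```

This keeps the standard Transformer feed-forward shape (d_model → d_ff → d_model) while replacing the single ReLU/GELU projection with the gated two-projection form.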
Methods
Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • GeGLU • GELU • GLU • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReGLU • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • SwiGLU • Test • Transformer