Multiplicative Attention is an attention mechanism where the alignment score function is calculated as:
$$f_{att}\left(\textbf{h}_{i}, \textbf{s}_{j}\right) = \mathbf{h}_{i}^{T}\textbf{W}_{a}\mathbf{s}_{j}$$
Here $\mathbf{h}$ refers to the hidden states of the encoder (source) and $\mathbf{s}$ to the hidden states of the decoder (target). The function above is thus a type of alignment score function, and a matrix of alignment scores can be used to visualize the correlation between source and target words. Within a neural network, once we have the alignment scores, we pass them through a softmax to obtain the final attention weights (ensuring they sum to 1).
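As a concrete illustration, here is a minimal NumPy sketch of this computation for a single decoder state attending over all encoder states. The function name, shapes, and toy dimensions are illustrative assumptions, not part of the original description.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multiplicative_attention(H, s, W_a):
    """Multiplicative (general) attention for one decoder state.

    H:   encoder hidden states, shape (T_src, d_h)
    s:   a single decoder hidden state, shape (d_s,)
    W_a: learned weight matrix, shape (d_h, d_s)
    """
    # Alignment scores f_att(h_i, s) = h_i^T W_a s for every source position i
    scores = H @ W_a @ s          # (T_src,)
    # Softmax turns the scores into attention weights that sum to 1
    weights = softmax(scores)     # (T_src,)
    # Context vector: attention-weighted sum of encoder states
    context = weights @ H         # (d_h,)
    return weights, context

# Toy example with hypothetical dimensions
rng = np.random.default_rng(0)
T_src, d_h, d_s = 5, 8, 8
H = rng.normal(size=(T_src, d_h))
s = rng.normal(size=(d_s,))
W_a = rng.normal(size=(d_h, d_s))
weights, context = multiplicative_attention(H, s, W_a)
print(weights.sum())  # ~1.0
```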
Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice because it can be implemented with highly optimized matrix multiplication. The two variants perform similarly for small dimensionality $d_{h}$ of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale the scores $f_{att}\left(\textbf{h}_{i}, \textbf{s}_{j}\right)$ by $1/\sqrt{d_{h}}$, as in scaled dot-product attention. A sketch of that scaling follows below.
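A minimal sketch of the scaled variant, reusing `softmax` and the shapes from the snippet above; the function name is an assumption for illustration.

```python
def scaled_multiplicative_attention(H, s, W_a):
    # Scale the scores by 1/sqrt(d_h) so their magnitude does not grow with the
    # hidden-state dimensionality, which would push the softmax into regions
    # with very small gradients.
    d_h = s.shape[-1]
    scores = (H @ W_a @ s) / np.sqrt(d_h)
    weights = softmax(scores)
    return weights, weights @ H
```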
Source: Deep Learning for NLP Best Practices by Sebastian Ruder
Task | Papers | Share
---|---|---
Graph Attention | 1 | 9.09%
Link Prediction | 1 | 9.09%
Node Classification | 1 | 9.09%
Named Entity Recognition (NER) | 1 | 9.09%
NER | 1 | 9.09%
Speech Enhancement | 1 | 9.09%
Image-guided Story Ending Generation | 1 | 9.09%
Machine Translation | 1 | 9.09%
NMT | 1 | 9.09%