# Scaled Dot-Product Attention

Introduced by Vaswani et al. in *Attention Is All You Need*.

**Scaled dot-product attention** is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$. Formally we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:

$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$
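As a minimal sketch, the formula above translates directly into NumPy (the function names, shapes, and batch-free layout here are illustrative choices, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_queries, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((6, 64))
V = rng.standard_normal((6, 32))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 32)
```

Each output row is a convex combination of the rows of $V$, weighted by how strongly the corresponding query matches each key.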

If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean $0$ and variance $d_k$. Dividing by $\sqrt{d_k}$ restores unit variance; without this scaling, large dot products push the softmax into regions where its gradients are extremely small.
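The variance argument is easy to check numerically; a quick simulation (sample size and seed are arbitrary) shows the unscaled dot products have variance close to $d_k$ while the scaled ones have variance close to $1$:

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n = 100_000

# Components of q and k are i.i.d. with mean 0 and variance 1.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = (q * k).sum(axis=1)    # raw dot products, variance ~ d_k
scaled = dots / np.sqrt(d_k)  # scaled dot products, variance ~ 1

print(dots.var())    # close to 64
print(scaled.var())  # close to 1
```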

Source: *Attention Is All You Need*

Task | Papers | Share
---|---|---
RAG | 53 | 5.93%
Retrieval | 51 | 5.70%
Language Modelling | 45 | 5.03%
Question Answering | 35 | 3.91%
Semantic Segmentation | 21 | 2.35%
Large Language Model | 20 | 2.24%
Decoder | 19 | 2.13%
Decision Making | 15 | 1.68%
Image Classification | 14 | 1.57%