Talking-Heads Attention is a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. In multi-head attention, the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation. Two additional learned linear projections are inserted, $P_{l}$ and $P_{w}$, which transform the attention-logits and the attention weights respectively, moving information across attention heads. Instead of one "heads" dimension $h$ across the whole computation, we now have three separate heads dimensions: $h_{k}$, $h$, and $h_{v}$, which can optionally differ in size (number of "heads"). $h_{k}$ refers to the number of attention heads for the keys and the queries. $h$ refers to the number of attention heads for the logits and the weights, and $h_{v}$ refers to the number of attention heads for the values.
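The computation above can be illustrated with a minimal NumPy sketch for a single batch element. The shapes, einsum strings, and function name here are illustrative assumptions, not the paper's reference implementation; $P_l$ and $P_w$ are the two learned head-mixing projections described in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(Q, K, V, P_l, P_w):
    """Sketch of talking-heads attention (single batch element).

    Q:   (h_k, n, d_k)  queries, h_k query/key heads
    K:   (h_k, m, d_k)  keys
    V:   (h_v, m, d_v)  values, h_v value heads
    P_l: (h_k, h)  learned projection mixing attention logits across heads
    P_w: (h, h_v)  learned projection mixing attention weights across heads
    Returns: (h_v, n, d_v)
    """
    # Scaled dot-product logits per query/key head: (h_k, n, m)
    logits = np.einsum('hnd,hmd->hnm', Q, K) / np.sqrt(Q.shape[-1])
    # Talk across heads before the softmax: (h, n, m)
    logits = np.einsum('hnm,hg->gnm', logits, P_l)
    # Softmax over key positions
    weights = softmax(logits, axis=-1)
    # Talk across heads after the softmax: (h_v, n, m)
    weights = np.einsum('gnm,gv->vnm', weights, P_w)
    # Weighted sum of values per value head: (h_v, n, d_v)
    return np.einsum('vnm,vmd->vnd', weights, V)
```

Setting $h_k = h = h_v$ and fixing $P_l$ and $P_w$ to identity matrices recovers standard multi-head attention, which makes the head-mixing projections easy to ablate.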
Source: Talking-Heads Attention