Attention Modules

Talking-Heads Attention

Introduced by Shazeer et al. in Talking-Heads Attention

Talking-Heads Attention is a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. In multi-head attention, the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation. Two additional learned linear projections are inserted, $P_{l}$ and $P_{w}$, which transform the attention logits and the attention weights respectively, moving information across attention heads.

Instead of one "heads" dimension $h$ across the whole computation, we now have three separate heads dimensions, $h_{k}$, $h$, and $h_{v}$, which can optionally differ in size (number of "heads"): $h_{k}$ refers to the number of attention heads for the keys and the queries, $h$ refers to the number of attention heads for the logits and the weights, and $h_{v}$ refers to the number of attention heads for the values.
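
The computation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' reference code: the function name `talking_heads_attention`, the tensor shapes, and the standard $1/\sqrt{d_{k}}$ logit scaling are assumptions for readability. $P_{l}$ mixes the $h_{k}$ logit heads into $h$ heads before the softmax, and $P_{w}$ mixes the $h$ weight heads into $h_{v}$ heads after it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(X, M, W_q, W_k, W_v, P_l, P_w, W_o):
    """Minimal sketch of talking-heads attention (shapes are illustrative assumptions).

    X:   (n, d_model)  query-side inputs
    M:   (m, d_model)  memory (key/value) inputs
    W_q: (d_model, h_k, d_k)   W_k: (d_model, h_k, d_k)   W_v: (d_model, h_v, d_v)
    P_l: (h_k, h)  mixes logit heads     P_w: (h, h_v)  mixes weight heads
    W_o: (h_v, d_v, d_model)  output projection
    """
    Q = np.einsum('nd,dhk->nhk', X, W_q)                    # (n, h_k, d_k)
    K = np.einsum('md,dhk->mhk', M, W_k)                    # (m, h_k, d_k)
    V = np.einsum('md,dhv->mhv', M, W_v)                    # (m, h_v, d_v)

    logits = np.einsum('nhk,mhk->hnm', Q, K) / np.sqrt(Q.shape[-1])  # (h_k, n, m)
    logits = np.einsum('hnm,hH->Hnm', logits, P_l)          # talking heads before softmax: (h, n, m)
    weights = softmax(logits, axis=-1)                      # normalize over memory positions
    weights = np.einsum('hnm,hH->Hnm', weights, P_w)        # talking heads after softmax: (h_v, n, m)

    O = np.einsum('hnm,mhv->nhv', weights, V)               # (n, h_v, d_v)
    return np.einsum('nhv,hvd->nd', O, W_o)                 # (n, d_model)

# Example with arbitrary sizes: h_k, h, and h_v may all differ.
rng = np.random.default_rng(0)
n, m, d_model, h_k, h, h_v, d_k, d_v = 4, 6, 16, 3, 5, 4, 8, 8
X, M = rng.standard_normal((n, d_model)), rng.standard_normal((m, d_model))
W_q, W_k = rng.standard_normal((d_model, h_k, d_k)), rng.standard_normal((d_model, h_k, d_k))
W_v = rng.standard_normal((d_model, h_v, d_v))
P_l, P_w = rng.standard_normal((h_k, h)), rng.standard_normal((h, h_v))
W_o = rng.standard_normal((h_v, d_v, d_model))
print(talking_heads_attention(X, M, W_q, W_k, W_v, P_l, P_w, W_o).shape)  # (4, 16)
```

Setting $h_{k} = h = h_{v}$ and fixing $P_{l}$ and $P_{w}$ to identity matrices recovers standard multi-head attention, so multi-head attention is a special case of this formulation.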

Source: Talking-Heads Attention

Tasks

Task                 Papers   Share
Language Modelling   1        50.00%
Question Answering   1        50.00%
