Talking-Heads Attention is a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. In multi-head attention, the different attention heads perform separate computations, which are then summed at the end. Talking-Heads Attention breaks that separation. Two additional learned linear projections are inserted, $P_{l}$ and $P_{w}$, which transform the attention-logits and the attention weights respectively, moving information across attention heads. Instead of one "heads" dimension $h$ across the whole computation, we now have three separate heads dimensions: $h_{k}$, $h$, and $h_{v}$, which can optionally differ in size (number of "heads"). $h_{k}$ refers to the number of attention heads for the keys and the queries. $h$ refers to the number of attention heads for the logits and the weights, and $h_{v}$ refers to the number of attention heads for the values.
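The computation above can be illustrated with a minimal NumPy sketch for a single batch element. The shapes, einsum strings, and function name here are illustrative assumptions, not the paper's reference implementation; $P_l$ and $P_w$ are the two learned head-mixing projections described in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(Q, K, V, P_l, P_w):
    """Sketch of talking-heads attention (single batch element).

    Q:   (h_k, n, d_k)  queries, h_k query/key heads
    K:   (h_k, m, d_k)  keys
    V:   (h_v, m, d_v)  values, h_v value heads
    P_l: (h_k, h)  learned projection mixing attention logits across heads
    P_w: (h, h_v)  learned projection mixing attention weights across heads
    Returns: (h_v, n, d_v)
    """
    # Scaled dot-product logits per query/key head: (h_k, n, m)
    logits = np.einsum('hnd,hmd->hnm', Q, K) / np.sqrt(Q.shape[-1])
    # Talk across heads before the softmax: (h, n, m)
    logits = np.einsum('hnm,hg->gnm', logits, P_l)
    # Softmax over key positions
    weights = softmax(logits, axis=-1)
    # Talk across heads after the softmax: (h_v, n, m)
    weights = np.einsum('gnm,gv->vnm', weights, P_w)
    # Weighted sum of values per value head: (h_v, n, d_v)
    return np.einsum('vnm,vmd->vnd', weights, V)
```

Setting $h_k = h = h_v$ and fixing $P_l$ and $P_w$ to identity matrices recovers standard multi-head attention, which makes the head-mixing projections easy to ablate.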
Source: Talking-Heads Attention