
Adaptive Span Transformer

Introduced by Sukhbaatar et al. in Adaptive Attention Span in Transformers

The Adaptive Span Transformer is a Transformer that improves the self-attention layer with adaptive masking, a mechanism that lets the model learn its own context size. As a result, each attention head gathers information over its own context, and the model scales to input sequences of more than 8k tokens.

The proposal is based on the observation that, with the dense attention of a standard Transformer, every attention head attends over the same span $S$ (the full context). In practice, however, many heads specialise in local context, while only some attend over the long sequence. This motivates a variant of self-attention that allows each head to learn its own context size (adaptive masking - see components).
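Adaptive masking can be sketched as a piecewise-linear soft mask over the distance between a query and a key, controlled by a learnable span per head. The sketch below is illustrative only (function names and the NumPy setting are my own, not from the paper); it applies the paper's soft masking function to one head's attention logits and renormalises so each row sums to one.

```python
import numpy as np

def soft_mask(distance, span, ramp=32):
    # Soft masking function from the paper:
    # m_z(x) = min(max((ramp + z - x) / ramp, 0), 1),
    # where z is the (learned) span and x the query-key distance.
    return np.clip((ramp + span - distance) / ramp, 0.0, 1.0)

def adaptive_span_attention(scores, span, ramp=32):
    # scores: (seq_len, seq_len) raw attention logits for a single head
    # in a causal (autoregressive) setting: query i may attend to keys j <= i.
    seq_len = scores.shape[0]
    # Distance of each key j from each query i; negative for future positions.
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    # Zero out future positions entirely; soft-mask past positions by distance.
    mask = np.where(dist >= 0, soft_mask(dist, span, ramp), 0.0)
    # Masked softmax: weight each exponentiated logit by the mask,
    # then renormalise each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * mask
    return weights / np.maximum(weights.sum(axis=-1, keepdims=True), 1e-9)
```

Because the mask decays linearly over the `ramp` window, it is differentiable in the span, so each head's span can be trained jointly with the rest of the model; heads that only need local context shrink their span, which is what makes very long inputs affordable.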

Source: Adaptive Attention Span in Transformers


Tasks


Task                  Papers  Share
Language Modelling    2       50.00%
3D Part Segmentation  1       25.00%
Machine Translation   1       25.00%
