
Adaptive Span Transformer

Introduced by Sukhbaatar et al. in Adaptive Attention Span in Transformers

The Adaptive Attention Span Transformer is a Transformer that uses an improved self-attention layer, called adaptive masking, which allows the model to choose its own context size. This results in a network where each attention layer gathers information over its own learned context, which allows the model to scale to input sequences of more than 8k tokens.

The proposal is based on the observation that, with the dense attention of a standard Transformer, every attention head uses the same attention span $S$ and attends over the full context. In practice, however, many heads specialize in local context while only a few attend to the longer sequence. This motivates a variant of self-attention that lets each head learn its own context size via adaptive masking (a sketch is given below).
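Concretely, adaptive masking multiplies each head's attention weights by a soft mask $m_z(x) = \min\left(\max\left(\frac{1}{R}(R + z - x), 0\right), 1\right)$, where $x$ is the distance of a key position from the current query, $z$ is a learnable span per head, and $R$ is a hyperparameter controlling the softness of the ramp; the masked weights are then renormalised. The PyTorch sketch below illustrates only this masking step under those assumptions; the module name, tensor layout, and the `ramp_size` default are illustrative choices, not the authors' released API.

```python
import torch
import torch.nn as nn


class AdaptiveSpanMask(nn.Module):
    """Soft masking of attention weights with a learnable span per head.

    Implements m_z(x) = clamp((R + z - x) / R, 0, 1), where x is the distance
    of a key from the current query, z is a learnable span, and R (ramp_size)
    controls the softness of the ramp.
    """

    def __init__(self, n_heads: int, max_span: int, ramp_size: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp_size = ramp_size
        # One learnable span fraction per head, initialised to 0 (shortest span).
        self.span_frac = nn.Parameter(torch.zeros(n_heads, 1, 1))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, n_heads, query_len, span) attention weights over the
        # last `span` key positions, oldest position first.
        span = attn.size(-1)
        # Distance of each key from the current query: span-1, ..., 1, 0.
        x = torch.arange(span - 1, -1, -1, device=attn.device, dtype=attn.dtype)
        z = self.span_frac.clamp(0, 1) * self.max_span  # current span per head
        mask = ((self.ramp_size + z - x) / self.ramp_size).clamp(0, 1)
        # Apply the soft mask and renormalise so each query row sums to one.
        attn = attn * mask
        return attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)


if __name__ == "__main__":
    attn = torch.softmax(torch.randn(2, 8, 16, 512), dim=-1)  # toy attention weights
    masked = AdaptiveSpanMask(n_heads=8, max_span=512)(attn)
    print(masked.shape)  # torch.Size([2, 8, 16, 512])
```

Because the span parameters remain differentiable, the paper also regularises them (with an $\ell_1$ penalty) so that heads keep their spans as short as the task allows.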

Source: Adaptive Attention Span in Transformers

Tasks


Task                    Papers   Share
Language Modelling      2        40.00%
3D Part Segmentation    1        20.00%
Machine Translation     1        20.00%
Translation             1        20.00%

Categories

Transformers