Attention Patterns

Sliding Window Attention

Introduced by Beltagy et al. in Longformer: The Long-Document Transformer

Sliding Window Attention is an attention pattern for attention-based models, proposed as part of the Longformer architecture. It is motivated by the fact that full (non-sparse) self-attention in the original Transformer has $O\left(n^{2}\right)$ time and memory complexity, where $n$ is the input sequence length, and therefore does not scale efficiently to long inputs. Given the importance of local context, the sliding window pattern restricts each token to a fixed-size attention window around it. Stacking multiple layers of such windowed attention yields a large receptive field: the top layers have access to all input locations and can build representations that incorporate information from across the entire input.

More formally, given a fixed window size $w$, each token attends to $\frac{1}{2}w$ tokens on each side. The computational complexity of this pattern is $O\left(n \times w\right)$, which scales linearly with the input sequence length $n$. For the pattern to be efficient, $w$ should be small compared with $n$. Even so, a model built from multiple stacked Transformer layers still has a large receptive field. This is analogous to CNNs, where stacking layers of small kernels produces high-level features built from a large portion of the input (the receptive field).
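The window rule above can be illustrated with a minimal NumPy sketch. It is not the Longformer implementation: the function names `sliding_window_mask` and `sliding_window_attention` are hypothetical, the example is single-head and unbatched, and it builds a dense $n \times n$ score matrix purely for readability, so it shows the pattern but not the $O\left(n \times w\right)$ cost, which requires a banded implementation.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean (n, n) mask: entry (i, j) is True when |i - j| <= w // 2,
    i.e. token i may attend to w // 2 tokens on each side (and to itself)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

def sliding_window_attention(q, k, v, w):
    """Single-head scaled dot-product attention restricted to a sliding window.

    q, k, v: arrays of shape (n, d). Dense illustration only: the full score
    matrix is computed and then masked.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = sliding_window_mask(n, w)
    # Positions outside the window get -inf so their softmax weight is zero.
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over each row (the diagonal is always allowed,
    # so every row has at least one finite score).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out = sliding_window_attention(q, k, v, w=2)  # shape (8, 4)
```

With `w=2`, each output row depends only on the token itself and its immediate left and right neighbours; tokens at the sequence edges simply have fewer neighbours.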

In this case, for a Transformer with $l$ layers, the receptive field size is $l \times w$ (assuming $w$ is fixed for all layers). Depending on the application, it may help to use a different value of $w$ in each layer to balance efficiency against model representation capacity.
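The growth of the receptive field can be checked numerically with a short sketch. The helper `receptive_field` is hypothetical and assumes each layer lets a token see `w // 2` neighbours on each side; it propagates reachability through one boolean window mask per layer. Away from the sequence edges and for even window sizes, the count comes out as the sum of the window sizes plus one (the token itself), which matches the $l \times w$ estimate up to that constant.

```python
import numpy as np

def receptive_field(n, windows, position):
    """Count the input positions that can influence `position` after stacking
    one sliding-window layer per entry of `windows` (one window size per layer)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    reach = np.eye(n, dtype=int)  # before any layer, a token only sees itself
    for w in windows:
        mask = (dist <= w // 2).astype(int)
        # After this layer, i reaches j if i attends to some m that reached j.
        reach = ((mask @ reach) > 0).astype(int)
    return int(reach[position].sum())

fixed = receptive_field(n=101, windows=[4, 4, 4], position=50)   # 3 * 4 + 1 = 13
varied = receptive_field(n=101, windows=[2, 4, 8], position=50)  # 2 + 4 + 8 + 1 = 15
```

The second call illustrates the per-layer variation mentioned above: small windows in some layers and larger ones in others still add up to a wide receptive field.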

Source: Longformer: The Long-Document Transformer

