BigBird

Introduced by Zaheer et al. in Big Bird: Transformers for Longer Sequences

BigBird is a Transformer with a sparse attention mechanism that reduces the quadratic dependency of self-attention to linear in the number of tokens. BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. In particular, BigBird consists of three main parts:

  • A set of $g$ global tokens attending to all parts of the sequence.
  • All tokens attending to a set of $w$ local neighboring tokens.
  • All tokens attending to a set of $r$ random tokens.

Together, these patterns yield a high-performing attention mechanism that scales to sequences up to 8x longer than was previously possible on similar hardware.
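The three patterns can be visualized as a binary mask over query-key pairs. Below is a minimal NumPy sketch of such a mask; the function name and parameter defaults (`num_global`, `window`, `num_random`) are illustrative assumptions, and the released implementation uses block-sparse attention for hardware efficiency, which this sketch omits.

```python
import numpy as np

def bigbird_attention_mask(seq_len, num_global=2, window=3, num_random=3, seed=0):
    """Illustrative BigBird-style sparse attention mask (not the official API).

    mask[i, j] == 1 means query token i may attend to key token j.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=np.int8)

    # 1) Global tokens: the first `num_global` tokens attend everywhere,
    #    and every token attends to them.
    mask[:num_global, :] = 1
    mask[:, :num_global] = 1

    # 2) Sliding window: each token attends to `window` neighbors on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = 1

    # 3) Random tokens: each token attends to `num_random` randomly chosen keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = 1

    return mask

mask = bigbird_attention_mask(seq_len=16)
print(mask)
```

The number of nonzero entries grows as O(seq_len * (num_global + window + num_random)), i.e. linearly in sequence length, versus O(seq_len^2) for full attention.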

Source: Big Bird: Transformers for Longer Sequences
