Attention Mechanisms

Disentangled Attention Mechanism

Introduced by He et al. in DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Disentangled Attention Mechanism is an attention mechanism used in the DeBERTa architecture. In BERT, each word in the input layer is represented by a single vector that is the sum of its word (content) embedding and its position embedding. In DeBERTa, each word is instead represented by two vectors that encode its content and position respectively, and the attention weights among words are computed using disentangled matrices based on their contents and relative positions. This is motivated by the observation that the attention weight of a word pair depends not only on their contents but also on their relative positions. For example, the dependency between the words “deep” and “learning” is much stronger when they occur next to each other than when they occur in different sentences.
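The idea can be made concrete with the three disentangled score terms from the paper: content-to-content, content-to-position, and position-to-content, combined as A[i,j] = Qc[i]·Kc[j] + Qc[i]·Kr[δ(i,j)] + Kc[j]·Qr[δ(j,i)], where δ clips the relative distance into 2k buckets. Below is a minimal single-head NumPy sketch under those assumptions; the function name and the simplified handling (no batching, no multi-head split, no attention masking) are illustrative, not the reference implementation.

```python
import numpy as np

def disentangled_attention_scores(H, P_rel, W_q, W_k, W_qr, W_kr, k=4):
    """Single-head DeBERTa-style disentangled attention scores (a sketch).

    H:     (n, d)   content vectors for n tokens
    P_rel: (2k, d)  relative-position embeddings
    W_*:   (d, d)   projections for content q/k and relative-position q/k
    """
    n, d = H.shape
    Qc, Kc = H @ W_q, H @ W_k            # content queries / keys
    Qr, Kr = P_rel @ W_qr, P_rel @ W_kr  # relative-position queries / keys

    # delta[i, j] = relative distance i - j, clipped into buckets [0, 2k)
    idx = np.arange(n)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                                          # content -> content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)       # content -> position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T     # position -> content

    # the paper scales by 1/sqrt(3d) since three score terms are summed
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```

With the two relative-position projections set to zero, the score reduces to ordinary scaled content-content attention, which shows how the position terms are an additive, disentangled extension rather than a replacement.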

Source: DeBERTa: Decoding-enhanced BERT with Disentangled Attention
