Disentangled Attention Mechanism

Introduced by He et al. in DeBERTa: Decoding-enhanced BERT with Disentangled Attention

The disentangled attention mechanism is the attention mechanism used in the DeBERTa architecture. In BERT, each word in the input layer is represented by a single vector, the sum of its word (content) embedding and its position embedding. In DeBERTa, each word is instead represented by two vectors that encode its content and its position separately, and the attention weights between words are computed from disentangled matrices based on their contents and relative positions, respectively. This design is motivated by the observation that the attention weight of a word pair depends not only on the words' contents but also on their relative positions. For example, the dependency between the words "deep" and "learning" is much stronger when they occur next to each other than when they appear in different sentences.
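
Concretely, the paper decomposes the attention score between tokens $i$ and $j$ into three terms:

$$A_{i,j} = \underbrace{Q^c_i {K^c_j}^{\top}}_{\text{content-to-content}} + \underbrace{Q^c_i {K^r_{\delta(i,j)}}^{\top}}_{\text{content-to-position}} + \underbrace{K^c_j {Q^r_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}$$

where $Q^c, K^c$ are projections of the content vectors, $Q^r, K^r$ are projections of shared relative-position embeddings, and $\delta(i,j)$ is the relative distance from $i$ to $j$, clipped to $[0, 2k)$ for a maximum relative distance $k$. The scores are scaled by $1/\sqrt{3d}$ before the softmax; the position-to-position term is omitted in the implementation, since relative position embeddings alone add little information.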

Source: DeBERTa: Decoding-enhanced BERT with Disentangled Attention
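
As an illustration, below is a minimal single-head PyTorch sketch of this score decomposition. It is a simplified reading of the formula above, not DeBERTa's actual implementation: the function name `disentangled_attention` and the toy dimensions are hypothetical, and details such as multi-head splitting, masking, and dropout are omitted.

```python
# A minimal single-head sketch of disentangled attention; names and shapes
# here are illustrative assumptions, not DeBERTa's production code.
import torch
import torch.nn.functional as F

def disentangled_attention(H, Wq_c, Wk_c, Wv, P_r, Wq_r, Wk_r, k):
    """Attention with content-to-content (c2c), content-to-position (c2p),
    and position-to-content (p2c) terms."""
    N, d = H.shape
    Qc, Kc, V = H @ Wq_c, H @ Wk_c, H @ Wv   # content projections
    Qr, Kr = P_r @ Wq_r, P_r @ Wk_r          # relative-position projections

    # Clipped relative distance delta(i, j) in [0, 2k), as in the paper.
    idx = torch.arange(N)
    delta = torch.clamp(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    c2c = Qc @ Kc.T                            # Q_i^c . K_j^c
    c2p = torch.gather(Qc @ Kr.T, 1, delta)    # Q_i^c . K_{delta(i,j)}^r
    p2c = torch.gather(Kc @ Qr.T, 1, delta).T  # K_j^c . Q_{delta(j,i)}^r

    A = (c2c + c2p + p2c) / (3 * d) ** 0.5     # scale by sqrt(3d)
    return F.softmax(A, dim=-1) @ V

# Usage with toy shapes: sequence length 8, dimension 16, max distance 4.
torch.manual_seed(0)
N, d, k = 8, 16, 4
H = torch.randn(N, d)                # content hidden states
P_r = torch.randn(2 * k, d)          # shared relative-position embeddings
Wq_c, Wk_c, Wv, Wq_r, Wk_r = (torch.randn(d, d) * d ** -0.5 for _ in range(5))
out = disentangled_attention(H, Wq_c, Wk_c, Wv, P_r, Wq_r, Wk_r, k)
print(out.shape)  # torch.Size([8, 16])
```

Note how the relative-position terms are gathered through the clipped distance matrix `delta`, so the same $2k$ position embeddings are reused for every token pair rather than learning one embedding per absolute position.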
