Attention Modules

Graph Self-Attention

Introduced by Ye et al. in BP-Transformer: Modelling Long-Range Context via Binary Partitioning

Graph Self-Attention (GSA) is a self-attention module used in the BP-Transformer architecture, and is based on the graph attentional layer.

For a given node $u$, we update its representation according to its neighbour nodes, formulated as $\mathbf{h}_{u} \leftarrow \text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right)$.

Let $\mathcal{A}\left(u\right)$ denote the set of neighbour nodes of $u$ in $\mathcal{G}$. Then $\text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right)$ is computed as follows:

$$\mathbf{A}^{u} = \text{concat}\left(\left\{\mathbf{h}_{v} \mid v \in \mathcal{A}\left(u\right)\right\}\right)$$

$$\mathbf{Q}^{u}_{i} = \mathbf{h}_{u}\mathbf{W}^{Q}_{i},\quad\mathbf{K}_{i}^{u} = \mathbf{A}^{u}\mathbf{W}^{K}_{i},\quad\mathbf{V}^{u}_{i} = \mathbf{A}^{u}\mathbf{W}_{i}^{V}$$

$$\text{head}^{u}_{i} = \text{softmax}\left(\frac{\mathbf{Q}^{u}_{i}\mathbf{K}_{i}^{u\top}}{\sqrt{d}}\right)\mathbf{V}_{i}^{u}$$

$$\text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right) = \left[\text{head}^{u}_{1}, \dots, \text{head}^{u}_{h}\right]\mathbf{W}^{O}$$

where $d$ is the dimension of $\mathbf{h}$, and $\mathbf{W}^{Q}_{i}$, $\mathbf{W}^{K}_{i}$ and $\mathbf{W}^{V}_{i}$ are trainable parameters of the $i$-th attention head; $\mathbf{W}^{O}$ is the trainable output projection applied to the concatenated heads.
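The equations above can be sketched for a single node as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `gsa`, the per-head weight shapes, and the random test data are all assumptions made for the example; only the computation (query from $\mathbf{h}_{u}$, keys/values from the stacked neighbour representations, scaled dot-product attention per head, concatenation, output projection) follows the formulas.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsa(h_u, A_u, W_Q, W_K, W_V, W_O):
    """Graph Self-Attention for one node u (illustrative sketch).

    h_u : (d,)           representation of node u
    A_u : (n, d)         stacked representations of the neighbours A(u)
    W_Q, W_K, W_V : (heads, d, d_head)  per-head projections (assumed shapes)
    W_O : (heads * d_head, d)           output projection
    """
    d = h_u.shape[0]  # scaling uses sqrt(d), d = dimension of h, as in the text
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        q = h_u @ Wq                               # Q_i^u : (d_head,)
        K = A_u @ Wk                               # K_i^u : (n, d_head)
        V = A_u @ Wv                               # V_i^u : (n, d_head)
        attn = softmax(q @ K.T / np.sqrt(d))       # attention over neighbours: (n,)
        heads.append(attn @ V)                     # head_i^u : (d_head,)
    # Concatenate heads and project back to dimension d.
    return np.concatenate(heads) @ W_O
```

The update leaves the representation in the same dimension $d$, so it can be applied layer by layer, with each node attending only to its neighbours in $\mathcal{G}$ rather than to every token.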
