Graph Self-Attention (GSA) is a self-attention module used in the BP-Transformer architecture, and is based on the graph attentional layer.
For a given node $u$, we update its representation according to its neighbour nodes, formulated as $\mathbf{h}_{u} \leftarrow \text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right)$.
Let $\mathcal{A}\left(u\right)$ denote the set of neighbour nodes of $u$ in $\mathcal{G}$; $\text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right)$ is detailed as follows:
$$ \mathbf{A}^{u} = \text{concat}\left(\left\{\mathbf{h}_{v} \mid v \in \mathcal{A}\left(u\right)\right\}\right) $$
$$ \mathbf{Q}^{u}_{i} = \mathbf{h}_{u}\mathbf{W}^{Q}_{i},\quad \mathbf{K}^{u}_{i} = \mathbf{A}^{u}\mathbf{W}^{K}_{i},\quad \mathbf{V}^{u}_{i} = \mathbf{A}^{u}\mathbf{W}^{V}_{i} $$
$$ \text{head}^{u}_{i} = \text{softmax}\left(\frac{\mathbf{Q}^{u}_{i}{\left(\mathbf{K}^{u}_{i}\right)}^{\top}}{\sqrt{d}}\right)\mathbf{V}^{u}_{i} $$
$$ \text{GSA}\left(\mathcal{G}, \mathbf{h}_{u}\right) = \left[\text{head}^{u}_{1}, \dots, \text{head}^{u}_{h}\right]\mathbf{W}^{O} $$
where $d$ is the dimension of $\mathbf{h}_{u}$, and $\mathbf{W}^{Q}_{i}$, $\mathbf{W}^{K}_{i}$ and $\mathbf{W}^{V}_{i}$ are trainable parameters of the $i$-th attention head; $\mathbf{W}^{O}$ is the trainable output projection.
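The equations above can be sketched for a single node in plain NumPy. This is a minimal illustration, not the paper's implementation: the function name `gsa`, the per-head weight layout, and the tensor shapes are assumptions chosen to mirror the formulas one-to-one.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsa(h_u, A_u, Wq, Wk, Wv, Wo):
    """Graph self-attention for one node u (illustrative sketch).

    h_u: (d,)            representation of node u
    A_u: (n, d)          stacked neighbour representations, i.e. A^u
    Wq, Wk, Wv: (heads, d, d_head)  per-head projections W^Q_i, W^K_i, W^V_i
    Wo: (heads * d_head, d)         output projection W^O
    """
    d = h_u.shape[0]                       # sqrt(d) scaling as in the formula
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        q = h_u @ Wq_i                     # Q^u_i: (d_head,)
        K = A_u @ Wk_i                     # K^u_i: (n, d_head)
        V = A_u @ Wv_i                     # V^u_i: (n, d_head)
        attn = softmax(q @ K.T / np.sqrt(d))   # attention over neighbours: (n,)
        heads.append(attn @ V)             # head^u_i: (d_head,)
    return np.concatenate(heads) @ Wo      # [head_1, ..., head_h] W^O: (d,)
```

In a full model this would be vectorised over all nodes at once, but the per-node loop keeps the correspondence with the per-head equations explicit.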
Source: BP-Transformer: Modelling Long-Range Context via Binary Partitioning

| Task | Papers | Share |
| --- | --- | --- |
| Translation | 3 | 4.62% |
| Sentence | 3 | 4.62% |
| Graph Attention | 3 | 4.62% |
| Autonomous Driving | 2 | 3.08% |
| Graph Classification | 2 | 3.08% |
| Graph Representation Learning | 2 | 3.08% |
| Language Modelling | 2 | 3.08% |
| BIG-bench Machine Learning | 2 | 3.08% |
| Machine Translation | 2 | 3.08% |
| Component | Type |
| --- | --- |
| Scaled Dot-Product Attention | Attention Mechanisms |
| Softmax | Output Functions |