Cross-Covariance Attention, or XCA, is an attention mechanism that operates along the feature dimension instead of the token dimension used in conventional transformers.
Using the definitions of queries, keys and values from conventional attention, the cross-covariance attention function is defined as:
$$ \text{XC-Attention}(Q, K, V) = V \mathcal{A}_{\mathrm{XC}}(K, Q), \quad \mathcal{A}_{\mathrm{XC}}(K, Q) = \operatorname{Softmax}\left(\hat{K}^{\top} \hat{Q} / \tau\right) $$
where $\hat{Q}$ and $\hat{K}$ are the $\ell_{2}$-normalized query and key matrices, $\tau$ is a learnable temperature parameter, and each output token embedding is a convex combination of the $d_{v}$ features of its corresponding token embedding in $V$. The attention weights $\mathcal{A}_{\mathrm{XC}}$ are computed based on the cross-covariance matrix of the queries and keys.
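The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the temperature `tau` is fixed here rather than learned, there is no multi-head split, and the softmax is taken so that each column of the $d \times d$ attention map sums to one, making every output feature a convex combination of that token's features in $V$.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xc_attention(Q, K, V, tau=1.0):
    """Cross-covariance attention over (n_tokens, d) matrices.

    The attention map is d x d (feature-by-feature) rather than
    n_tokens x n_tokens as in conventional token attention.
    """
    # L2-normalize each feature column of Q and K (the hatted matrices)
    Q_hat = Q / np.linalg.norm(Q, axis=0, keepdims=True)
    K_hat = K / np.linalg.norm(K, axis=0, keepdims=True)
    # Cross-covariance attention map: columns sum to 1 after softmax,
    # so V @ A mixes features convexly for every token
    A = softmax(K_hat.T @ Q_hat / tau, axis=0)
    return V @ A

# Usage: cost of the attention map scales with d^2, not n_tokens^2
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
out = xc_attention(Q, K, V)   # shape (5, 4), same as V
```

Because the attention map is $d \times d$, the cost of computing it grows quadratically in the feature dimension but only linearly in the number of tokens, which is the key scalability property XCiT exploits for high-resolution images.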
Source: XCiT: Cross-Covariance Image Transformers
| Task | Papers | Share |
|---|---|---|
| Image Classification | 3 | 25.00% |
| Denoising | 1 | 8.33% |
| Motion Magnification | 1 | 8.33% |
| Quantization | 1 | 8.33% |
| Decoder | 1 | 8.33% |
| Pose Estimation | 1 | 8.33% |
| Instance Segmentation | 1 | 8.33% |
| Object Detection | 1 | 8.33% |
| Self-Supervised Image Classification | 1 | 8.33% |