A Dot Product Attention Free Transformer
We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers \citep{transformer} that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key, and value. This compositionality enables an implementation in which the attention tensor never needs to be computed or stored explicitly. A DAFT layer has memory complexity that is linear in both the context size and the feature dimension, making it compatible with both large inputs and large models. We also introduce DAFT-conv, a model variant that exploits locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification as well as CIFAR10 and Enwik8, two autoregressive modeling tasks. DAFT achieves competitive performance on all benchmarks while offering excellent efficiency.
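The abstract does not spell out the layer equations, but the description (sigmoid-free of dot products, decomposable per-dimension attention, linear memory in context size and feature dimension) is consistent with an AFT-full-style formulation, where each output position is a sigmoid-gated, key-weighted average of the values with a learned pairwise position bias. The sketch below is a minimal, hedged illustration of that idea in PyTorch; the class name `DAFTFullSketch`, the tensor shapes, and the bias parameterization `w` are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class DAFTFullSketch(nn.Module):
    """Minimal sketch of a dot-product-free attention layer.

    Assumed formulation (not stated in the abstract):
        Y_t = sigmoid(Q_t) * sum_s exp(K_s + w_{t,s}) * V_s / sum_s exp(K_s + w_{t,s})
    Because exp(K_s + w_{t,s}) factors into exp(w_{t,s}) * exp(K_s), the sums
    reduce to matrix products, so no (T, T, d) attention tensor is ever stored:
    memory stays linear in both sequence length and feature dimension.
    """

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Learned pairwise position bias (max_len x max_len); hypothetical parameterization.
        self.w = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        _, T, _ = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        ew = torch.exp(self.w[:T, :T])                           # (T, T)
        # Subtracting the per-feature max stabilizes exp(k); it cancels in num/den.
        ek = torch.exp(k - k.max(dim=1, keepdim=True).values)    # (B, T, D)

        num = torch.einsum('ts,bsd->btd', ew, ek * v)            # (B, T, D)
        den = torch.einsum('ts,bsd->btd', ew, ek)                # (B, T, D)
        return torch.sigmoid(q) * num / den


# Usage sketch: a batch of 2 sequences, length 16, feature dim 64.
if __name__ == "__main__":
    layer = DAFTFullSketch(dim=64, max_len=128)
    y = layer(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```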
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Image Classification | ImageNet | DAFT-full | Top 1 Accuracy | 79.8% | #676
Image Classification | ImageNet | DAFT-full | Number of params | 22.6M | #570
Image Classification | ImageNet | DAFT-conv (384 heads, 200 epochs) | Top 1 Accuracy | 80.1% | #659
Image Classification | ImageNet | DAFT-conv (384 heads, 200 epochs) | Number of params | 23M | #574
Image Classification | ImageNet | DAFT-conv (16 heads) | Top 1 Accuracy | 80.2% | #655
Image Classification | ImageNet | DAFT-conv (16 heads) | Number of params | 20.3M | #541
Image Classification | ImageNet | DAFT-conv (384 heads, 300 epochs) | Top 1 Accuracy | 80.8% | #623
Image Classification | ImageNet | DAFT-conv (384 heads, 300 epochs) | Number of params | 23M | #574