A Dot Product Attention Free Transformer

We introduce Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers \citep{transformer} that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key, and value. This compositionality enables an implementation in which the attention tensor does not need to be computed or stored explicitly. A DAFT layer has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large models. We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification, as well as on CIFAR10 and Enwik8, two autoregressive modeling tasks. DAFT achieves competitive performance on all of these benchmarks while remaining efficient.
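The abstract does not give the layer's equations, so the following is only an illustrative sketch of what a decomposable per-dimension attention map could look like. It assumes a gated form in which each output feature is sigmoid(Q) times a softmax-weighted average of V, with weights exp(K + w) normalized over context positions. The pairwise bias `w` and the function names `daft_explicit` and `daft_factored` are hypothetical and chosen for illustration. Because exp(K + w) factors into exp(w) times exp(K), the weighted sums can be accumulated directly, without building the T x T x d attention tensor.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def daft_explicit(Q, K, V, w):
    """Reference version: materializes the full T x T x d attention tensor.

    Q, K, V are T x d nested lists; w is a T x T pairwise bias (an assumption).
    """
    T, d = len(Q), len(Q[0])
    # attn[t][s][j]: weight that position t places on position s for feature j
    attn = [[[0.0] * d for _ in range(T)] for _ in range(T)]
    for t in range(T):
        for j in range(d):
            scores = [math.exp(K[s][j] + w[t][s]) for s in range(T)]
            total = sum(scores)
            for s in range(T):
                attn[t][s][j] = scores[s] / total
    return [
        [sigmoid(Q[t][j]) * sum(attn[t][s][j] * V[s][j] for s in range(T))
         for j in range(d)]
        for t in range(T)
    ]


def daft_factored(Q, K, V, w):
    """Same output as daft_explicit, using only T x d intermediates.

    No max-subtraction is applied, so this sketch is not numerically
    stable for large values of K or w.
    """
    T, d = len(Q), len(Q[0])
    exp_k = [[math.exp(K[s][j]) for j in range(d)] for s in range(T)]
    exp_kv = [[exp_k[s][j] * V[s][j] for j in range(d)] for s in range(T)]
    out = []
    for t in range(T):
        exp_w = [math.exp(w[t][s]) for s in range(T)]
        row = []
        for j in range(d):
            num = sum(exp_w[s] * exp_kv[s][j] for s in range(T))
            den = sum(exp_w[s] * exp_k[s][j] for s in range(T))
            row.append(sigmoid(Q[t][j]) * num / den)
        out.append(row)
    return out
```

Under these assumptions the two functions return the same values up to floating-point rounding. The explicit version holds T x T x d attention weights, whereas the factored version keeps only the two T x d arrays `exp_k` and `exp_kv` plus one length-T row of `exp_w` at a time.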


Results from the Paper


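| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | DAFT-full | Top 1 Accuracy | 79.8% | # 668 |
| Image Classification | ImageNet | DAFT-full | Number of params | 22.6M | # 561 |
| Image Classification | ImageNet | DAFT-conv (384 heads, 200 epochs) | Top 1 Accuracy | 80.1% | # 651 |
| Image Classification | ImageNet | DAFT-conv (384 heads, 200 epochs) | Number of params | 23M | # 565 |
| Image Classification | ImageNet | DAFT-conv (16 heads) | Top 1 Accuracy | 80.2% | # 647 |
| Image Classification | ImageNet | DAFT-conv (16 heads) | Number of params | 20.3M | # 533 |
| Image Classification | ImageNet | DAFT-conv (384 heads, 300 epochs) | Top 1 Accuracy | 80.8% | # 615 |
| Image Classification | ImageNet | DAFT-conv (384 heads, 300 epochs) | Number of params | 23M | # 565 |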

Methods