
Fixed Factorized Attention

Introduced by Child et al. in Generating Long Sequences with Sparse Transformers

Fixed Factorized Attention is a factorized attention pattern where specific cells summarize previous locations and propagate that information to all future cells. It was proposed as part of the Sparse Transformer architecture.

A self-attention layer maps a matrix of input embeddings $X$ to an output matrix and is parameterized by a connectivity pattern $S = \left\{S_{1}, \dots, S_{n}\right\}$, where $S_{i}$ denotes the set of indices of the input vectors to which the $i$th output vector attends. The output vector is a weighted sum of transformations of the input vectors:

$$ \text{Attend}\left(X, S\right) = \left(a\left(\mathbf{x}_{i}, S_{i}\right)\right)_{i\in\left\{1,\dots,n\right\}}$$

$$ a\left(\mathbf{x}_{i}, S_{i}\right) = \text{softmax}\left(\frac{\left(W_{q}\mathbf{x}_{i}\right)K^{T}_{S_{i}}}{\sqrt{d}}\right)V_{S_{i}} $$

$$ K_{S_{i}} = \left(W_{k}\mathbf{x}_{j}\right)_{j\in S_{i}} $$

$$ V_{S_{i}} = \left(W_{v}\mathbf{x}_{j}\right)_{j\in S_{i}} $$

Here $W_{q}$, $W_{k}$, and $W_{v}$ represent the weight matrices which transform a given $\mathbf{x}_{i}$ into a query, key, or value, and $d$ is the inner dimension of the queries and keys. The output at each position is a sum of the values weighted by the scaled dot-product similarity of the keys and queries.
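The Attend operation above can be sketched directly in NumPy. This is an illustrative reimplementation from the definitions, not the authors' code; it simply gathers keys and values at the indices in each $S_{i}$:

```python
import numpy as np

def attend(X, S, W_q, W_k, W_v):
    """Attention restricted to a connectivity pattern S.

    X : (n, d_in) input embeddings.
    S : list of index lists; S[i] holds the positions the i-th
        output vector attends to.
    W_q, W_k, W_v : (d, d_in) projection matrices.
    """
    d = W_q.shape[0]                    # inner dimension of queries/keys
    outputs = []
    for i, S_i in enumerate(S):
        q = W_q @ X[i]                                 # query for position i
        K = np.stack([W_k @ X[j] for j in S_i])        # keys restricted to S_i
        V = np.stack([W_v @ X[j] for j in S_i])        # values restricted to S_i
        scores = (K @ q) / np.sqrt(d)                  # scaled dot products
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                       # softmax over S_i only
        outputs.append(weights @ V)                    # weighted sum of values
    return np.stack(outputs)
```

With $S_{i} = \{j : j \leq i\}$ this recovers full causal attention; any factorized pattern is obtained by passing smaller index sets. Note that when $S_{0} = \{0\}$, the softmax is over a single element, so the first output is exactly $W_{v}\mathbf{x}_{0}$.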

Full self-attention for autoregressive models defines $S_{i} = \left\{j : j \leq i\right\}$, allowing every element to attend to all previous positions and its own position.
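In matrix form, this connectivity is just a lower-triangular boolean mask (entry $(i, j)$ is true iff output $i$ may attend to input $j$); a minimal sketch:

```python
import numpy as np

n = 6
# S_i = {j : j <= i} written as a boolean connectivity matrix.
mask = np.arange(n)[None, :] <= np.arange(n)[:, None]
# Equivalent to the lower triangle (including the diagonal) of an all-ones matrix.
assert np.array_equal(mask, np.tril(np.ones((n, n), dtype=bool)))
```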

Factorized self-attention instead has $p$ separate attention heads, where the $m$th head defines a subset of the indices $A_{i}^{(m)} \subset \left\{j : j \leq i\right\}$ and lets $S_{i} = A_{i}^{(m)}$. The goal with the Sparse Transformer was to find efficient choices for the subsets $A$.

Formally, for Fixed Factorized Attention, $A^{(1)}_{i} = \left\{j : \lfloor j/l \rfloor = \lfloor i/l \rfloor\right\}$, where the brackets denote the floor operation, and $A^{(2)}_{i} = \left\{j : j \bmod l \in \left\{t, t+1, \ldots, l\right\}\right\}$, where $t = l - c$ and $c$ is a hyperparameter. The $i$th output vector of the attention head attends to all input vectors from either $A^{(1)}_{i}$ or $A^{(2)}_{i}$. This pattern can be visualized in the figure to the right.
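The two index sets can be computed directly from these definitions. A sketch (the function name and the explicit intersection with the causal constraint $j \leq i$ are my additions), reading the residues $j \bmod l$ as values in $0, \dots, l-1$:

```python
def fixed_pattern(i, l, c):
    """Index sets A1, A2 of fixed factorized attention at position i.

    l is the block length (stride) and c the summary width; t = l - c.
    Illustrative only: derived from the set definitions, not the
    authors' implementation.
    """
    t = l - c
    causal = range(i + 1)                            # enforce j <= i
    A1 = {j for j in causal if j // l == i // l}     # same block as i
    A2 = {j for j in causal if j % l >= t}           # summary columns of every block
    return A1, A2
```

For example, with $l = 4$ and $c = 1$, position $i = 6$ gets $A^{(1)}_{6} = \{4, 5, 6\}$ (its own block) and $A^{(2)}_{6} = \{3\}$ (the summary column of the preceding block).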

If the stride $l$ is 128 and $c = 8$, then all future positions greater than 128 can attend to positions 120-128, all positions greater than 256 can attend to 248-256, and so forth.
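With zero-indexed positions and residues taken in $0, \dots, l-1$, the summary columns of the first block are positions 120 through 127; a quick numerical check of this claim against the $A^{(2)}$ definition:

```python
l, c = 128, 8
t = l - c                                   # t = 120
# Every position greater than 128 attends, via A2, to the summary
# columns 120..127 of the first block.
for i in range(129, 512):
    A2 = {j for j in range(i + 1) if j % l >= t}
    assert set(range(120, 128)) <= A2
```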

A fixed-attention pattern with $c = 1$ limits the expressivity of the network significantly, as many representations in the network are only used for one block, whereas a small number of locations are used by all blocks. The authors found that choosing $c \in \left\{8, 16, 32\right\}$ for typical values of $l \in \left\{128, 256\right\}$ performs well, although this increases the computational cost of the method by $c$ in comparison to strided attention.

Additionally, the authors found that when using multiple heads, having them attend to distinct subblocks of length $c$ within the block of size $l$ was preferable to having them attend to the same subblock.

Source: Generating Long Sequences with Sparse Transformers
