Uncovering hidden geometry in Transformers via disentangling position and context

1 code implementation · 7 Oct 2023 · Jiajun Song, Yiqiao Zhong

Given an embedding vector $\boldsymbol{h}_{c, t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c, t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c, t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ is the mean vector at position $t$ taken across contexts, $\mathbf{ctx}_c$ is the mean vector of context $c$ taken across positions, and $\mathbf{resid}_{c, t}$ is the residual vector.
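The decomposition can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the embeddings are stacked into an array `h` of shape `(C, T, d)`, and that $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are centered by subtracting the global mean $\boldsymbol{\mu}$ (the usual two-way mean-effects convention), so that the four terms sum back to $\boldsymbol{h}_{c, t}$ exactly. The function name `decompose` and the random test data are hypothetical.

```python
import numpy as np


def decompose(h):
    """Split embeddings h of shape (C, T, d) into mean effects.

    Assumption: pos and ctx are centered by the global mean, so that
    h[c, t] == mu + pos[t] + ctx[c] + resid[c, t] holds exactly.
    """
    mu = h.mean(axis=(0, 1))        # global mean, shape (d,)
    pos = h.mean(axis=0) - mu       # average over contexts, shape (T, d)
    ctx = h.mean(axis=1) - mu       # average over positions, shape (C, d)
    # Whatever the three mean effects do not explain is the residual.
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]
    return mu, pos, ctx, resid


# Hypothetical data: C = 4 contexts, T = 6 positions, d = 8 dimensions.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 6, 8))

mu, pos, ctx, resid = decompose(h)
recon = mu + pos[None, :, :] + ctx[:, None, :] + resid
```

Under this centering convention, `pos` averages to zero over positions, `ctx` averages to zero over contexts, and `resid` averages to zero along both the context and the position axis.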

Dictionary Learning · POS

