1 code implementation • 8 Jul 2023 • Kevin Christian Wibisono, Yixin Wang
The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights.