Memorizing Transformers
3 code implementations • ICLR 2022 • Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy
Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights.
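The paper's remedy is to let attention retrieve from an external, non-trainable memory of past (key, value) pairs via approximate k-nearest-neighbor lookup, so new information can be used at inference time without weight updates. Below is a minimal sketch of that retrieval step under simplified assumptions; the names (`knn_memory_attention`, `mem_keys`, `mem_values`) are illustrative, not from the paper's released code.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def knn_memory_attention(query, mem_keys, mem_values, k=4):
    """Attend over the k nearest (key, value) pairs retrieved from an
    external, non-trainable memory. query: (d,); mem_keys/values: (n, d)."""
    scores = mem_keys @ query         # similarity of the query to every stored key
    top = np.argsort(scores)[-k:]     # indices of the k closest keys
    weights = softmax(scores[top])    # attention restricted to the retrieved pairs
    return weights @ mem_values[top]  # weighted sum of the retrieved values

# Toy usage: store (key, value) pairs once, then recall them at inference time.
rng = np.random.default_rng(0)
d, n = 16, 100
mem_keys = rng.normal(size=(n, d))
mem_values = rng.normal(size=(n, d))
query = mem_keys[42] + 0.01 * rng.normal(size=d)  # query near stored key 42
out = knn_memory_attention(query, mem_keys, mem_values)
# `out` is dominated by mem_values[42]: the stored pair is recalled
# without any gradient update to the model's weights.
```

In the paper this memory attention is combined with ordinary local attention through a learned gate; the sketch shows only the lookup itself.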
Block-Recurrent Transformers
3 code implementations • 11 Mar 2022 • DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur
The recurrent cell is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens.
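A minimal sketch of that idea follows: a block of tokens self-attends and cross-attends to a set of carried state vectors, while the states attend back to the tokens to produce the next states. Learned projections, gating, and positional information are deliberately omitted, and the names (`block_recurrent_cell`, `attend`) are illustrative rather than taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention. q: (m, d); k, v: (n, d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def block_recurrent_cell(tokens, states):
    """One simplified recurrent step over a block of tokens.
    tokens: (block_len, d); states: (num_states, d)."""
    # Tokens self-attend within the block and cross-attend to the states.
    token_out = tokens + attend(tokens, tokens, tokens) + attend(tokens, states, states)
    # States self-attend and cross-attend to the token block -> next states.
    next_states = states + attend(states, states, states) + attend(states, tokens, tokens)
    return token_out, next_states

# Slide over a long sequence block by block, carrying the states forward.
rng = np.random.default_rng(0)
d, block_len, num_states = 32, 8, 4
states = np.zeros((num_states, d))
sequence = rng.normal(size=(5 * block_len, d))
for i in range(0, len(sequence), block_len):
    outputs, states = block_recurrent_cell(sequence[i:i + block_len], states)
```

Because each block is processed with ordinary attention, the cell keeps the parallelism of a transformer layer within a block while the carried states give it recurrence across blocks.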
Deep Learning with Dynamic Computation Graphs
2 code implementations • 7 Feb 2017 • Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, Peter Norvig
However, since the computation graph has a different shape and size for every input, networks that compute over graph structures do not directly support batched training or inference.
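The paper's answer is dynamic batching: graphs of different shapes are evaluated by scheduling nodes in depth order and fusing all nodes at the same depth that apply the same operation into a single batched op. The sketch below applies the idea to toy binary trees; the helper names and data layout are illustrative assumptions, not TensorFlow Fold's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED = rng.normal(size=(8, 4))    # leaf embedding table: 8 feature ids, dim 4
W = rng.normal(size=(8, 4)) * 0.1  # combine weights: (2 * dim, dim)

def depth(t):
    """A leaf is an int feature id; an internal node is a (left, right) tuple."""
    return 0 if isinstance(t, int) else 1 + max(depth(t[0]), depth(t[1]))

def eval_batched(trees):
    """Evaluate trees of different shapes together by grouping nodes by depth:
    one gather for all leaves, then one matmul per level for all combines."""
    by_depth, results = {}, {}
    def collect(t):
        by_depth.setdefault(depth(t), []).append(t)
        if not isinstance(t, int):
            collect(t[0]); collect(t[1])
    for t in trees:
        collect(t)
    for lvl in sorted(by_depth):
        nodes = by_depth[lvl]
        if lvl == 0:
            vecs = EMBED[np.array(nodes)]  # all leaf lookups in one gather
        else:
            pairs = np.stack([np.concatenate([results[id(l)], results[id(r)]])
                              for (l, r) in nodes])
            vecs = np.tanh(pairs @ W)      # all combines at this depth in one matmul
        for node, v in zip(nodes, vecs):
            results[id(node)] = v          # children always sit at a shallower depth
    return [results[id(t)] for t in trees]

# Three trees with different shapes, evaluated in one batched pass.
outputs = eval_batched([(0, (1, 2)), ((3, 4), 5), (6, 7)])
```

The per-level grouping is what restores batching: no matter how irregular the individual trees are, every level reduces to a fixed set of dense ops over stacked inputs.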