no code implementations • 2 Oct 2022 • Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins
Given the wide and ever-growing range of efficient Transformer attention mechanisms, it is important to identify which attention mechanism is most effective for a given task.
no code implementations • 2 Oct 2022 • Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins
We demonstrate that wide, single-layer Transformer models can compete with or outperform deeper ones on a variety of Natural Language Processing (NLP) tasks when both are trained from scratch.
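To illustrate the structural contrast being compared, here is a minimal sketch assuming PyTorch-style encoder modules; the widths, head counts, layer counts, and input shapes are illustrative placeholders, not the configurations studied in the paper.

```python
# Minimal sketch (assumption: standard PyTorch Transformer encoder modules;
# all sizes below are illustrative, not the paper's settings, and are not
# parameter-matched the way a controlled comparison would be).
import torch
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

d_model = 128  # token embedding width (illustrative)

# Deep-and-narrow baseline: several stacked layers, few heads each.
deep = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                               dim_feedforward=256, batch_first=True),
    num_layers=6,
)

# Wide single-layer alternative: one layer with many heads and a
# larger feed-forward block instead of extra depth.
wide = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=32,
                               dim_feedforward=1536, batch_first=True),
    num_layers=1,
)

x = torch.randn(8, 64, d_model)  # (batch, sequence, features)
print("deep params:", count_params(deep), "-> output", deep(x).shape)
print("wide params:", count_params(wide), "-> output", wide(x).shape)
```

Both models map the same input shape to the same output shape; the difference is whether capacity is spent on depth (stacked layers) or on width within a single layer.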