no code implementations • 2 Apr 2024 • Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu
Training large Transformers is slow, but recent innovations in GPU architecture give us an advantage.