no code implementations • 19 Mar 2024 • Haocheng Xi, Yuxiang Chen, Kang Zhao, Kaijun Zheng, Jianfei Chen, Jun Zhu
Moreover, for a standard transformer block, our method offers an end-to-end training speedup of 1.42x and a 1.49x memory reduction compared to the FP16 baseline.
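The memory reduction against an FP16 baseline suggests lower-precision (INT8) storage of tensors. As a rough illustration only, and not the authors' implementation, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy; all function names are hypothetical:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from INT8 codes and their scale."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize_int8(q, s)

# INT8 uses 1 byte/element vs. 2 bytes for FP16, so quantized tensors shrink
# by up to 2x; scales and any tensors kept in higher precision lower the
# end-to-end saving, which is consistent with a ~1.49x overall figure.
print("max reconstruction error:", np.abs(x - x_hat).max())
```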