ReZero is All You Need: Fast Convergence at Large Depth

10 Mar 2020 · Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley

Deep networks have enabled significant performance gains across domains, but they often suffer from vanishing or exploding gradients. This is especially true for Transformer architectures, where training networks deeper than 12 layers is difficult without large datasets and substantial computational budgets…
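The paper's core proposal, ReZero, addresses this by gating each residual branch with a single learned scalar initialized to zero, so every layer starts as the identity map: x_{i+1} = x_i + α_i F(x_i). Below is a minimal PyTorch sketch of that idea; the class and variable names (ReZeroBlock, alpha, sublayer) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block with a ReZero gate: x + alpha * F(x), alpha initialized to 0.

    `sublayer` is any module F (e.g. a self-attention or feed-forward sublayer).
    Names here are hypothetical, chosen only to illustrate the technique.
    """
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        # One learned scalar per block; starting at zero makes the block
        # compute the identity at initialization, which is the ReZero trick.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.sublayer(x)

# Usage sketch: wrap a small feed-forward sublayer.
block = ReZeroBlock(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
x = torch.randn(8, 64)
out = block(x)  # at initialization, out equals x exactly (alpha == 0)
```

Because the gate starts at zero, gradients initially flow only through the identity path, which is what allows very deep stacks of such blocks to train stably; the network then learns how much of each sublayer to mix in.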

