1 code implementation • 19 Mar 2024 • Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell
Our results show that a multi-scale smaller model has learning capacity comparable to that of a larger model, and that pre-training smaller models with Scaling on Scales (S$^2$) can match or even exceed the advantage of larger models.