1 code implementation • 26 Mar 2024 • Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.
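The abstract describes dropping whole decoder blocks from a pretrained model; a minimal sketch of that idea is below. It is not the authors' released code: the checkpoint name is a placeholder, and it assumes a LLaMA-style Hugging Face model whose decoder blocks live in `model.model.layers`.

```python
# Hedged sketch of block-wise layer pruning (illustrative, not the paper's implementation).
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any LLaMA-style checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def drop_layers(model, start: int, n_drop: int):
    """Remove a contiguous block of decoder layers [start, start + n_drop)."""
    layers = model.model.layers  # nn.ModuleList of decoder blocks (LLaMA-style)
    kept = [layer for i, layer in enumerate(layers) if not (start <= i < start + n_drop)]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    # Note: cached generation may require re-indexing per-layer attention metadata.
    return model

# Example: drop 8 consecutive deep layers, then re-evaluate on QA benchmarks.
model = drop_layers(model, start=20, n_drop=8)
```

In the paper, which block to remove is chosen by a similarity measure between layer inputs and outputs, and the pruned model is lightly finetuned afterward; the sketch above only shows the pruning step itself.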
1 code implementation • 1 Feb 2024 • Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge
In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both.
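To illustrate the layer pattern the abstract describes, here is a toy PyTorch sketch that pairs a state-space-style sequence mixer with a mixture-of-experts MLP. It is not the BlackMamba implementation: the Mamba mixer is stubbed with a GRU so the example runs without the `mamba_ssm` package, and the top-1 router omits load balancing.

```python
# Hedged toy sketch of an SSM + MoE block in the spirit of BlackMamba.
import torch
from torch import nn

class TopOneMoE(nn.Module):
    """Minimal top-1 routed mixture-of-experts MLP (illustrative only)."""
    def __init__(self, d_model: int, n_experts: int = 8, d_ff: int = 1024):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.router(x).softmax(dim=-1)
        top_w, top_idx = scores.max(dim=-1)          # best expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * top_w[mask].unsqueeze(-1)
        return out

class SSMMoEBlock(nn.Module):
    """Residual sequence mixer followed by a residual MoE MLP."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Stand-in for the Mamba SSM; swap in mamba_ssm.Mamba(d_model) if installed.
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)
        self.moe = TopOneMoE(d_model)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))[0]   # sequence mixing (SSM stand-in)
        x = x + self.moe(self.norm2(x))        # sparse expert MLP
        return x

x = torch.randn(2, 16, 256)
print(SSMMoEBlock(256)(x).shape)  # torch.Size([2, 16, 256])
```

The intended benefit stated in the abstract is to combine the linear-time sequence mixing of the Mamba SSM with the parameter efficiency of sparse expert MLPs; the sketch only shows how the two kinds of sublayer compose within one block.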
no code implementations • 28 Oct 2022 • Aditya Cowsik, Tankut Can, Paolo Glorioso
Commonly used optimization algorithms often show a trade-off between good generalization and fast training times.