1 code implementation • 31 Jan 2025 • Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching when processing and generating long outputs, avoiding the recomputation of attention states for previously processed tokens.
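For context, a minimal sketch of plain KV caching in single-head attention during greedy decoding. This is illustrative only: the dimensions, weights, and loop are hypothetical stand-ins, not the method proposed in the paper.

```python
import torch

def attention(q, k, v):
    # q: (1, d); k, v: (t, d) -> single-head attention over cached keys/values
    scores = (q @ k.T) / k.shape[-1] ** 0.5   # (1, t)
    return torch.softmax(scores, dim=-1) @ v  # (1, d)

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

k_cache = torch.empty(0, d)  # grows by one row per decoded token
v_cache = torch.empty(0, d)

x = torch.randn(1, d)        # embedding of the current token
for step in range(8):
    # Only the newest token's key/value are computed; earlier rows are reused.
    k_cache = torch.cat([k_cache, x @ w_k])
    v_cache = torch.cat([v_cache, x @ w_v])
    x = attention(x @ w_q, k_cache, v_cache)  # stand-in for the full layer
```

The cache trades memory for compute: it grows linearly with the number of generated tokens, which is what makes efficient KV storage a deployment concern for long outputs.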
1 code implementation • 23 May 2024 • Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik
In this work, we question the use of the straight-through estimator (STE) for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs.
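A generic PyTorch sketch of STE-based fake quantization, the standard baseline the work questions (not its proposed alternative; the scale value is an arbitrary assumption):

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)   # non-differentiable quantizer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output      # STE: pretend round() is the identity

w = torch.randn(4, requires_grad=True)
scale = 0.1
# Fake-quantize weights in the forward pass of quantization-aware training.
w_q = RoundSTE.apply(w / scale) * scale
loss = (w_q ** 2).sum()
loss.backward()                 # gradients reach w despite the rounding step
print(w.grad)
```

The backward pass simply copies the gradient through the rounding operation, which is exactly the approximation whose optimality the paper examines.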
1 code implementation • 28 Dec 2023 • Artyom Eliseev, Denis Mazur
In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory.
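One common way to fit a Mixture-of-Experts model into limited accelerator memory is to keep only the hottest experts on the GPU and page the rest in from host RAM on demand. The sketch below uses an LRU policy as a simplified illustration; the sizes, names, and eviction rule are assumptions, not the paper's exact algorithm.

```python
import torch
from collections import OrderedDict

N_EXPERTS, GPU_SLOTS, D = 8, 2, 256
experts = [torch.nn.Linear(D, D) for _ in range(N_EXPERTS)]  # resident in host RAM
gpu_cache = OrderedDict()  # expert index -> module currently on the accelerator
device = "cuda" if torch.cuda.is_available() else "cpu"

def get_expert(idx: int) -> torch.nn.Module:
    if idx in gpu_cache:
        gpu_cache.move_to_end(idx)           # mark as most recently used
    else:
        if len(gpu_cache) >= GPU_SLOTS:
            _, evicted = gpu_cache.popitem(last=False)
            evicted.to("cpu")                # page the coldest expert back out
        gpu_cache[idx] = experts[idx].to(device)
    return gpu_cache[idx]

x = torch.randn(1, D, device=device)
for expert_idx in [3, 3, 5, 3, 7]:           # routing decisions (mocked)
    x = get_expert(expert_idx)(x)
```

Because MoE routers tend to reuse a small set of experts over nearby tokens, caching a few experts on the accelerator can avoid most host-to-device transfers.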
2 code implementations • NeurIPS 2021 • Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko
Modern deep learning applications require ever-growing amounts of compute to train state-of-the-art models.
1 code implementation • NeurIPS 2019 • Denis Mazur, Vage Egiazarian, Stanislav Morozov, Artem Babenko
Our main contribution is PRODIGE: a method that learns a weighted graph representation of data end-to-end by gradient descent.
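To make "learns a weighted graph end-to-end by gradient descent" concrete, here is a toy relaxation in PyTorch: each candidate edge gets a learnable logit, edge presence is relaxed with a sigmoid, and a task loss plus a sparsity penalty is minimized. This is a deliberately simplified illustration of the mechanics; PRODIGE's actual probabilistic edge model and objective differ.

```python
import torch

n = 10
points = torch.randn(n, 2)
target = torch.cdist(points, points)        # pairwise distances to encode

edge_logits = torch.zeros(n, n, requires_grad=True)
opt = torch.optim.Adam([edge_logits], lr=0.05)

for step in range(200):
    p_edge = torch.sigmoid(edge_logits)     # relaxed edge probabilities
    # Toy one-hop "graph distance": large wherever an edge is unlikely.
    graph_dist = p_edge * target + (1 - p_edge) * target.max()
    # Fit the targets while penalizing dense graphs.
    loss = ((graph_dist - target) ** 2).mean() + 0.01 * p_edge.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key point is that edge existence is treated as a differentiable quantity, so the graph's structure and weights are optimized jointly with the downstream objective rather than fixed by a heuristic such as k-nearest neighbors.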