Search Results for author: Denis Kuznedelev

Found 20 papers, 11 papers with code

Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

1 code implementation • 8 Apr 2025 • Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Erik Schultheis, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh

In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate.
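
A minimal sketch of the shared-cache idea described above (illustrative only, not the paper's implementation; the worker logic and cache entries are placeholders):

    # Toy illustration: two "workers" decode in parallel and read/write a single,
    # concurrently updated cache, so each can see what the other has produced so far.
    import threading

    shared_cache = []                 # stand-in for a concurrently-updated attention cache
    cache_lock = threading.Lock()

    def worker(name, steps):
        for t in range(steps):
            with cache_lock:
                snapshot = list(shared_cache)      # "attend" over everything written so far,
                                                   # including the other worker's entries
            new_entry = (name, t, len(snapshot))   # stand-in for a new key/value pair
            with cache_lock:
                shared_cache.append(new_entry)

    threads = [threading.Thread(target=worker, args=(w, 4)) for w in ("A", "B")]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    print(len(shared_cache))          # 8 entries, interleaved by both workers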

Scale-wise Distillation of Diffusion Models

no code implementations • 20 Mar 2025 • Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, Dmitry Baranchuk

We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively employs next-scale prediction ideas for diffusion-based few-step generators.

Denoising
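
As a rough picture of next-scale prediction in a few-step generator (purely schematic; the actual SwD training and sampling procedure is described in the paper, and the denoiser below is a placeholder):

    # Schematic only: early denoising steps run at a small resolution and the latent is
    # upscaled between steps, so only the last step pays the full-resolution cost.
    import numpy as np

    def denoise(x, step):
        return x                                        # placeholder for a real denoiser call

    def upscale(x, factor=2):
        return np.kron(x, np.ones((factor, factor, 1))) # nearest-neighbour upsample

    x = np.random.randn(8, 8, 3)                        # start from noise at the smallest scale
    for step in range(3):                               # 8x8 -> 16x16 -> 32x32 -> 64x64
        x = denoise(x, step)
        x = upscale(x)
    print(x.shape)                                      # (64, 64, 3)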

Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

1 code implementation • 31 Jan 2025 • Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh

Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation.

Quantization
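
For background, a bare-bones per-channel int8 scheme for a cached key or value tensor (a generic baseline, not the adaptive method the paper proposes):

    # Generic per-channel int8 quantization of a KV-cache tensor (illustrative only).
    import numpy as np

    def quantize_kv(x):                            # x: (seq_len, num_channels), float32
        scale = np.abs(x).max(axis=0) / 127.0 + 1e-12
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_kv(q, scale):
        return q.astype(np.float32) * scale

    keys = np.random.randn(1024, 128).astype(np.float32)
    q, s = quantize_kv(keys)
    print(np.abs(dequantize_kv(q, s) - keys).max())    # small reconstruction error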

Label Privacy in Split Learning for Large Models with Parameter-Efficient Training

1 code implementation • 21 Dec 2024 • Philip Zmushko, Marat Mansurov, Ruslan Svirschevski, Denis Kuznedelev, Max Ryabinin, Aleksandr Beznosikov

Using this analysis, we propose P³EFT, a multi-party split learning algorithm that takes advantage of existing PEFT properties to maintain privacy at a lower performance overhead.

parameter-efficient fine-tuning · Privacy Preserving +1

EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search

1 code implementation • 18 Oct 2024 • Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh

Yet, current methods rely on heuristics for identifying the "importance" of a given layer towards the loss, based on assumptions such as "error monotonicity", i.e., that the end-to-end model compression error is proportional to the sum of layer-wise errors.

Model Compression · Quantization
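
The "error monotonicity" assumption above treats end-to-end error as roughly the sum of per-layer errors; EvoPress instead searches for a per-layer compression assignment directly. A toy evolutionary search over per-layer levels under a global budget, with a made-up fitness function standing in for evaluating the real compressed model, might look like this:

    # Toy evolutionary search over per-layer compression levels (illustrative only;
    # the fitness function is a stand-in for measuring the actual compressed model).
    import random

    NUM_LAYERS, LEVELS, BUDGET = 12, [2, 3, 4], 36      # e.g. bit-widths, avg. 3 per layer

    def fitness(assignment):
        return -sum((4 - b) ** 2 for b in assignment)   # placeholder objective

    def mutate(assignment):
        child = list(assignment)
        child[random.randrange(NUM_LAYERS)] = random.choice(LEVELS)
        return child if sum(child) <= BUDGET else list(assignment)

    population = [[3] * NUM_LAYERS for _ in range(8)]
    for _ in range(50):
        population = sorted(population, key=fitness, reverse=True)[:4]        # select
        population += [mutate(random.choice(population)) for _ in range(4)]   # mutate
    print(max(population, key=fitness))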

Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

no code implementations • 31 Aug 2024 • Vage Egiazarian, Denis Kuznedelev, Anton Voronov, Ruslan Svirschevski, Michael Goin, Daniil Pavlov, Dan Alistarh, Dmitry Baranchuk

Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo), and show that diffusion models of 2B+ parameters compressed to around 3 bits using VQ exhibit image quality and textual alignment similar to those of previous 4-bit compression techniques.

Image Generation · Quantization
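
A minimal picture of the vector quantization (VQ) step itself, i.e. replacing each small group of weights by the index of its nearest codeword (generic nearest-neighbour assignment, not the paper's full pipeline):

    # Generic vector quantization of weight groups: each group of d weights is stored
    # as one index into a small codebook (here 8 weights -> one 8-bit index, ~1 bit/weight).
    import numpy as np

    d, codebook_size = 8, 256
    weights = np.random.randn(4096, d).astype(np.float32)
    codebook = np.random.randn(codebook_size, d).astype(np.float32)   # learned in practice

    dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(axis=1)                      # one codeword index per weight group
    reconstructed = codebook[codes]
    print(codes.shape, reconstructed.shape)           # (4096,) (4096, 8)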

The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

no code implementations • 30 Aug 2024 • Diyuan Wu, Ionut-Vlad Modoranu, Mher Safaryan, Denis Kuznedelev, Dan Alistarh

The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs.
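
For reference, the classic Optimal Brain Surgeon step that this line of work builds on removes the weight $w_q$ with the smallest saliency and compensates the remaining weights using the inverse Hessian $H^{-1}$ of the loss:

    \delta L \approx \frac{w_q^2}{2\,[H^{-1}]_{qq}}, \qquad
    \delta w = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q,

where $e_q$ is the $q$-th standard basis vector; the paper above revisits this second-order update in an iterative, faster form.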

Does Diffusion Beat GAN in Image Super Resolution?

1 code implementation • 27 May 2024 • Denis Kuznedelev, Valerii Startsev, Daniil Shlenskii, Sergey Kastryulin

There is a prevalent opinion that diffusion-based models outperform GAN-based counterparts in the Image Super Resolution (ISR) problem.

Image Super-Resolution

PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

1 code implementation • 23 May 2024 • Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik

In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs.

Quantization
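
For background, the straight-through estimator (STE) that the paper argues can be sub-optimal for extreme compression is typically implemented by letting gradients pass "through" the non-differentiable rounding step, e.g. (a generic PyTorch sketch, not PV-Tuning itself):

    # Standard straight-through estimator: the forward pass uses quantized weights,
    # the backward pass pretends rounding is the identity.
    import torch

    def ste_quantize(w, scale):
        w_q = torch.round(w / scale) * scale       # non-differentiable rounding
        return w + (w_q - w).detach()              # gradient flows as if w_q were w

    w = torch.randn(16, requires_grad=True)
    loss = ste_quantize(w, scale=0.1).sum()
    loss.backward()
    print(w.grad)                                  # all ones: the identity gradient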

Extreme Compression of Large Language Models via Additive Quantization

1 code implementation • 11 Jan 2024 • Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices.

Information Retrieval · Quantization
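
Additive quantization, in broad strokes, stores each group of weights as a sum of codewords drawn from several small codebooks; a schematic decode step (names and sizes are illustrative, not the paper's implementation) is:

    # Schematic additive-quantization decode: each weight group is the sum of M codewords,
    # one per codebook, so only M small indices are stored per group.
    import numpy as np

    M, codebook_size, d = 2, 256, 8
    codebooks = np.random.randn(M, codebook_size, d).astype(np.float32)  # learned in practice
    codes = np.random.randint(0, codebook_size, size=(4096, M))          # stored per group

    decoded = sum(codebooks[m][codes[:, m]] for m in range(M))           # (4096, 8) weights
    print(decoded.shape)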

Sparse Fine-tuning for Inference Acceleration of Large Language Models

1 code implementation • 10 Oct 2023 • Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh

While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth.

Quantization · Text Generation +1
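
The memory-bandwidth point is back-of-the-envelope arithmetic: when decoding is memory-bound, tokens per second is roughly bandwidth divided by the bytes of weights read per token, so storing the weights in a smaller (sparse, compressed) form speeds up generation proportionally. The numbers below are illustrative, not measurements from the paper:

    # Illustrative arithmetic only (hypothetical hardware and model sizes).
    bandwidth_gb_s = 800            # accelerator memory bandwidth
    dense_fp16_gb = 14              # e.g. a 7B-parameter model stored in fp16
    sparse_compressed_gb = 7        # same model with weights in a compressed sparse format

    tokens_per_s_dense = bandwidth_gb_s / dense_fp16_gb
    tokens_per_s_sparse = bandwidth_gb_s / sparse_compressed_gb
    print(round(tokens_per_s_dense, 1), round(tokens_per_s_sparse, 1))   # ~57.1 vs ~114.3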

Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

no code implementations • 3 Aug 2023 • Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community.

Model Compression · Network Pruning +1

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

no code implementations • 25 Mar 2023 • Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

To address this, we ask: can we quickly compress large generalist models into accurate and efficient specialists?

A critical look at the evaluation of GNNs under heterophily: Are we really making progress?

3 code implementations • 22 Feb 2023 • Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, Liudmila Prokhorenkova

Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs.

Graph Representation Learning · Node Classification
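
For readers unfamiliar with the terminology: a common way to quantify this property is edge homophily, the fraction of edges whose endpoints share a class label; graphs where this fraction is low are called heterophilous. A small generic helper (not tied to the paper's benchmarks):

    # Edge homophily: fraction of edges that connect nodes of the same class.
    def edge_homophily(edges, labels):
        same = sum(labels[u] == labels[v] for u, v in edges)
        return same / len(edges)

    labels = {0: "a", 1: "a", 2: "b", 3: "b"}
    edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
    print(edge_homophily(edges, labels))           # 0.5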

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

no code implementations • NeurIPS 2023 • Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

To further showcase CAP's accuracy and scalability, we use it to show for the first time that extremely-accurate large vision models, trained via self-supervised techniques, can also be pruned to moderate sparsities, with negligible accuracy loss.

Image Classification · Quantization
