Model Compression
342 papers with code • 2 benchmarks • 4 datasets
Model Compression has been an actively pursued area of research over the last few years, with the goal of deploying state-of-the-art deep networks on low-power, resource-limited devices without a significant drop in accuracy. Parameter pruning, low-rank factorization, and weight quantization are among the methods proposed to reduce the size of deep networks.
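As a concrete illustration of one of these techniques, here is a minimal sketch of magnitude-based weight pruning in PyTorch. The tensor shapes and sparsity level are illustrative assumptions, not drawn from any particular paper on this page.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude weights (sparsity is an illustrative choice)."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # k-th smallest absolute value serves as the pruning threshold
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(256, 256)
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"nonzero fraction: {(w_pruned != 0).float().mean():.2f}")
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and the resulting sparse tensors need sparse storage or hardware support to realize actual savings.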
Latest papers
PromptMM: Multi-Modal Knowledge Distillation for Recommendation with Prompt-Tuning
Additionally, to adjust for the impact of inaccuracies in multimedia data, a disentangled multi-modal list-wise distillation is developed with a modality-aware re-weighting mechanism.
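The paper's list-wise, modality-aware objective is more elaborate than can be shown here; as background, this is a generic soft-target knowledge distillation loss (Hinton-style) on which such methods build. The temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 4.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)
```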
LLM Inference Unveiled: Survey and Roofline Model Insights
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for the systematic analysis of LLM inference techniques.
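To make the roofline idea concrete: attainable throughput is the minimum of a device's peak compute and its memory bandwidth times the workload's arithmetic intensity. The hardware numbers and the 7B-parameter decode example below are illustrative assumptions, not figures from the survey.

```python
def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bandwidth: float) -> float:
    """Time for a kernel under the roofline model: min(compute-bound, memory-bound)."""
    intensity = flops / bytes_moved                      # FLOPs per byte
    attainable = min(peak_flops, intensity * peak_bandwidth)
    return flops / attainable                            # seconds

# Single decode step of a 7B-parameter model in FP16 (illustrative numbers):
params = 7e9
flops = 2 * params        # ~2 FLOPs per parameter per generated token
bytes_moved = 2 * params  # FP16 weights read once per token
t = roofline_time(flops, bytes_moved, peak_flops=300e12, peak_bandwidth=2e12)
print(f"~{t * 1e3:.2f} ms per token (memory-bound)")
```

With an arithmetic intensity of only 1 FLOP/byte, decoding sits far below the compute roof, which is why weight quantization (reducing bytes moved) speeds up LLM inference so effectively.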
A Survey on Knowledge Distillation of Large Language Models
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral.
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
Diffusion models have achieved remarkable success in image generation tasks, yet their practical deployment is constrained by high memory and time consumption.
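QuEST's selective finetuning is beyond a short snippet, but the low-bit quantization it starts from can be sketched. Below is symmetric uniform quantization to a given bit width; the 4-bit setting and tensor shape are illustrative assumptions.

```python
import torch

def quantize_uniform(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization to n_bits; returns the dequantized tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantize to measure the rounding error

w = torch.randn(128, 128)
w_q = quantize_uniform(w, n_bits=4)
print(f"MSE at 4 bits: {(w - w_q).pow(2).mean():.6f}")
```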
The Potential of AutoML for Recommender Systems
We found that AutoML and AutoRecSys libraries performed best.
Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward
Despite the impressive performance of LLMs, their widespread adoption faces challenges due to substantial computational and memory requirements during inference.
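A back-of-the-envelope calculation shows why these requirements are substantial and why lower precision helps; the 7B-parameter model size is an illustrative assumption, and this counts weights only (KV cache and activations add more).

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed just to hold model weights at a given precision."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model @ {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 28.0 GB, 14.0 GB, 7.0 GB, 3.5 GB
```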
LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection
To our knowledge, this is the first time in LiDAR-based 3D detection tasks that a PTQ INT8 model's accuracy is almost the same as the FP32 model's while enjoying a $3\times$ inference speedup.
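The core of post-training quantization is choosing scales from a small calibration set without retraining. Here is a minimal per-tensor INT8 calibration sketch using a max-absolute-value calibrator; real PTQ pipelines (including LiDAR-PTQ) use more robust calibration, so treat this as a simplified assumption.

```python
import torch

def calibrate_scale(activations: torch.Tensor) -> float:
    """Per-tensor scale from the calibration data's max absolute value."""
    return activations.abs().max().item() / 127.0

def quantize_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

calib = torch.randn(1024, 64)          # stand-in for real calibration activations
scale = calibrate_scale(calib)
x_q = quantize_int8(calib, scale)
x_dq = x_q.float() * scale             # dequantize to check the error
print(f"max abs error: {(calib - x_dq).abs().max():.4f}")
```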
TQCompressor: improving tensor decomposition methods in neural networks via permutations
The result of the compression is the TQCompressedGPT-2 model, featuring 81 million parameters.
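TQCompressor's permutation-enhanced decomposition is specific to the paper, but the underlying idea of factorizing weight matrices can be shown with plain truncated SVD. The matrix size and rank below are illustrative assumptions.

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Truncated SVD: replace W (m x n) with A (m x r) @ B (r x n)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(768, 768)
A, B = low_rank_factorize(W, rank=64)
orig, compressed = W.numel(), A.numel() + B.numel()
print(f"params: {orig} -> {compressed} ({compressed / orig:.1%})")
print(f"relative error: {torch.norm(W - A @ B) / torch.norm(W):.3f}")
```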
Communication-Efficient Federated Learning through Adaptive Weight Clustering and Server-Side Distillation
Federated Learning (FL) is a promising technique for the collaborative training of deep neural networks across multiple devices while preserving data privacy.
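Weight clustering cuts FL communication by sending per-weight cluster indices plus a small codebook instead of full-precision tensors. Below is a naive k-means sketch of that idea; the cluster count, initialization, and tensor size are illustrative assumptions, not the paper's adaptive scheme.

```python
import torch

def cluster_weights(weight: torch.Tensor, n_clusters: int = 16, iters: int = 10):
    """Naive k-means over weight values: transmit indices + a small codebook."""
    flat = weight.flatten()
    # Initialize centroids from evenly spaced quantiles of the weight values.
    centroids = torch.quantile(flat, torch.linspace(0, 1, n_clusters))
    for _ in range(iters):
        assign = (flat.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        for c in range(n_clusters):
            members = flat[assign == c]
            if members.numel() > 0:
                centroids[c] = members.mean()
    assign = (flat.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
    return assign.reshape(weight.shape), centroids

w = torch.randn(64, 64)
idx, codebook = cluster_weights(w)
# 4-bit indices + 16 float centroids instead of 4096 full-precision floats.
```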
Model Compression Techniques in Biometrics Applications: A Survey
The development of deep learning algorithms has greatly expanded humanity's capacity to automate tasks.