1 code implementation • 18 Dec 2024 • Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit.
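As a rough illustration of this idea (not the paper's implementation), the sketch below uses NumPy to find the top-variance principal subspace of a batch of activations, keeps coefficients in that subspace at 8-bit, and quantizes the orthogonal remainder to 4-bit; the rank fraction, the symmetric quantizer, and the tensor shapes are all assumptions.

```python
import numpy as np

def quantize_sym(x, bits):
    """Symmetric uniform quantization to the given bit width (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def pca_mixed_precision(acts, rank_fraction=1/8, hi_bits=8, lo_bits=4):
    """Keep the top-variance PCA subspace at hi_bits, the rest at lo_bits.

    acts: (n_tokens, hidden_dim) activation matrix.
    """
    mean = acts.mean(axis=0, keepdims=True)
    centered = acts - mean
    # Principal directions via SVD of the centered activations.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = max(1, int(acts.shape[1] * rank_fraction))
    top, rest = vt[:k], vt[k:]
    # Coefficients in the high-variance subspace -> high precision.
    hi = quantize_sym(centered @ top.T, hi_bits) @ top
    # Coefficients in the remaining subspace -> low precision.
    lo = quantize_sym(centered @ rest.T, lo_bits) @ rest
    return hi + lo + mean

acts = np.random.randn(512, 256).astype(np.float32)
recon = pca_mixed_precision(acts)
print("mean abs reconstruction error:", np.abs(recon - acts).mean())
```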
no code implementations • 18 Oct 2024 • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang
Large language models with high parameter counts are computationally expensive, yet they can be made much more efficient by compressing their weights to very low numerical precision.
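A minimal sketch of what compressing weights to very low precision can look like, assuming 4-bit per-output-channel round-to-nearest quantization (not necessarily the paper's exact scheme):

```python
import numpy as np

def quantize_weights(w, bits=4):
    """Per-output-channel symmetric round-to-nearest quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale  # low-bit integer codes plus one floating-point scale per row

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_weights(w)
w_hat = q * scale  # dequantized weights used at inference time
print("relative weight error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```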
no code implementations • 15 Oct 2024 • Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size.
no code implementations • 12 May 2024 • Sayeh Sharify, Utkarsh Saxena, Zifei Xu, Wanzin Yazar, Ilya Soloveychik, Xin Wang
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges.
no code implementations • 14 Apr 2024 • Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang
We demonstrate that using this custom CUDA kernel improves the throughput of LLM inference by 28%.
1 code implementation • 11 Jul 2023 • Zihao Deng, Sayeh Sharify, Xin Wang, Michael Orshansky
Layerwise bit-widths are assigned by optimizing a new MPQ formulation based on cross-layer quantization errors using an Integer Quadratic Program.
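The snippet below is not the paper's IQP solver; it only illustrates the shape of the problem under assumed numbers: choose per-layer bit widths that minimize a toy quadratic error model with cross-layer interaction terms, subject to an average-bit-width budget, here by exhaustive search over a handful of layers.

```python
import itertools
import numpy as np

# Toy per-layer quantization error for each candidate bit width (assumed values).
candidate_bits = [2, 4, 8]
# error[l][b] ~ quantization error of layer l at candidate_bits[b]
error = np.array([[9.0, 2.0, 0.2],
                  [6.0, 1.5, 0.1],
                  [4.0, 1.0, 0.1],
                  [8.0, 2.5, 0.3]])
# Toy cross-layer coupling: errors in adjacent layers compound (assumed model).
interact = 0.5
budget_avg_bits = 5.0
num_layers = error.shape[0]

best = None
for choice in itertools.product(range(len(candidate_bits)), repeat=num_layers):
    bits = [candidate_bits[c] for c in choice]
    if np.mean(bits) > budget_avg_bits:
        continue  # violates the memory/compute budget constraint
    e = np.array([error[l, c] for l, c in enumerate(choice)])
    # Quadratic objective: per-layer error plus cross-layer interaction terms.
    cost = e.sum() + interact * np.sum(e[:-1] * e[1:])
    if best is None or cost < best[0]:
        best = (cost, bits)

print("best bit widths:", best[1], "cost:", round(best[0], 3))
```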
no code implementations • 10 May 2018 • Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos
A Laconic configuration that uses a 1K-wire weight memory interface outperforms the 2K-wire conventional accelerator by 15.4x and is 1.95x more energy efficient.
no code implementations • 17 Apr 2018 • Alberto Delmas, Sayeh Sharify, Patrick Judd, Kevin Siu, Milos Nikolic, Andreas Moshovos
The per-group precisions are selected statically for the weights and dynamically by hardware for the activations.
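A toy sketch of the idea, where the group size and the minimal-bit-width rule are assumptions: weight groups get a precision chosen once offline, while each activation group gets the smallest bit width that covers the runtime maximum magnitude in that group.

```python
import numpy as np

def bits_needed(group, max_bits=16):
    """Smallest signed fixed-point bit width that covers the group's integer values."""
    peak = int(np.max(np.abs(group)))
    for b in range(2, max_bits + 1):
        if peak <= 2 ** (b - 1) - 1:
            return b
    return max_bits

group_size = 16
# Static: weight precisions are profiled and fixed ahead of time.
weights = np.random.randint(-40, 40, size=(8, group_size))
static_weight_bits = [bits_needed(g) for g in weights]

# Dynamic: activation precisions are picked per group as values arrive at runtime.
activations = np.random.randint(-6, 6, size=(8, group_size))
dynamic_act_bits = [bits_needed(g) for g in activations]

print("weight bits per group:    ", static_weight_bits)
print("activation bits per group:", dynamic_act_bits)
```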
no code implementations • 9 Mar 2018 • Alberto Delmas, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Andreas Moshovos
We show that, during inference with Convolutional Neural Networks (CNNs), more than 2x to 8x ineffectual work can be exposed if, instead of targeting those weights and activations that are zero, we target different combinations of value stream properties.
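As a back-of-the-envelope illustration of "ineffectual work" (the two criteria below, zero values versus zero bits in a 16-bit fixed-point encoding, are assumptions about the kind of value-stream properties meant here, not the paper's exact analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy activations: small magnitudes with many zeros, as is typical after ReLU.
acts = rng.integers(0, 64, size=100_000) * (rng.random(100_000) > 0.5)

total_bit_work = acts.size * 16                        # baseline: 16 bit-serial steps per value
zero_value_skipped = np.count_nonzero(acts == 0) * 16  # skip only values that are exactly zero
effectual_bits = sum(bin(int(a)).count("1") for a in acts)  # skip zero bits inside nonzero values too

print("work left if zero values are skipped:", (total_bit_work - zero_value_skipped) / total_bit_work)
print("work left if only one-bits are processed:", effectual_bits / total_bit_work)
```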
no code implementations • 27 Jul 2017 • Alberto Delmas, Sayeh Sharify, Patrick Judd, Andreas Moshovos
Experiments on image classification CNNs show that on average across all networks studied, TRT outperforms a state-of-the-art bit-parallel accelerator by 1.90x without any loss in accuracy while it is 1.17x more energy efficient.
no code implementations • 23 Jun 2017 • Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, Andreas Moshovos
LM can trade off accuracy for additional improvements in execution performance and energy efficiency, and compares favorably to an accelerator that targeted only activation precisions.
no code implementations • 1 Jun 2017 • Alberto Delmas, Patrick Judd, Sayeh Sharify, Andreas Moshovos
Stripes is a Deep Neural Network (DNN) accelerator that uses bit-serial computation to offer performance that is proportional to the fixed-point precision of the activation values.
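A minimal sketch of bit-serial multiply-accumulate, showing why execution time tracks activation precision; the unsigned bit-plane loop below is an illustrative assumption, not the Stripes microarchitecture.

```python
def bit_serial_dot(activations, weights, precision):
    """Dot product computed one activation bit-plane per step.

    Each step adds (one bit of every activation) * weight, shifted by the bit
    position, so the number of steps -- and hence runtime -- scales linearly
    with `precision`.
    """
    acc = 0
    for bit in range(precision):                    # one step per bit of precision
        plane = [(a >> bit) & 1 for a in activations]
        acc += sum(p * w for p, w in zip(plane, weights)) << bit
    return acc

acts = [3, 7, 1, 0]          # 3-bit activations
wts = [2, -1, 4, 5]
assert bit_serial_dot(acts, wts, precision=3) == sum(a * w for a, w in zip(acts, wts))
print(bit_serial_dot(acts, wts, precision=3))
```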
no code implementations • 29 Apr 2017 • Patrick Judd, Alberto Delmas, Sayeh Sharify, Andreas Moshovos
We also present a modified organization that detects the activations that are deemed ineffectual while fetching them from memory.
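A toy software analogue of that mechanism, where the zero test and the blocked fetch granularity are assumptions: tag activations as ineffectual as they are read, so downstream compute sees only the effectual ones.

```python
import numpy as np

def fetch_and_filter(act_memory, block=16):
    """Stream activations block by block, dropping zero (ineffectual) ones at fetch time."""
    for start in range(0, act_memory.size, block):
        blk = act_memory[start:start + block]
        keep = np.nonzero(blk)[0]          # positions of effectual activations in this block
        yield start + keep, blk[keep]      # forward only effectual values plus their indices

acts = np.maximum(np.random.randn(64), 0.0)  # ReLU output: roughly half zeros
fetched = sum(len(vals) for _, vals in fetch_and_filter(acts))
print(f"forwarded {fetched} of {acts.size} activations to the compute units")
```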