no code implementations • 18 Oct 2024 • Zifei Xu, Sayeh Sharify, Wanzin Yazar, Tristan Webb, Xin Wang
Large language models with high parameter counts are computationally expensive, yet they can be made much more efficient by compressing their weights to very low numerical precision.
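A minimal sketch of the underlying idea (uniform post-training weight quantization to a low bit-width); the bit-width, per-tensor scale, and tensor shown here are illustrative assumptions, not the paper's method:

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 4):
    """Uniformly quantize a weight tensor to a signed low-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax        # per-tensor scale (illustrative choice)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original full-precision weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_weights(w, bits=4)
print("mean abs reconstruction error:", np.abs(w - dequantize(q, s)).mean())
```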
no code implementations • 15 Oct 2024 • Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang
Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size.
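Such predictable scaling is commonly formalized as a power law in parameter count N, e.g. L(N) = a·N^(-α) + c. The sketch below fits this form with scipy; the functional form and the synthetic data points are assumptions for illustration, not the paper's results:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    # Loss as a function of parameter count N: L(N) = a * N^(-alpha) + c
    return a * n ** (-alpha) + c

# Synthetic (model size, loss) points standing in for real measurements.
n = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss = power_law(n, a=50.0, alpha=0.3, c=1.8) + np.random.normal(0, 0.01, n.shape)

(a, alpha, c), _ = curve_fit(power_law, n, loss, p0=[10.0, 0.2, 1.0])
print(f"fitted: L(N) ~ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
```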
no code implementations • 12 May 2024 • Sayeh Sharify, Utkarsh Saxena, Zifei Xu, Wanzin Yazar, Ilya Soloveychik, Xin Wang
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges.
no code implementations • 14 Apr 2024 • Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang
We demonstrate that a custom CUDA kernel improves the throughput of LLM inference by 28%.
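A minimal sketch of how such a throughput comparison can be measured; the generate() callable and token counts are placeholders, and only the 28% figure comes from the entry:

```python
import time

def measure_throughput(generate, prompts, tokens_per_prompt: int) -> float:
    """Return tokens generated per second for a given generate() callable."""
    start = time.perf_counter()
    for p in prompts:
        generate(p, max_new_tokens=tokens_per_prompt)  # placeholder LLM call
    elapsed = time.perf_counter() - start
    return len(prompts) * tokens_per_prompt / elapsed

# At the reported 28% speedup, a 100 tok/s baseline becomes 128 tok/s:
baseline_tok_per_s = 100.0
print("kernel-accelerated throughput:", baseline_tok_per_s * 1.28, "tok/s")
```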