no code implementations • 15 Jan 2024 • Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti
In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs).
no code implementations • 5 Jan 2024 • Adnan Hoque, Less Wright, Chih-Chieh Yang, Mudhakar Srivatsa, Raghu Ganti
Our implementation shows improved performance for the skinny matrix-matrix multiplications typical of foundation model inference workloads.
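The "skinny" shape mentioned above can be illustrated with a minimal sketch. The dimensions here are hypothetical, chosen only to show the characteristic M << K, N pattern that arises during autoregressive decoding, where the activation matrix has very few rows (one per in-flight token) while the weight matrix is large:

```python
import numpy as np

# Hypothetical shapes, for illustration only (not taken from the paper):
batch_tokens = 4    # M: in-flight decode tokens -- very small
hidden_dim = 4096   # K: model hidden size -- large
ffn_dim = 11008     # N: FFN projection width -- large

# Skinny GEMM: a (4 x 4096) activation matrix against a (4096 x 11008) weight matrix.
activations = np.random.rand(batch_tokens, hidden_dim).astype(np.float32)
weights = np.random.rand(hidden_dim, ffn_dim).astype(np.float32)

out = activations @ weights
print(out.shape)  # (4, 11008)
```

Because M is tiny relative to K and N, a standard tiled GEMM launches too few work units to keep the GPU busy, which is why such shapes benefit from specialized kernel strategies.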