Search Results for author: Amey Agrawal

Found 6 papers, 2 papers with code

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

no code implementations • 4 Mar 2024 • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency.

Scheduling

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

no code implementations • 31 Aug 2023 • Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes (see the sketch below).

Language Modelling • Large Language Model
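Because the abstract snippet states the batch-construction rule concretely, here is a minimal Python sketch of how a decode-maximal batch with chunked prefills might be assembled. It is an illustration under assumed abstractions, not the SARATHI implementation: the `Request` fields, `build_decode_maximal_batch`, `token_budget`, and `chunk_size` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    request_id: int
    prefill_tokens_left: int       # prompt tokens not yet prefilled
    in_decode_phase: bool = False  # True once the full prompt has been prefilled

def build_decode_maximal_batch(requests: List[Request],
                               token_budget: int = 512,
                               chunk_size: int = 128) -> Dict[str, object]:
    """Assemble one batch: at most one prefill chunk, remaining slots are decodes."""
    prefill: Optional[Tuple[int, int]] = None  # (request_id, chunk length)
    prefill_req_id: Optional[int] = None
    decodes: List[int] = []
    tokens_used = 0

    # 1. Chunked prefill: take one equal-sized chunk from a single pending prefill.
    for req in requests:
        if not req.in_decode_phase:
            chunk = min(chunk_size, req.prefill_tokens_left)
            prefill = (req.request_id, chunk)
            prefill_req_id = req.request_id
            req.prefill_tokens_left -= chunk
            if req.prefill_tokens_left == 0:
                req.in_decode_phase = True  # eligible to decode from the next batch on
            tokens_used += chunk
            break

    # 2. Decode-maximal batching: fill the remaining token budget with decodes,
    #    each of which contributes a single token to the batch.
    for req in requests:
        if (req.in_decode_phase and req.request_id != prefill_req_id
                and tokens_used < token_budget):
            decodes.append(req.request_id)
            tokens_used += 1

    return {"prefill_chunk": prefill, "decodes": decodes}

# Example: one long prompt being prefilled in chunks while two requests decode.
if __name__ == "__main__":
    reqs = [
        Request(request_id=0, prefill_tokens_left=512),
        Request(request_id=1, prefill_tokens_left=0, in_decode_phase=True),
        Request(request_id=2, prefill_tokens_left=0, in_decode_phase=True),
    ]
    print(build_decode_maximal_batch(reqs))
```

The intended effect, as the title suggests, is to piggyback decode tokens on a compute-heavy prefill chunk so that the decode slots add little marginal cost per batch.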

DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization

no code implementations • 20 Jun 2023 • Amey Agrawal, Sameer Reddy, Satwik Bhattamishra, Venkata Prabhakara Sarath Nookala, Vidushi Vashishth, Kexin Rong, Alexey Tumanov

With the increase in the scale of Deep Learning (DL) training workloads in terms of compute resources and time consumption, the likelihood of encountering in-training failures rises substantially, leading to lost work and resource wastage.

Model Compression • Quantization +1

Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

no code implementations • 16 Feb 2022 • Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie Ailijiang, Suresh Krishnappa, Mark Russinovich

At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators (e.g., GPUs, FPGAs).

Scheduling

Learning Digital Circuits: A Journey Through Weight Invariant Self-Pruning Neural Networks

1 code implementation • 30 Aug 2019 • Amey Agrawal, Rohit Karlupia

Recently, in the paper "Weight Agnostic Neural Networks", Gaier & Ha utilized architecture search to find networks where the topology completely encodes the knowledge.
