Search Results for author: Jayashree Mohan

Found 6 papers, 2 papers with code

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

no code implementations · 4 Mar 2024 · Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency.

Scheduling

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

no code implementations · 31 Aug 2023 · Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

SARATHI employs chunked-prefills, which split a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes.

Language Modelling, Large Language Model
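
The batching scheme described in the SARATHI summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `batch_slots` parameter, and the choice to let the final chunk be shorter when the prompt length is not a multiple of the chunk size are all assumptions made for illustration. The sketch also assumes the listed decode requests stay active for every iteration.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int]  # (start, end) token offsets within one prompt


def chunk_prefill(prompt_len: int, chunk_size: int) -> List[Chunk]:
    """Split a prompt of prompt_len tokens into chunks of chunk_size tokens.

    Assumption: the last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def decode_maximal_batches(
    prefill_chunks: List[Chunk],
    active_decodes: List[str],
    batch_slots: int,
) -> List[Dict[str, object]]:
    """Build one batch per prefill chunk: one slot for the chunk,
    the remaining slots filled with ongoing decode requests.

    Simplification: the same decode requests ride along in every batch;
    a real scheduler would retire finished decodes and admit new ones.
    """
    if batch_slots < 1:
        raise ValueError("batch_slots must be at least 1")
    piggybacked = active_decodes[: batch_slots - 1]
    return [
        {"prefill": chunk, "decodes": list(piggybacked)}
        for chunk in prefill_chunks
    ]


if __name__ == "__main__":
    chunks = chunk_prefill(prompt_len=10, chunk_size=4)
    for batch in decode_maximal_batches(chunks, ["d1", "d2", "d3", "d4"], batch_slots=4):
        print(batch)
```

Each batch carries exactly one prefill chunk, so the decodes in the remaining slots share an iteration with prefill work rather than running in a separate, decode-only iteration.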

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

no code implementations · 12 Oct 2021 · Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay Chidambaram

Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources.

Scheduling

Analyzing and Mitigating Data Stalls in DNN Training

no code implementations · 14 Jul 2020 · Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram

We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft.

RECIPE: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes

2 code implementations · 23 Sep 2019 · Se Kwon Lee, Jayashree Mohan, Sanidhya Kashyap, Taesoo Kim, Vijay Chidambaram

We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM).

Distributed, Parallel, and Cluster Computing; Databases; Data Structures and Algorithms
