Search Results for author: Alexey Tumanov

Found 21 papers, 7 papers with code

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

no code implementations4 Mar 2024 Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency.

Scheduling

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

no code implementations27 Dec 2023 Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov

Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overall efficiency of utilization of scarce resources.

Scheduling

ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation

no code implementations24 Oct 2023 Anshul Ahluwalia, Rohit Das, Payman Behnam, Alind Khare, Pan Li, Alexey Tumanov

To address this shortcoming, we propose a novel KD approach to GNN compression that we call Attention-Based Knowledge Distillation (ABKD).

Drug Discovery Fake News Detection +3

Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems

no code implementations3 Jul 2023 Debopam Sanyal, Jui-Tse Hung, Manav Agrawal, Prahlad Jasti, Shahab Nikkhoo, Somesh Jha, Tianhao Wang, Sibin Mohan, Alexey Tumanov

Second, we counter the proposed attack with a noise-based defense mechanism that thwarts fingerprinting by adding noise to the specified performance metrics.

Model extraction

DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization

no code implementations20 Jun 2023 Amey Agrawal, Sameer Reddy, Satwik Bhattamishra, Venkata Prabhakara Sarath Nookala, Vidushi Vashishth, Kexin Rong, Alexey Tumanov

With the increase in the scale of Deep Learning (DL) training workloads in terms of compute resources and time consumption, the likelihood of encountering in-training failures rises substantially, leading to lost work and resource wastage.

Model Compression Quantization +1

SuperFed: Weight Shared Federated Learning

no code implementations26 Jan 2023 Alind Khare, Animesh Agrawal, Myungjin Lee, Alexey Tumanov

We propose SuperFed - an architectural framework that incurs $O(1)$ cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning.

Federated Learning Privacy Preserving

Signed Binary Weight Networks

no code implementations25 Nov 2022 Sachit Kuhar, Alexey Tumanov, Judy Hoffman

Efficient inference of Deep Neural Networks (DNNs) is essential to making AI ubiquitous.

Binarization

UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification

no code implementations26 Oct 2022 Yanbo Xu, Alind Khare, Glenn Matlin, Monish Ramadoss, Rishikesan Kamaleswaran, Chao Zhang, Alexey Tumanov

It achieves within 0. 1% accuracy from the highest-performing multi-class baseline, while saving close to 20X on spatio-temporal cost of inference and earlier (3. 5hrs) disease onset prediction.

Image Classification

CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment

1 code implementation26 Apr 2021 Manas Sahni, Shreya Varshini, Alind Khare, Alexey Tumanov

The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware & latency constraints.

CompOFA – Compound Once-For-All Networks for Faster Multi-Platform Deployment

2 code implementations ICLR 2021 Manas Sahni, Shreya Varshini, Alind Khare, Alexey Tumanov

The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware & latency constrains.

HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units

3 code implementations10 Aug 2020 Shenda Hong, Yanbo Xu, Alind Khare, Satria Priambada, Kevin Maher, Alaa Aljiffry, Jimeng Sun, Alexey Tumanov

HOLMES is tested on risk prediction task on pediatric cardio ICU data with above 95% prediction accuracy and sub-second latency on 64-bed simulation.

Navigate

HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline

no code implementations8 Jan 2020 Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph Gonzalez, Ion Stoica, Alexey Tumanov

Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times.

Scheduling

The OoO VLIW JIT Compiler for GPU Inference

no code implementations28 Jan 2019 Paras Jain, Xiangxi Mo, Ajay Jain, Alexey Tumanov, Joseph E. Gonzalez, Ion Stoica

Current trends in Machine Learning~(ML) inference on hardware accelerated devices (e. g., GPUs, TPUs) point to alarmingly low utilization.

Serverless Computing: One Step Forward, Two Steps Back

3 code implementations10 Dec 2018 Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, Chenggang Wu

Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you go manner.

Distributed, Parallel, and Cluster Computing Databases

InferLine: ML Inference Pipeline Composition Framework

1 code implementation5 Dec 2018 Daniel Crankshaw, Gur-Eyal Sela, Corey Zumar, Xiangxi Mo, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov

The dominant cost in production machine learning workloads is not training individual models but serving predictions from increasingly complex prediction pipelines spanning multiple models, machine learning frameworks, and parallel hardware accelerators.

Distributed, Parallel, and Cluster Computing

Ray: A Distributed Framework for Emerging AI Applications

4 code implementations16 Dec 2017 Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael. I. Jordan, Ion Stoica

To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state.

reinforcement-learning Reinforcement Learning (RL)

IDK Cascades: Fast Deep Learning by Learning not to Overthink

no code implementations3 Jun 2017 Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, Joseph E. Gonzalez

Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions.

Dialogue Generation

Real-Time Machine Learning: The Missing Pieces

2 code implementations11 Mar 2017 Robert Nishihara, Philipp Moritz, Stephanie Wang, Alexey Tumanov, William Paul, Johann Schleier-Smith, Richard Liaw, Mehrdad Niknami, Michael. I. Jordan, Ion Stoica

Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making.

BIG-bench Machine Learning Decision Making

Cannot find the paper you are looking for? You can Submit a new open access paper.