Search Results for author: Alexey Tumanov

Found 21 papers, 7 papers with code

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

no code implementations • 4 Mar 2024 • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency.

Scheduling

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

no code implementations • 27 Dec 2023 • Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov

Serving models under such conditions requires these systems to strike a careful balance between the latency and accuracy requirements of the application and the overall efficiency of utilization of scarce resources.

Scheduling

Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off

no code implementations • 4 Dec 2023 • Sachit Kuhar, Yash Jain, Alexey Tumanov

Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential.

Binarization Computational Efficiency +2

ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation

no code implementations • 24 Oct 2023 • Anshul Ahluwalia, Rohit Das, Payman Behnam, Alind Khare, Pan Li, Alexey Tumanov

To address this shortcoming, we propose a novel KD approach to GNN compression that we call Attention-Based Knowledge Distillation (ABKD).

Drug Discovery Fake News Detection +3

Ethosight: A Reasoning-Guided Iterative Learning System for Nuanced Perception based on Joint-Embedding & Contextual Label Affinity

no code implementations • 20 Jul 2023 • Hugo Latapie, Shan Yu, Patrick Hammer, Kristinn R. Thorisson, Vahagn Petrosyan, Brandon Kynoch, Alind Khare, Payman Behnam, Alexey Tumanov, Aksheit Saxena, Anish Aralikatti, Hanning Chen, Mohsen Imani, Mike Archbold, Tangrui Li, Pei Wang, Justin Hart

Traditional computer vision models often necessitate extensive data acquisition, annotation, and validation.

Event Detection

Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems

no code implementations • 3 Jul 2023 • Debopam Sanyal, Jui-Tse Hung, Manav Agrawal, Prahlad Jasti, Shahab Nikkhoo, Somesh Jha, Tianhao Wang, Sibin Mohan, Alexey Tumanov

Second, we counter the proposed attack with a noise-based defense mechanism that thwarts fingerprinting by adding noise to the specified performance metrics.

Model extraction

Subgraph Stationary Hardware-Software Inference Co-Design

no code implementations • 21 Jun 2023 • Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan, Pranav Gadikar, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexey Tumanov

For the stream of queries, SUSHI yields up to 25% improvement in latency, 0.98% increase in served accuracy.

Quantization

DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization

no code implementations • 20 Jun 2023 • Amey Agrawal, Sameer Reddy, Satwik Bhattamishra, Venkata Prabhakara Sarath Nookala, Vidushi Vashishth, Kexin Rong, Alexey Tumanov

With the increase in the scale of Deep Learning (DL) training workloads in terms of compute resources and time consumption, the likelihood of encountering in-training failures rises substantially, leading to lost work and resource wastage.

Model Compression Quantization +1

SuperFed: Weight Shared Federated Learning

no code implementations • 26 Jan 2023 • Alind Khare, Animesh Agrawal, Myungjin Lee, Alexey Tumanov

We propose SuperFed - an architectural framework that incurs $O(1)$ cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning.

Federated Learning Privacy Preserving

Signed Binary Weight Networks

no code implementations • 25 Nov 2022 • Sachit Kuhar, Alexey Tumanov, Judy Hoffman

Efficient inference of Deep Neural Networks (DNNs) is essential to making AI ubiquitous.

Binarization

UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification

no code implementations • 26 Oct 2022 • Yanbo Xu, Alind Khare, Glenn Matlin, Monish Ramadoss, Rishikesan Kamaleswaran, Chao Zhang, Alexey Tumanov

It achieves within 0.1% accuracy from the highest-performing multi-class baseline, while saving close to 20X on spatio-temporal cost of inference and earlier (3.5hrs) disease onset prediction.

Image Classification

CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment

1 code implementation • 26 Apr 2021 • Manas Sahni, Shreya Varshini, Alind Khare, Alexey Tumanov

The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware & latency constraints.

CompOFA – Compound Once-For-All Networks for Faster Multi-Platform Deployment

2 code implementations • ICLR 2021 • Manas Sahni, Shreya Varshini, Alind Khare, Alexey Tumanov

The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware & latency constraints.

HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units

3 code implementations • 10 Aug 2020 • Shenda Hong, Yanbo Xu, Alind Khare, Satria Priambada, Kevin Maher, Alaa Aljiffry, Jimeng Sun, Alexey Tumanov

HOLMES is tested on a risk prediction task on pediatric cardio ICU data with above 95% prediction accuracy and sub-second latency on a 64-bed simulation.

Navigate

378

HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline

no code implementations • 8 Jan 2020 • Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph Gonzalez, Ion Stoica, Alexey Tumanov

Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times.

Scheduling

The OoO VLIW JIT Compiler for GPU Inference

no code implementations • 28 Jan 2019 • Paras Jain, Xiangxi Mo, Ajay Jain, Alexey Tumanov, Joseph E. Gonzalez, Ion Stoica

Current trends in Machine Learning (ML) inference on hardware accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization.

Serverless Computing: One Step Forward, Two Steps Back

3 code implementations • 10 Dec 2018 • Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, Chenggang Wu

Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you-go manner.

Distributed, Parallel, and Cluster Computing Databases

InferLine: ML Inference Pipeline Composition Framework

1 code implementation • 5 Dec 2018 • Daniel Crankshaw, Gur-Eyal Sela, Corey Zumar, Xiangxi Mo, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov

The dominant cost in production machine learning workloads is not training individual models but serving predictions from increasingly complex prediction pipelines spanning multiple models, machine learning frameworks, and parallel hardware accelerators.

Distributed, Parallel, and Cluster Computing

Ray: A Distributed Framework for Emerging AI Applications

4 code implementations • 16 Dec 2017 • Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica

To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state.

reinforcement-learning Reinforcement Learning (RL)

30,994

IDK Cascades: Fast Deep Learning by Learning not to Overthink

no code implementations • 3 Jun 2017 • Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, Joseph E. Gonzalez

Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions.

Dialogue Generation

Real-Time Machine Learning: The Missing Pieces

2 code implementations • 11 Mar 2017 • Robert Nishihara, Philipp Moritz, Stephanie Wang, Alexey Tumanov, William Paul, Johann Schleier-Smith, Richard Liaw, Mehrdad Niknami, Michael I. Jordan, Ion Stoica

Machine learning applications are increasingly deployed not only to serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making.

BIG-bench Machine Learning Decision Making
