Search Results for author: Gregory R. Ganger

Found 7 papers, 3 papers with code

Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training

no code implementations • 8 Apr 2025 • Daiyaan Arfeen, Dheevatsa Mudigere, Ankit More, Bhargava Gopireddy, Ahmet Inci, Gregory R. Ganger

We also propose a rack-design with improved electrical and thermal capabilities in order to sustain power-boosting of scale-up domains that have experienced failures; combined with NTP, this can allow the DP replica with the reduced TP degree (i.e., with failed GPUs) to keep up with the others, thereby achieving near-zero throughput loss for large-scale LLM training.
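
As a rough, hypothetical illustration of the keep-up argument (a sketch, not a model from the paper): if per-replica work splits evenly across the tensor-parallel (TP) group, iteration time scales roughly with 1/TP degree, so the per-GPU speedup that power-boosting must supply for a degraded replica is the ratio of the healthy TP degree to the reduced one.

# Hypothetical back-of-the-envelope sketch (not from the paper): a replica
# that lost GPUs must run each remaining GPU faster by this factor to avoid
# straggling behind the other data-parallel (DP) replicas.

def required_boost(healthy_tp: int, degraded_tp: int) -> float:
    return healthy_tp / degraded_tp

# Example: an 8-way TP scale-up domain loses one GPU and continues with TP=7;
# power-boosting would need to supply roughly a 1.14x per-GPU speedup.
print(required_boost(8, 7))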

PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

no code implementations • 23 Sep 2024 • Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory R. Ganger, Yida Wang

Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which are often 15-30% and can exceed 60% of the training job's GPU allocation.
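
For context on where those percentages come from, a standard back-of-the-envelope estimate (not PipeFill's own cost model) for a synchronous, GPipe-style pipeline with p stages and m micro-batches puts the idle fraction at roughly (p - 1) / (m + p - 1):

# Standard bubble-fraction estimate for a synchronous pipeline schedule.
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

# e.g. 16 stages with 32 micro-batches idles about a third of GPU time,
# consistent with the 15-30% (and worse) bubble overheads cited above.
print(f"{bubble_fraction(16, 32):.0%}")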


GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

no code implementations • 24 Jun 2024 • Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies.
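
A toy sketch of the graph-of-stages idea (illustrative only; the example model and the simple leveling heuristic are assumptions, not GraphPipe's actual partitioning algorithm): treat the DNN as a DAG of operator groups and bucket them by dependency depth, so groups at the same depth are independent and can run as concurrent pipeline stages.

from collections import defaultdict

# Hypothetical branching model: two parallel encoder branches feeding a head.
dag = {
    "embed": [],
    "branch_a": ["embed"],
    "branch_b": ["embed"],
    "head": ["branch_a", "branch_b"],
}

def stage_levels(dag):
    """Assign each node a level = 1 + max level of its inputs; nodes that
    share a level have no dependency on each other and can form concurrent
    stages in a graph pipeline."""
    levels = {}
    def level(node):
        if node not in levels:
            deps = dag[node]
            levels[node] = 0 if not deps else 1 + max(level(d) for d in deps)
        return levels[node]
    for n in dag:
        level(n)
    grouped = defaultdict(list)
    for n, lvl in levels.items():
        grouped[lvl].append(n)
    return dict(grouped)

print(stage_levels(dag))
# {0: ['embed'], 1: ['branch_a', 'branch_b'], 2: ['head']}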

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

2 code implementations • 27 Aug 2020 • Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize the provided resources.

Deep Learning • Fairness • +1
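
A minimal sketch of the goodput notion Pollux optimizes, i.e., system throughput weighted by statistical efficiency; the efficiency model and scaling numbers below are placeholders for illustration, not Pollux's actual estimators.

def goodput(examples_per_sec: float, statistical_efficiency: float) -> float:
    # Goodput weights raw throughput by how much useful training progress
    # each example contributes; both factors change with resources and batch size.
    return examples_per_sec * statistical_efficiency

def toy_efficiency(batch_size: int, gradient_noise_scale: float = 1024.0) -> float:
    # Placeholder diminishing-returns model: larger batches contribute less
    # progress per example (inspired by gradient-noise-scale reasoning).
    return (gradient_noise_scale + 1) / (gradient_noise_scale + batch_size)

for gpus in (4, 8, 16):
    batch = 256 * gpus            # assumed per-GPU batch size of 256
    throughput = 900.0 * gpus     # assumed near-linear throughput scaling
    print(gpus, "GPUs ->", round(goodput(throughput, toy_efficiency(batch)), 1))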

Accelerating Deep Learning by Focusing on the Biggest Losers

2 code implementations • 2 Oct 2019 • Angela H. Jiang, Daniel L.-K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai

This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration.

Deep Learning
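
A minimal PyTorch-style sketch of the idea as stated in the abstract: score every example in the batch by its loss, then backpropagate only through the highest-loss subset. The fixed top-k selection and the helper name are simplifications, not the paper's exact selection scheme.

import torch
import torch.nn.functional as F

def selective_backprop_step(model, optimizer, inputs, targets, keep_frac=0.5):
    # Score the whole batch by per-example loss without building an autograd graph.
    with torch.no_grad():
        per_example_loss = F.cross_entropy(model(inputs), targets, reduction="none")

    # Keep the highest-loss fraction of examples ("the biggest losers").
    k = max(1, int(keep_frac * inputs.size(0)))
    _, idx = torch.topk(per_example_loss, k)

    # Forward and backward only on the selected subset.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs[idx]), targets[idx])
    loss.backward()
    optimizer.step()
    return loss.item()

Re-running the forward pass on the selected subset keeps the sketch self-contained, at the cost of some extra compute on the scoring pass.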

MLtuner: System Support for Automatic Machine Learning Tuning

1 code implementation • 20 Mar 2018 • Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons

MLtuner automatically tunes settings for training tunables (such as the learning rate, the momentum, the mini-batch size, and the data staleness bound) that have a significant impact on large-scale machine learning (ML) performance.

BIG-bench Machine Learning • General Classification • +3
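
A hedged sketch of the kind of tuning loop the abstract describes (not MLtuner's actual search method): run short trials of candidate tunable settings, score each by measured training progress, and keep the best setting for the full run. The candidate grid, trial_seconds, and the train_for/fake_trial helpers are hypothetical.

import itertools
import random

def pick_tunables(train_for, trial_seconds=30.0):
    # Candidate grid over a few tunables named in the abstract; the values
    # are illustrative defaults, not MLtuner's search space.
    candidates = [
        {"lr": lr, "momentum": m, "batch_size": b}
        for lr, m, b in itertools.product([0.1, 0.01, 0.001], [0.0, 0.9], [64, 256])
    ]
    random.shuffle(candidates)
    # Score each setting by progress during a short trial, keep the best.
    scored = [(train_for(cfg, trial_seconds), cfg) for cfg in candidates]
    _, best_cfg = max(scored, key=lambda pair: pair[0])
    return best_cfg

# Toy stand-in for a real trial runner (hypothetical), returning "progress".
def fake_trial(cfg, seconds):
    return seconds * cfg["lr"] * (1 + cfg["momentum"]) / cfg["batch_size"]

print(pick_tunables(fake_trial))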
