Search Results for author: Saeed Rashidi

Found 7 papers, 2 papers with code

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

3 code implementations · 24 Mar 2023 · William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms.
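For intuition about the kind of analytical estimate such a simulator produces, here is a toy two-phase all-reduce time model on a hierarchical (scale-up plus scale-out) network. Every bandwidth, buffer size, and node count below is an invented assumption for illustration, not a value or method from the paper.

```python
# Toy analytical model of a hierarchical all-reduce, in the spirit of the
# estimates a simulator like ASTRA-sim produces. All numbers are assumed.

def ring_allreduce_time(msg_bytes: float, n: int, bw_bytes_per_s: float) -> float:
    """A ring all-reduce over n endpoints moves 2*(n-1)/n of the buffer each."""
    if n <= 1:
        return 0.0
    return (2.0 * (n - 1) / n) * msg_bytes / bw_bytes_per_s

msg = 1e9  # 1 GB gradient buffer (assumed)

# Phase 1: all-reduce among the 8 NPUs inside a node (fast scale-up links).
t_up = ring_allreduce_time(msg, n=8, bw_bytes_per_s=150e9)
# Phase 2: all-reduce of each NPU's 1/8 shard across 16 nodes (NIC links).
t_out = ring_allreduce_time(msg / 8, n=16, bw_bytes_per_s=25e9)

print(f"scale-up phase : {t_up * 1e3:.2f} ms")
print(f"scale-out phase: {t_out * 1e3:.2f} ms")
```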

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

no code implementations · 30 Nov 2022 · Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
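As a rough illustration of what such a joint sweep looks like, the sketch below enumerates every (data, tensor, pipeline) parallelism split that factors a fixed NPU count and ranks them with a placeholder cost function. The cost weights are invented for illustration; COMET's actual performance models are far more detailed.

```python
# Toy design-space sweep over 3D parallelization strategies for a fixed
# NPU count. The cost function is a made-up placeholder, not COMET's model.

def divisors(n: int):
    return [d for d in range(1, n + 1) if n % d == 0]

def valid_configs(num_npus: int):
    """Yield every (dp, tp, pp) with dp * tp * pp == num_npus."""
    for dp in divisors(num_npus):
        for tp in divisors(num_npus // dp):
            yield dp, tp, num_npus // (dp * tp)

def toy_cost(dp: int, tp: int, pp: int) -> float:
    # Invented weights: tensor parallelism taxed as communication-heavy,
    # pipeline parallelism taxed for bubble overhead.
    return 1.0 / dp + 0.5 * tp + 0.2 * pp

best = min(valid_configs(1024), key=lambda c: toy_cost(*c))
print("lowest toy-cost (dp, tp, pp):", best)
```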

Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

no code implementations · 22 Jul 2022 · Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna

Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the necessity of designing an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.


Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

no code implementations · 9 Oct 2021 · Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna

Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU).
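To make the scheduling problem concrete, here is a toy version of the underlying intuition: split a collective's chunks across the network dimensions in proportion to each dimension's bandwidth, so that no single dimension becomes the straggler. This is a generic sketch of the idea, not Themis's actual policy, and the bandwidth figures are assumptions.

```python
# Toy bandwidth-proportional chunk split across network dimensions.
# Numbers are invented; this is not Themis's algorithm.

def split_by_bandwidth(total_chunks: int, dim_bw_gbps: list) -> list:
    total_bw = sum(dim_bw_gbps)
    shares = [round(total_chunks * bw / total_bw) for bw in dim_bw_gbps]
    shares[-1] += total_chunks - sum(shares)  # absorb rounding drift
    return shares

# Hypothetical 3D hierarchical network: scale-up, intra-pod, scale-out.
print(split_by_bandwidth(64, [300.0, 64.0, 25.0]))  # -> [49, 11, 4]
```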


Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models

no code implementations · 24 Sep 2021 · William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

High-performance distributed training platforms should leverage multi-dimensional hierarchical networks, which interconnect accelerators through different levels of the network, to dramatically reduce the number of expensive NICs required for the scale-out network.
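A back-of-the-envelope calculation shows the scale of the savings being argued for: if intra-node traffic stays on scale-up links, far fewer scale-out NICs are needed than in a flat design. The node size and NIC ratio below are assumptions for illustration, not figures from the paper.

```python
# Toy NIC-count comparison: flat scale-out fabric vs. a two-level hierarchy.
# All numbers are assumed for illustration.

num_npus = 1024
npus_per_node = 8

flat_nics = num_npus                    # flat fabric: one NIC per NPU
hier_nics = num_npus // npus_per_node   # hierarchy: one NIC per node

print(f"flat scale-out fabric: {flat_nics} NICs")   # 1024
print(f"two-level hierarchy  : {hier_nics} NICs")   # 128
```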

Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

no code implementations · 19 Aug 2020 · Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna

In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers).
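The basic mechanics of such a parallel split can be shown with a column-wise partition of a single trained fully-connected layer: each worker holds a slice of the weights and computes its share of the output independently. This is a generic sketch with invented sizes; it does not reflect the paper's restructuring or pruning steps.

```python
# Generic column-wise split of a trained layer across workers; not the
# paper's restructuring/pruning method. Sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))  # already-trained weights (toy)
x = rng.standard_normal(512)         # one input activation vector

workers = 4
slices = np.array_split(W, workers, axis=1)  # each worker keeps a column slice

partials = [x @ W_i for W_i in slices]       # computed independently, in parallel
y = np.concatenate(partials)                 # gather output slices

assert np.allclose(y, x @ W)                 # identical to the single-node result
```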
