14 Feb 2024 • Sudarsanan Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, Manya Ghobadi
We present MLTCP, a technique to augment today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters.
1 Aug 2023 • Sudarsanan Rajasekaran, Manya Ghobadi, Aditya Akella
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.