no code implementations • 14 Feb 2024 • Sudarsanan Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, Manya Ghobadi
We present MLTCP, a technique to augment today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters.
no code implementations • 1 Aug 2023 • Sudarsanan Rajasekaran, Manya Ghobadi, Aditya Akella
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
no code implementations • 22 Jul 2023 • Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
We show that LLMs exhibit a unique communication pattern where only small groups of GPUs require high-bandwidth communication to achieve near-optimal training performance.
no code implementations • 31 Mar 2023 • Homa Esfahanizadeh, Adam Yala, Rafael G. L. D'Oliveira, Andrea J. D. Jaba, Victor Quach, Ken R. Duffy, Tommi S. Jaakkola, Vinod Vaikuntanathan, Manya Ghobadi, Regina Barzilay, Muriel Médard
Allowing organizations to share their data for training of machine learning (ML) models without unintended information leakage is an open problem in practice.
1 code implementation • 4 Jun 2021 • Adam Yala, Homa Esfahanizadeh, Rafael G. L. D'Oliveira, Ken R. Duffy, Manya Ghobadi, Tommi S. Jaakkola, Vinod Vaikuntanathan, Regina Barzilay, Muriel Médard
We propose to approximate this family of encoding functions through random deep neural networks.