Search Results for author: Kalyan Saladi

Found 3 papers, 1 paper with code

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

no code implementations • 20 Nov 2024 • Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources.

Revisiting Reliability in Large-Scale Machine Learning Research Clusters

no code implementations • 29 Oct 2024 • Apostolos Kokolis, Michael Kuchnik, John Hoffman, Adithya Kumar, Parth Malani, Faye Ma, Zachary DeVito, Shubho Sengupta, Kalyan Saladi, Carole-Jean Wu

Reliability is a fundamental challenge in operating large-scale machine learning (ML) infrastructures, particularly as the scale of ML models and training clusters continues to grow.
