17 Jul 2023 • Dongning Ma, Xun Jiao, Fred Lin, Mengshi Zhang, Alban Desmaison, Thomas Sellinger, Daniel Moore, Sriram Sankar
Deep recommendation systems (DRS) heavily depend on specialized HPC hardware and accelerators to optimize energy, efficiency, and recommendation quality.
7 Dec 2022 • Ruixuan Wang, Fred Lin, Daniel Moore, Sriram Sankar, Xun Jiao
Inspired by the inherent algorithmic resilience of deep learning (DL) methods, this paper conducts, for the first time, a large-scale empirical study of graph neural network (GNN) resilience, aiming to understand the relationship between hardware faults and GNN accuracy.
22 Feb 2021 • Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar
This has resulted in hundreds of CPUs being identified as affected by these errors, showing that silent data corruptions (SDCs) are a systemic issue across generations.
Hardware Architecture • Distributed, Parallel, and Cluster Computing
1 Nov 2019 • Fred Lin, Keyur Muzumdar, Nikolay Pavlovich Laptev, Mihai-Valentin Curelea, Seunghak Lee, Sriram Sankar
In this paper, we present a fast dimensional analysis framework that automates root cause analysis on structured logs with improved scalability.