no code implementations • 13 Oct 2020 • Alessio Netti, Daniele Tafani, Michael Ott, Martin Schulz
Modern High-Performance Computing (HPC) and data center operators increasingly rely on data analytics techniques to improve the efficiency and reliability of their operations.
no code implementations • 27 Jul 2020 • Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments.
no code implementations • 14 Oct 2019 • Alessio Netti, Micha Mueller, Carla Guillen, Michael Ott, Daniele Tafani, Gence Ozer, Martin Schulz
While monitoring is a common reality in HPC, there is no well-stated and comprehensive list of requirements, nor matching frameworks, to support holistic and online Operational Data Analytics (ODA).
no code implementations • 26 Oct 2018 • Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates.
Distributed, Parallel, and Cluster Computing
1 code implementation • 26 Jul 2018 • Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments.
Distributed, Parallel, and Cluster Computing