no code implementations • 25 Nov 2024 • Satoki Ishikawa, Tal Ben-Nun, Brian Van Essen, Rio Yokota, Nikoli Dryden
Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck.
1 code implementation • 3 Oct 2023 • Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler
We present the V:N:M format, which enables the execution of arbitrary N:M ratios on Sparse Tensor Cores (SPTCs).
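For intuition, here is a minimal NumPy sketch of the fixed N:M pattern that V:N:M generalizes: keep the N largest-magnitude weights in each group of M consecutive weights, as in the 2:4 pattern SPTCs support natively. The V:N:M layout itself is more involved; this only illustrates the underlying sparsity pattern.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero all but the n largest-magnitude entries in each group of m
    consecutive weights (the basic N:M sparsity pattern)."""
    flat = weights.reshape(-1, m)                       # group consecutive weights
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.random.randn(4, 8).astype(np.float32)
print(prune_n_m(w))   # every group of 4 consecutive weights keeps only 2
```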
no code implementations • 23 Aug 2023 • Julia Bazinska, Andrei Ivanov, Tal Ben-Nun, Nikoli Dryden, Maciej Besta, Siyuan Shen, Torsten Hoefler
Graph Neural Networks (GNNs) are a powerful tool for handling structured graph data and addressing tasks such as node classification, graph classification, and clustering.
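As a rough illustration of what a single GNN layer computes (a generic message-passing step, not the specific models studied in this paper), each node mean-aggregates its neighbors' features before a learned transformation:

```python
import numpy as np

def gnn_layer(adj: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One simplified message-passing step: mean-aggregate neighbor
    features, then apply a learned linear map and a ReLU."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)   # avoid divide-by-zero
    agg = (adj @ feats) / deg                          # mean over neighbors
    return np.maximum(agg @ weight, 0.0)               # linear + ReLU

adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # 3-node toy graph
feats = np.random.randn(3, 4)                              # node features
out = gnn_layer(adj, feats, np.random.randn(4, 8))         # new node embeddings
```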
no code implementations • 15 Apr 2023 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Saleh Ashkboos, Torsten Hoefler
As deep learning models grow, sparsity is becoming an increasingly critical component of deep neural networks, enabling improved performance and reduced storage.
no code implementations • 14 Mar 2023 • Lukas Trümper, Tal Ben-Nun, Philipp Schaad, Alexandru Calotoiu, Torsten Hoefler
Performance optimization is an increasingly challenging but often repetitive task.
no code implementations • 3 Jan 2023 • Niels Gleinig, Tal Ben-Nun, Torsten Hoefler
In a two-level memory hierarchy, we can only process data that is stored in fast memory, which incurs data movement (input/output operations, or I/Os) between the fast and slow units.
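A toy illustration of I/O counting in this two-level model, assuming an LRU-managed fast memory (the replacement policy is our assumption for the example, not the paper's):

```python
from collections import OrderedDict

def count_ios(accesses, fast_capacity):
    """Count I/Os for an access trace: every access to an item not
    resident in fast memory costs one transfer from slow memory,
    possibly evicting the least-recently-used resident item."""
    fast = OrderedDict()
    ios = 0
    for item in accesses:
        if item in fast:
            fast.move_to_end(item)          # hit: refresh LRU position
        else:
            ios += 1                        # miss: one slow-to-fast transfer
            if len(fast) >= fast_capacity:
                fast.popitem(last=False)    # evict least recently used
            fast[item] = True
    return ios

# Streaming over 8 items twice with room for only 4 makes every access miss.
trace = list(range(8)) * 2
print(count_ios(trace, fast_capacity=4))    # 16
```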
1 code implementation • 29 Jun 2022 • Saleh Ashkboos, Langwen Huang, Nikoli Dryden, Tal Ben-Nun, Peter Dueben, Lukas Gianinazzi, Luca Kummer, Torsten Hoefler
We propose the ENS-10 prediction correction task for improving the forecast quality at a 48-hour lead time through ensemble post-processing.
1 code implementation • 20 Oct 2021 • Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler
Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute.
no code implementations • 7 Jun 2021 • Lukas Gianinazzi, Maximilian Fries, Nikoli Dryden, Tal Ben-Nun, Maciej Besta, Torsten Hoefler
We present a novel neural architecture to solve graph optimization problems where the solution consists of arbitrary node labels, allowing us to solve hard problems like graph coloring.
no code implementations • 31 Jan 2021 • Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components.
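The simplest instance of such pruning is magnitude pruning, which zeroes the smallest-magnitude weights; a minimal NumPy sketch (the survey covers far richer criteria and schedules):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(256, 256).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.9)          # keep only the top 10%
print(f"nonzeros: {np.count_nonzero(w_sparse) / w.size:.2%}")
```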
no code implementations • 21 Jan 2021 • Nikoli Dryden, Roman Böhringer, Tal Ben-Nun, Torsten Hoefler
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments.
2 code implementations • 1 Jan 2021 • Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry
Although CNNs are trained on images of a specific size, it is well established that they can be used to evaluate a wide range of image sizes at test time by adjusting the size of intermediate feature maps.
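One common way to realize this flexibility (a generic mechanism, not necessarily the paper's exact method) is adaptive pooling, which maps any spatial resolution to a fixed feature-map size, e.g. in PyTorch:

```python
import torch
import torch.nn as nn

# A conv stack followed by adaptive pooling accepts any input resolution:
# the pooling layer resizes the intermediate feature map to a fixed 1x1
# grid, so the classifier sees the same shape regardless of image size.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),     # any HxW -> 1x1
    nn.Flatten(),
    nn.Linear(32, 10),
)

for size in (32, 64, 224):       # same weights, different test-time sizes
    x = torch.randn(1, 3, size, size)
    print(size, model(x).shape)  # always torch.Size([1, 10])
```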
no code implementations • 21 Nov 2020 • Chris Cummins, Hugh Leather, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle
Compiler architects increasingly look to machine learning when building heuristics for compiler optimization.
1 code implementation • 30 Jun 2020 • Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
Transformers are one of the most important machine learning workloads today.
1 code implementation • 18 May 2020 • Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, Torsten Hoefler
Applied to global data, our mixed models achieve a relative improvement of over 14% in ensemble forecast skill, measured by the continuous ranked probability score (CRPS).
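For reference, the standard ensemble CRPS estimator for members x_i and observation y is mean|x_i - y| - 0.5 * mean|x_i - x_j| (lower is better); a direct NumPy translation:

```python
import numpy as np

def crps_ensemble(members: np.ndarray, obs: float) -> float:
    """Standard ensemble CRPS estimator:
    mean |x_i - y| - 0.5 * mean |x_i - x_j|."""
    err = np.abs(members - obs).mean()
    spread = np.abs(members[:, None] - members[None, :]).mean()
    return err - 0.5 * spread

members = np.random.normal(loc=1.0, scale=0.5, size=10)  # toy 10-member forecast
print(crps_ensemble(members, obs=1.2))
```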
no code implementations • 30 Apr 2020 • Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler
For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
2 code implementations • 23 Mar 2020 • Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather
We introduce ProGraML (Program Graphs for Machine Learning), a novel graph-based program representation using a low-level, language-agnostic, and portable format, together with machine learning models capable of performing complex downstream tasks over these graphs.
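As a rough, hypothetical illustration (the actual ProGraML format is derived from compiler IR and is considerably richer), a straight-line fragment can be encoded as instruction nodes joined by control- and data-flow edges:

```python
# Toy graph encoding of:   a = x + y;  b = a * 2;  return b
nodes = ["add", "mul", "ret"]            # one node per instruction
edges = [
    (0, 1, "control"),   # add executes before mul
    (1, 2, "control"),   # mul executes before ret
    (0, 1, "data"),      # mul consumes the value produced by add
    (1, 2, "data"),      # ret consumes the value produced by mul
]
# A GNN over such graphs can then learn downstream tasks
# (e.g., device mapping or algorithm classification) from program structure.
```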
no code implementations • 2 Nov 2019 • Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler
Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations.
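Conceptually, such a system runs the forecast model from perturbed initial conditions and reads statistics off the resulting sample; a toy sketch with a stand-in chaotic model (the real system is a numerical weather prediction model):

```python
import numpy as np

def forecast(state: np.ndarray, steps: int) -> np.ndarray:
    """Stand-in for a chaotic forecast model (logistic map is
    sensitive to initial conditions, like the atmosphere)."""
    for _ in range(steps):
        state = 3.9 * state * (1.0 - state)
    return state

rng = np.random.default_rng(0)
analysis = 0.5                                        # best estimate of current state
members = analysis + 1e-3 * rng.standard_normal(10)   # perturbed initial conditions
outcomes = forecast(members, steps=30)

# Nonparametric statistics of the ensemble quantify forecast uncertainty.
print("mean:", outcomes.mean(), "spread:", outcomes.std())
print("quantiles:", np.quantile(outcomes, [0.1, 0.5, 0.9]))
```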
2 code implementations • 12 Aug 2019 • Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry
Although CNNs are trained on images of a specific size, it is well established that they can be used to evaluate a wide range of image sizes at test time by adjusting the size of intermediate feature maps.
no code implementations • 12 Aug 2019 • Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself.
3 code implementations • 27 Feb 2019 • Tal Ben-Nun, Johannes De Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
With the ubiquity of accelerators such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill set of the average scientist in domains outside of computer science.
Programming Languages · Distributed, Parallel, and Cluster Computing · Performance
no code implementations • 25 Feb 2019 • Maciej Besta, Dimitri Stanojevic, Johannes De Fine Licht, Tal Ben-Nun, Torsten Hoefler
To facilitate understanding of this emerging domain, we present the first survey and taxonomy on graph computations on FPGAs.
Distributed, Parallel, and Cluster Computing · Hardware Architecture
1 code implementation • 29 Jan 2019 • Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, Torsten Hoefler
We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques.
1 code implementation • 27 Jan 2019 • Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry
We analyze the effect of batch augmentation on gradient variance and show that it empirically improves convergence for a wide variety of deep neural networks and datasets.
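Batch augmentation replicates each sample several times within one batch, each copy under an independent random transform; a minimal PyTorch sketch (the horizontal flip here is a stand-in for the paper's augmentations):

```python
import torch

def batch_augment(x: torch.Tensor, copies: int) -> torch.Tensor:
    """Replicate each sample `copies` times with independent random
    augmentations, so the batch contains multiple augmented
    instances of every sample."""
    reps = x.repeat(copies, 1, 1, 1)                 # (copies*B, C, H, W)
    flip = torch.rand(reps.size(0)) < 0.5            # independent coin per copy
    reps[flip] = torch.flip(reps[flip], dims=[-1])   # random horizontal flip
    return reps

x = torch.randn(8, 3, 32, 32)        # original batch of 8 images
big = batch_augment(x, copies=4)     # effective batch of 32
print(big.shape)                     # torch.Size([32, 3, 32, 32])
```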
1 code implementation • NeurIPS 2018 • Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler
In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks.
1 code implementation • 13 Apr 2018 • Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka
NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning.
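As context for why kernel choice matters, cuDNN ships several algorithms per operation and frameworks can autotune among them; for instance, PyTorch exposes this selection with a single flag (shown purely for illustration, unrelated to the paper's own tooling):

```python
import torch

# cuDNN provides multiple kernels (algorithms) per operation; benchmark
# mode times the candidates on first use and caches the fastest one for
# each input configuration.
torch.backends.cudnn.benchmark = True

if torch.cuda.is_available():
    conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()
    x = torch.randn(32, 64, 56, 56, device="cuda")
    y = conv(x)          # first call triggers cuDNN algorithm selection
    print(y.shape)       # torch.Size([32, 128, 56, 56])
```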
1 code implementation • 26 Feb 2018 • Tal Ben-Nun, Torsten Hoefler
We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning.
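Of these, data parallelism is the easiest to sketch: each worker computes gradients on its own data shard, and the workers average them (an allreduce) before updating identical model replicas. A toy NumPy version on a least-squares model:

```python
import numpy as np

def local_gradient(w, x_shard, y_shard):
    """Least-squares gradient on one worker's shard of the data."""
    return 2.0 * x_shard.T @ (x_shard @ w - y_shard) / len(x_shard)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((64, 5)), rng.standard_normal(64)
w = np.zeros(5)
workers = 4

for _ in range(100):
    # Each worker owns one shard; in a real system these run in parallel
    # and the averaging below is an allreduce over the network.
    grads = [local_gradient(w, xs, ys)
             for xs, ys in zip(np.split(X, workers), np.split(y, workers))]
    w -= 0.1 * np.mean(grads, axis=0)   # averaged gradient = full-batch gradient
```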