Search Results for author: Alexander Heinecke

Found 15 papers, 7 papers with code

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures

no code implementations • 25 Apr 2023 • Evangelos Georganas, Dhiraj Kalamkar, Kirill Voronin, Abhisek Kundu, Antonio Noack, Hans Pabst, Alexander Breuer, Alexander Heinecke

During the past decade, Deep Learning (DL) algorithms, programming systems, and hardware have converged with their High Performance Computing (HPC) counterparts.
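The core idea here is a separation of concerns: high-level loops traverse tensor blocks, while the innermost compute is delegated to a hand-optimized microkernel. A minimal sketch of that split in plain NumPy, with np.matmul standing in for the hardware-specific microkernel (names and blocking factors are illustrative, not the paper's implementation):

    import numpy as np

    def microkernel(a_blk, b_blk, c_blk):
        # Stand-in for an expert-tuned small GEMM kernel.
        c_blk += a_blk @ b_blk

    def blocked_matmul(A, B, bm=64, bn=64, bk=64):
        # The "high-level" part: only tile traversal lives here.
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N))
        for i in range(0, M, bm):
            for j in range(0, N, bn):
                for k in range(0, K, bk):
                    microkernel(A[i:i+bm, k:k+bk],
                                B[k:k+bk, j:j+bn],
                                C[i:i+bm, j:j+bn])
        return C

    A, B = np.random.rand(256, 256), np.random.rand(256, 256)
    assert np.allclose(blocked_matmul(A, B), A @ B)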

FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems

no code implementations • 22 Apr 2022 • Rui Ma, Evangelos Georganas, Alexander Heinecke, Andrew Boutros, Eriko Nurvitadhi

The overhead of collective communication operations in a distributed AI training system can bottleneck its performance, with more pronounced effects as the number of nodes increases.

Data Compression
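The collective in question is typically an all-reduce over gradients, whose latency grows with the number of participating nodes. A minimal in-process simulation of a ring all-reduce (illustrative; the paper offloads such logic to the FPGA SmartNIC rather than running it on the host):

    import numpy as np

    def ring_allreduce(chunks):
        """chunks: one gradient vector per node, all the same length."""
        n = len(chunks)
        parts = [np.array_split(c.copy(), n) for c in chunks]
        # Reduce-scatter: after n-1 steps node i owns the fully
        # reduced segment (i + 1) % n.
        for step in range(n - 1):
            for node in range(n):
                seg = (node - step) % n
                parts[(node + 1) % n][seg] += parts[node][seg]
        # All-gather: n-1 more steps circulate the reduced segments,
        # so total latency scales with the node count n.
        for step in range(n - 1):
            for node in range(n):
                seg = (node + 1 - step) % n
                parts[(node + 1) % n][seg] = parts[node][seg].copy()
        return [np.concatenate(p) for p in parts]

    grads = [np.random.rand(12) for _ in range(4)]
    assert all(np.allclose(g, sum(grads)) for g in ring_allreduce(grads))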

Efficient and Generic 1D Dilated Convolution Layer for Deep Learning

1 code implementation • 16 Apr 2021 • Narendra Chaudhary, Sanchit Misra, Dhiraj Kalamkar, Alexander Heinecke, Evangelos Georganas, Barukh Ziv, Menachem Adelman, Bharat Kaul

Finally, we demonstrate the performance of our optimized 1D convolution layer by using it in end-to-end neural network training on real genomics datasets, achieving up to 6.86x speedup over the oneDNN library-based implementation on Cascade Lake CPUs.

Image Classification, Speech Recognition +1
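For reference, the computation being optimized, here with stride 1 and no padding, written as a direct loop in NumPy (the paper's kernel is a cache-blocked, vectorized CPU implementation, not this naive form):

    import numpy as np

    def dilated_conv1d(x, w, dilation=1):
        """x: (in_ch, width), w: (out_ch, in_ch, k) -> (out_ch, out_width)."""
        in_ch, width = x.shape
        out_ch, _, k = w.shape
        span = (k - 1) * dilation + 1      # receptive field of one output
        out_w = width - span + 1
        out = np.zeros((out_ch, out_w))
        for o in range(out_ch):
            for t in range(k):
                # Tap t reads the input shifted by t * dilation.
                out[o] += w[o, :, t] @ x[:, t * dilation : t * dilation + out_w]
        return out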

DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

no code implementations • 14 Apr 2021 • Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K. Ahmed, Sasikanth Avancha

Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that must scale to hundreds of compute nodes to be feasible.

Graph Partitioning
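The expensive step in full-batch GNN training is neighborhood aggregation over the whole graph; once vertices are partitioned across compute nodes, every cut edge implies communication. A toy single-process sketch of one mean-aggregation step under a partitioning (illustrative only; DistGNN's actual vertex-cut partitioning and delayed remote aggregation are considerably more involved):

    import numpy as np

    def aggregate(features, edges, part_of, my_part):
        """Mean of neighbor features for vertices owned by my_part."""
        n, d = features.shape
        acc, deg = np.zeros((n, d)), np.zeros(n)
        for src, dst in edges:
            if part_of[dst] != my_part:
                continue                    # destination owned elsewhere
            # If part_of[src] != my_part, features[src] would have to be
            # fetched over the network; here all memory is shared.
            acc[dst] += features[src]
            deg[dst] += 1
        nz = deg > 0
        acc[nz] /= deg[nz][:, None]
        return acc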

PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives

no code implementations • 6 Feb 2020 • Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Gagandeep Goyal, Ramakrishna Upadrasta, Bharat Kaul

In this paper, we develop a hybrid approach to writing deep learning kernels that achieves the best of both worlds: expert-coded microkernels handle the innermost loops, while polyhedral technology automatically tunes the outer loops for performance.
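A drastically simplified sketch of that division of labor: the innermost tile computation stays fixed (np.matmul standing in for an expert microkernel) while the outer-loop tile sizes are chosen automatically. A brute-force timing search stands in for the paper's polyhedral analysis; everything here is illustrative:

    import itertools, time
    import numpy as np

    def blocked_gemm(A, B, bm, bn, bk):
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N))
        for i in range(0, M, bm):           # outer loops: the tuned part
            for j in range(0, N, bn):
                for k in range(0, K, bk):
                    # fixed "microkernel" for the innermost computation
                    C[i:i+bm, j:j+bn] += A[i:i+bm, k:k+bk] @ B[k:k+bk, j:j+bn]
        return C

    def tune(A, B, candidates=(32, 64, 128)):
        best, best_t = None, float("inf")
        for bm, bn, bk in itertools.product(candidates, repeat=3):
            t0 = time.perf_counter()
            blocked_gemm(A, B, bm, bn, bk)
            elapsed = time.perf_counter() - t0
            if elapsed < best_t:
                best, best_t = (bm, bn, bk), elapsed
        return best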

Training Neural Machine Translation (NMT) Models using Tensor Train Decomposition on TensorFlow (T3F)

no code implementations • 5 Nov 2019 • Amelia Drew, Alexander Heinecke

For the IWSLT English-Vietnamese training, we obtain BLEU test/dev scores of 24.0/21.9 and 24.2/21.9 using core dimensions $(2, 2, 256) \times (2, 2, 512)$ with learning rate 0.0012 and rank distributions $(1, 4, 4, 1)$ and $(1, 4, 16, 1)$ respectively.

Machine Translation, NMT +1
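To make the quoted hyperparameters concrete: with core dimensions (2, 2, 256) x (2, 2, 512) and rank distribution (1, 4, 4, 1), the Tensor Train (TT) matrix consists of three 4-D cores that reconstruct a 1024 x 2048 weight matrix from far fewer parameters. A plain-NumPy sketch (the paper itself uses the T3F library on TensorFlow):

    import numpy as np

    row_modes, col_modes = (2, 2, 256), (2, 2, 512)
    ranks = (1, 4, 4, 1)

    # Core k has shape (r_{k-1}, m_k, n_k, r_k).
    cores = [np.random.randn(ranks[k], row_modes[k], col_modes[k], ranks[k+1])
             for k in range(3)]

    def tt_to_full(cores):
        """Contract the TT cores back into the dense weight matrix."""
        full = cores[0]                          # (1, m1, n1, r1)
        for core in cores[1:]:
            full = np.einsum('aMNr,rmns->aMmNns', full, core)
            a, M, m, N, n, s = full.shape
            full = full.reshape(a, M * m, N * n, s)
        return full[0, :, :, 0]

    W = tt_to_full(cores)
    assert W.shape == (1024, 2048)
    # 16 + 64 + 524288 TT parameters vs. 2097152 dense ones (~4x smaller).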

Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures

2 code implementations • 16 Aug 2018 • Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst, Alexander Heinecke

Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs) which provide state-of-the-art results for tasks like image recognition, neural machine translation and speech recognition.

Distributed, Parallel, and Cluster Computing
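One key technique in such kernels is blocking the channel dimension so that the innermost, contiguous dimension matches the SIMD width and maps directly onto vector registers (an NCHWc-style layout). A NumPy sketch with block size 16, mirroring one AVX-512 fp32 register (illustrative; the paper's actual kernels are far more elaborate):

    import numpy as np

    VLEN = 16  # fp32 lanes in one AVX-512 register

    def conv2d_blocked(x, w):
        """x: (Cin/VLEN, H, W, VLEN); w: (Cout/VLEN, Cin/VLEN, kh, kw, VLEN, VLEN)."""
        cib, H, Wd, _ = x.shape
        cob, _, kh, kw, _, _ = w.shape
        Ho, Wo = H - kh + 1, Wd - kw + 1
        y = np.zeros((cob, Ho, Wo, VLEN))
        for co in range(cob):
            for ci in range(cib):
                for r in range(kh):
                    for s in range(kw):
                        # Innermost op: contiguous VLEN-wide blocks, i.e.
                        # vectorizable fused multiply-adds on real hardware.
                        patch = x[ci, r:r+Ho, s:s+Wo, :]     # (Ho, Wo, VLEN)
                        y[co] += patch @ w[co, ci, r, s]     # (VLEN, VLEN)
        return y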
