Search Results for author: Gennady Pekhimenko

Found 27 papers, 16 papers with code

Accelerating Graph Neural Networks on Real Processing-In-Memory Systems

no code implementations26 Feb 2024 Christina Giannoula, Peiming Yang, Ivan Fernandez Vega, Jiacheng Yang, Yu Xin Li, Juan Gomez Luna, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko

Graph Neural Network (GNN) execution involves both compute-intensive and memory-intensive kernels, the latter dominates the total time, being significantly bottlenecked by data movement between memory and processors.

Minuet: Accelerating 3D Sparse Convolutions on GPUs

1 code implementation1 Dec 2023 Jiacheng Yang, Christina Giannoula, Jun Wu, Mostafa Elhoushi, James Gleeson, Gennady Pekhimenko

Minuet proposes to (i) replace the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that highly utilizes the on-chip memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, such that to adapt the execution to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launching overheads.

The Synergy of Speculative Decoding and Batching in Serving Large Language Models

no code implementations28 Oct 2023 Qidong Su, Christina Giannoula, Gennady Pekhimenko

Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines.

Text Generation

Speeding up Fourier Neural Operators via Mixed Precision

1 code implementation27 Jul 2023 Colin White, Renbo Tu, Jean Kossaifi, Gennady Pekhimenko, Kamyar Azizzadenesheli, Anima Anandkumar

In this work, we (i) profile memory and runtime for FNO with full and mixed-precision training, (ii) conduct a study on the numerical stability of mixed-precision training of FNO, and (iii) devise a training routine which substantially decreases training time and memory usage (up to 34%), with little or no reduction in accuracy, on the Navier-Stokes and Darcy flow equations.

Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction

1 code implementation19 Oct 2022 Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, Gennady Pekhimenko

We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task.

Optimizing Data Collection in Deep Reinforcement Learning

no code implementations15 Jul 2022 James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko

We show that simulator kernel fusion speedups with a simple simulator are $11. 3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements.

reinforcement-learning Reinforcement Learning (RL)

RL-Scope: Cross-Stack Profiling for Deep Reinforcement Learning Workloads

1 code implementation8 Feb 2021 James Gleeson, Srivatsan Krishnan, Moshe Gabel, Vijay Janapa Reddi, Eyal de Lara, Gennady Pekhimenko

Deep reinforcement learning (RL) has made groundbreaking advancements in robotics, data center management and other applications.

Management reinforcement-learning +1

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

2 code implementations3 Feb 2021 Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, Gennady Pekhimenko

Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models increases staggeringly in recent years.

A Runtime-Based Computational Performance Predictor for Deep Neural Network Training

1 code implementation31 Jan 2021 Geoffrey X. Yu, Yubo Gao, Pavel Golikov, Gennady Pekhimenko

Our technique exploits the observation that, because DNN training consists of repetitive compute steps, predicting the execution time of a single iteration is usually enough to characterize the performance of an entire training process.

IOS: Inter-Operator Scheduler for CNN Acceleration

1 code implementation2 Nov 2020 Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, Song Han

To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.

FPRaker: A Processing Element For Accelerating Neural Network Training

no code implementations15 Oct 2020 Omar Mohamed Awad, Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Ciaran Bannon, Anand Jayarajan, Gennady Pekhimenko, Andreas Moshovos

We demonstrate that FPRaker can be used to compose an accelerator for training and that it can improve performance and energy efficiency compared to using conventional floating-point units under ISO-compute area constraints.

Quantization

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference

no code implementations1 Sep 2020 Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, Andreas Moshovos

TensorDash is a hardware level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams.

Skyline: Interactive In-Editor Computational Performance Profiling for Deep Neural Network Training

1 code implementation15 Aug 2020 Geoffrey X. Yu, Tovi Grossman, Gennady Pekhimenko

Training a state-of-the-art deep neural network (DNN) is a computationally-expensive and time-consuming process, which incentivizes deep learning developers to debug their DNNs for computational performance.

Multi-node Bert-pretraining: Cost-efficient Approach

no code implementations1 Aug 2020 Jiahuang Lin, Xin Li, Gennady Pekhimenko

As a result, to train these models within a reasonable time, machine learning (ML) programmers often require advanced hardware setups such as the premium GPU-enabled NVIDIA DGX workstations or specialized accelerators such as Google's TPU Pods.

Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training

no code implementations5 Jun 2020 Hongyu Zhu, Amar Phanishayee, Gennady Pekhimenko

Modern deep neural network (DNN) training jobs use complex and heterogeneous software/hardware stacks.

BPPSA: Scaling Back-propagation by Parallel Scan Algorithm

1 code implementation23 Jul 2019 Shang Wang, Yifan Bai, Gennady Pekhimenko

In an era when the performance of a single compute device plateaus, software must be designed to scale on massively parallel systems for better runtime performance.

Priority-based Parameter Propagation for Distributed DNN Training

1 code implementation10 May 2019 Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, Gennady Pekhimenko

Data parallel training is widely used for scaling distributed deep neural network (DNN) training.

StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory

no code implementations4 Jan 2019 Hongyu Miao, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, Felix Xiaozhu Lin

It dynamically optimizes for both the high bandwidth and limited capacity of HBM, and the limited bandwidth and high capacity of standard DRAM.

Databases

Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training

no code implementations22 May 2018 Bojian Zheng, Abhishek Tiwari, Nandita Vijaykumar, Gennady Pekhimenko

For each feature map recomputation to be effective and efficient, its effect on (1) the total memory footprint, and (2) the total execution time has to be carefully estimated.

Machine Translation NMT

TBD: Benchmarking and Analyzing Deep Neural Network Training

no code implementations16 Mar 2018 Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, Gennady Pekhimenko

Our primary goal in this work is to break this myopic view by (i) proposing a new benchmark for DNN training, called TBD (TBD is short for Training Benchmark for DNNs), that uses a representative set of DNN models that cover a wide range of machine learning applications: image classification, machine translation, speech recognition, object detection, adversarial networks, reinforcement learning, and (ii) by performing an extensive performance analysis of training these different applications on three major deep learning frameworks (TensorFlow, MXNet, CNTK) across different hardware configurations (single-GPU, multi-GPU, and multi-machine).

Benchmarking General Classification +6

Cannot find the paper you are looking for? You can Submit a new open access paper.