Search Results for author: Aamir Shafi

Found 6 papers, 4 papers with code

The Case for Co-Designing Model Architectures with Hardware

1 code implementation • 25 Jan 2024 • Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models.
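One concrete instance of such co-design is sizing model dimensions to the GPU's tile granularity so that GEMMs keep every tile fully utilized. The sketch below is illustrative only (the helper name and tile size of 64 are assumptions, not taken from the paper):

```python
# Illustrative sketch: GPU tensor cores run closest to peak when GEMM
# dimensions are multiples of the hardware tile size, so model dimensions
# are often rounded up accordingly. `pad_to_multiple` is a hypothetical
# helper, not the paper's API.

def pad_to_multiple(dim: int, tile: int = 64) -> int:
    """Round a model dimension up to the nearest multiple of `tile`."""
    return ((dim + tile - 1) // tile) * tile

# GPT-2's vocabulary size of 50257 is awkward for tiled GEMMs; padding it
# to 50304 (= 64 * 786) keeps every tile fully utilized.
print(pad_to_multiple(50257))  # 50304
```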

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

1 code implementation • 16 Jan 2024 • Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

Unlike previous methods, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
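The general idea behind exploiting expert affinity can be sketched as grouping tokens by their destination expert before dispatch, so each device exchanges fewer, larger messages. The names and structure below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of expert-wise grouping in an MoE layer: bucketing
# tokens by their routed expert before the all-to-all exchange reduces
# per-message overhead. Not the paper's API.
from collections import defaultdict

def group_by_expert(token_ids, expert_assignments):
    """Bucket token indices by the expert each one is routed to."""
    buckets = defaultdict(list)
    for tok, expert in zip(token_ids, expert_assignments):
        buckets[expert].append(tok)
    return dict(buckets)

# Tokens 0 and 2 go to expert 1; tokens 1 and 3 go to expert 0.
print(group_by_expert([0, 1, 2, 3], [1, 0, 1, 0]))
```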

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference

1 code implementation • 22 May 2023 • Jinghan Yao, Nawras Alnaasan, Tian Chen, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

Inference on these models, by design, harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens.

Computational Efficiency

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

no code implementations • 15 Mar 2023 • Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda

However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales.
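A mix-and-match runtime in this spirit can be pictured as a dispatch layer that picks a communication backend per operation and message size. The thresholds and function below are illustrative assumptions, not MCR-DL's actual policy or API:

```python
# Illustrative dispatch sketch: route each communication operation to the
# backend expected to perform best for its type and message size, e.g. a
# GPU-optimized collectives library for large allreduces and MPI for
# small point-to-point messages. Thresholds are made up for illustration.

def pick_backend(op: str, msg_bytes: int) -> str:
    if op == "allreduce" and msg_bytes >= 1 << 20:
        return "nccl"  # large GPU collectives
    if op in ("send", "recv"):
        return "mpi"   # point-to-point messages
    return "mpi"       # fallback for everything else

print(pick_backend("allreduce", 4 << 20))  # nccl
print(pick_backend("send", 512))           # mpi
```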

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

no code implementations • 20 Oct 2021 • Nawras Alnaasan, Arpan Jain, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

However, there is currently no benchmark suite to evaluate communication performance of mpi4py -- and Python MPI codes in general -- on modern HPC systems.
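The core measurement such a suite performs is a timed ping-pong loop. The skeleton below shows only that timing structure in plain Python; OMB-Py itself times mpi4py send/recv between two ranks, which this stand-in `transfer` callable abstracts away:

```python
# Structural sketch of a point-to-point latency micro-benchmark: repeat a
# round-trip transfer many times and report the average. `transfer` is a
# hypothetical stand-in for an MPI send plus matching recv.
import time

def ping_pong_latency(transfer, msg, iters=1000):
    """Time `iters` round trips of `msg` and return average seconds."""
    start = time.perf_counter()
    for _ in range(iters):
        transfer(msg)  # stand-in for MPI send + matching recv
    return (time.perf_counter() - start) / iters

# With a no-op 'transfer' this measures only loop overhead:
lat = ping_pong_latency(lambda m: m, b"x" * 1024, iters=100)
print(lat >= 0.0)
```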

Efficient MPI-based Communication for GPU-Accelerated Dask Applications

1 code implementation • 21 Jan 2021 • Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda

This paper presents the design and implementation of a new communication backend for Dask -- called MPI4Dask -- that is targeted for modern HPC clusters built with GPUs.

Blocking • Distributed Computing
