no code implementations • 30 Jun 2022 • Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu
Furthermore, we present an inexpensive, heuristic-driven search algorithm that identifies promising heterogeneous compression configurations that meet a compression ratio constraint.
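As a rough illustration, the sketch below shows one way such a heuristic search could be structured: greedily upgrade the layer/option pair with the lowest estimated sensitivity until a global compression-ratio constraint is met. The option set, sensitivity scores, and greedy policy are illustrative assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of a greedy, sensitivity-driven search over per-layer
# compression options under a global compression-ratio constraint. Option set,
# sensitivity scores, and greedy policy are illustrative assumptions only.

def greedy_compression_search(layer_sizes, options, sensitivity, target_ratio):
    """layer_sizes: layer -> parameter count.
    options: option name -> compression ratio (e.g. {"fp16": 2, "int8": 4}).
    sensitivity: (layer, option) -> estimated accuracy loss (lower is better).
    target_ratio: required overall compression ratio."""
    total = sum(layer_sizes.values())
    # Start from the least aggressive option everywhere.
    config = {layer: min(options, key=options.get) for layer in layer_sizes}

    def overall_ratio(cfg):
        return total / sum(layer_sizes[l] / options[cfg[l]] for l in cfg)

    # Greedily take the cheapest (lowest-sensitivity) upgrade until the
    # constraint is met or no upgrades remain.
    while overall_ratio(config) < target_ratio:
        upgrades = [(sensitivity[(l, o)], l, o)
                    for l in config for o in options
                    if options[o] > options[config[l]]]
        if not upgrades:
            break
        _, layer, option = min(upgrades)
        config[layer] = option
    return config

layers = {"attn": 1_000_000, "ffn": 4_000_000}
options = {"fp16": 2.0, "int8": 4.0, "int4": 8.0}
sens = {(l, o): {"fp16": 0.0, "int8": 0.1, "int4": 0.5}[o] for l in layers for o in options}
print(greedy_compression_search(layers, options, sens, target_ratio=4.0))
```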
1 code implementation • 30 Jun 2022 • Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He
DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.
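For context, the snippet below is a minimal usage sketch of enabling DeepSpeed Inference on an off-the-shelf Hugging Face model; it assumes a CUDA GPU with `deepspeed` and `transformers` installed, and argument names (e.g. `mp_size`) may differ across DeepSpeed versions.

```python
# Minimal usage sketch of DeepSpeed Inference on a Hugging Face model. Assumes
# a CUDA GPU; exact init_inference arguments may vary by DeepSpeed version.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Inject DeepSpeed's fused inference kernels and optionally shard the model
# across mp_size GPUs with tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed Inference example:", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```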
1 code implementation • 4 Jun 2022 • Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.
1 code implementation • 4 Jun 2022 • Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.
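As a concrete illustration of ternary quantization, the sketch below maps each weight to one of {-alpha, 0, +alpha} using a magnitude threshold; the thresholding rule is a common heuristic in the style of ternary weight networks, not necessarily the scheme used in this paper.

```python
# Illustrative ternary weight quantization: each weight maps to one of
# {-alpha, 0, +alpha}. The threshold rule below is a common heuristic,
# not necessarily this paper's scheme.
import torch

def ternarize(weight: torch.Tensor) -> torch.Tensor:
    delta = 0.7 * weight.abs().mean()                                # magnitude threshold
    mask = (weight.abs() > delta).float()                            # which weights survive
    alpha = (weight.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # per-tensor scale
    return alpha * torch.sign(weight) * mask

w = torch.randn(768, 768)
print(torch.unique(ternarize(w)).numel())  # at most 3 distinct values
```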
no code implementations • 26 Apr 2022 • John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu
DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales.
1 code implementation • 12 Feb 2022 • Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He
1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD.
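The sketch below illustrates the core idea of 1-bit compression with an error-feedback buffer: transmit only the signs plus a single scale, and carry the quantization residual into the next step. The actual 1-bit algorithms combined with local steps in this line of work are considerably more involved.

```python
# Sketch of 1-bit (sign) gradient compression with error feedback: only signs
# and one scale per tensor need to be communicated, and the quantization
# residual is carried into the next step.
import torch

class OneBitCompressor:
    def __init__(self):
        self.error = None  # residual kept locally across steps

    def compress(self, grad: torch.Tensor) -> torch.Tensor:
        if self.error is None:
            self.error = torch.zeros_like(grad)
        corrected = grad + self.error          # error feedback
        scale = corrected.abs().mean()         # one scalar per tensor
        quantized = scale * torch.sign(corrected)
        self.error = corrected - quantized     # remember what was lost
        return quantized

comp = OneBitCompressor()
g = torch.randn(1024)
print(torch.unique(comp.compress(g)).numel())  # typically 2 values: +/- scale
```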
no code implementations • 29 Jan 2022 • Minjia Zhang, Niranjan Uma Naresh, Yuxiong He
In recent years, large pre-trained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks.
1 code implementation • 14 Jan 2022 • Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
As the training of giant dense models hits the limits of today's hardware resources in both availability and capability, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model.
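To see why MoE decouples parameter count from per-token compute, the minimal top-1 gated MoE layer below routes each token to a single expert; this is a generic sketch, not DeepSpeed-MoE itself.

```python
# Minimal top-1 gated mixture-of-experts (MoE) layer: parameters grow with the
# number of experts, but each token only runs through one expert, so per-token
# compute stays roughly constant. A generic sketch, not DeepSpeed-MoE itself.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                         # x: (tokens, d_model)
        probs = self.gate(x).softmax(dim=-1)      # routing probabilities
        top_prob, top_idx = probs.max(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e
            if sel.any():
                # Weight by the gate probability to keep routing differentiable.
                out[sel] = top_prob[sel].unsqueeze(-1) * expert(x[sel])
        return out

moe = Top1MoE(d_model=64, d_ff=256, num_experts=4)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```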
no code implementations • 28 Nov 2021 • Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Xulong Tang, ChenChen Liu, Xiang Chen
With both scaling trends, new problems and challenges emerge in DL inference serving systems, which are gradually evolving into Large-scale Deep learning Serving systems (LDS).
no code implementations • NeurIPS 2021 • Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu
In particular, we propose to formulate the NxM sparsity as a constrained optimization problem and use Alternating Direction Method of Multipliers (ADMM) to optimize the downstream tasks while taking the underlying hardware constraints into consideration.
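The constraint step of such an ADMM formulation is a projection onto the N:M pattern. The sketch below shows that projection for 2:4 sparsity (keep the 2 largest magnitudes in every group of 4 consecutive weights); the surrounding ADMM loop and the downstream-task loss are omitted.

```python
# Projection onto an N:M sparsity pattern (here 2:4): within every group of M
# consecutive weights, keep the N largest magnitudes and zero the rest. In an
# ADMM formulation this projection serves as the constraint step.
import torch

def project_nm(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    flat = weight.reshape(-1, m)                        # groups of m consecutive weights
    idx = flat.abs().topk(n, dim=1).indices             # positions of the n largest magnitudes
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_sparse = project_nm(w)
print((w_sparse != 0).float().mean().item())  # about 0.5 for 2:4 sparsity
```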
no code implementations • 14 Oct 2021 • Soobee Lee, Minindu Weerakoon, Jonghyun Choi, Minjia Zhang, Di Wang, Myeongjae Jeon
In particular, in mobile and IoT devices, real-time data can be stored not only in high-speed RAM but also in internal storage devices, which offer significantly larger capacity than RAM.
no code implementations • 29 Sep 2021 • Syed Zawad, Jun Yi, Minjia Zhang, Cheng Li, Feng Yan, Yuxiong He
Such data heterogeneity and privacy requirements bring unique challenges to hyperparameter optimization in federated learning: the training dynamics change across clients, even within the same training round, and are difficult to measure due to privacy constraints.
no code implementations • 29 Sep 2021 • Minjia Zhang, Connor Holmes, Yuxiong He, Bo Wu
In this work, we propose a unified, systematic approach to learning N:M sparsity and integer quantization for pre-trained Transformers using the Alternating Directions Method of Multipliers (ADMM).
no code implementations • 29 Sep 2021 • Minjia Zhang, Niranjan Uma Naresh, Yuxiong He
To address this challenge, we propose ScaLA, a scalable and robust method for large-batch optimization of transformer networks via adversarial perturbation.
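For intuition, the snippet below shows a generic one-step adversarial perturbation applied to the inputs of a model before computing the training loss; ScaLA's actual procedure is more elaborate, so treat this only as a sketch of the perturbation idea.

```python
# Generic one-step adversarial perturbation on the model inputs, a rough sketch
# of the perturbation idea (not ScaLA's exact algorithm): nudge the inputs in
# the direction that increases the loss, then train on the perturbed loss.
import torch

def perturbed_loss(model, inputs, labels, loss_fn, eps=1e-3):
    inputs = inputs.detach().requires_grad_(True)
    loss = loss_fn(model(inputs), labels)
    grad, = torch.autograd.grad(loss, inputs)       # gradient w.r.t. inputs
    delta = eps * grad / (grad.norm() + 1e-12)      # small worst-case step
    return loss_fn(model(inputs + delta), labels)   # loss at the perturbed point

model = torch.nn.Linear(16, 4)
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
print(perturbed_loss(model, x, y, torch.nn.functional.cross_entropy).item())
```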
1 code implementation • 13 Aug 2021 • Conglong Li, Minjia Zhang, Yuxiong He
To reduce their expensive training cost, practitioners attempt to increase the batch sizes and learning rates.
no code implementations • 27 Jul 2021 • Dantong Zhu, Minjia Zhang
Our experiments give guidance on how to approximate and generalize MRNG to build proximity graphs on a large scale.
3 code implementations • 18 Jan 2021 • Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He
By combining compute and memory efficiency with ease of use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.
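A hedged configuration sketch: ZeRO-Offload is typically enabled through DeepSpeed's config, offloading optimizer state to CPU memory. The keys below follow DeepSpeed's documented config schema but may vary by version; the script assumes at least one GPU, is meant to be launched with the `deepspeed` launcher, and the tiny linear model stands in for a real network.

```python
# Hypothetical minimal script enabling ZeRO-Offload via a DeepSpeed config
# (ZeRO stage 2 with optimizer state offloaded to CPU). Config keys may vary
# by DeepSpeed version; run with the `deepspeed` launcher on a GPU machine.
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # stand-in for a real, much larger model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# DeepSpeed builds the optimizer and partitions/offloads its state to CPU.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```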
no code implementations • ICLR 2021 • Minjia Zhang, Menghao Li, Chi Wang, Mingqin Li
Recently, the DL compiler, together with Learning to Compile, has proven to be a powerful technique for optimizing deep learning models.
no code implementations • NeurIPS 2020 • Menghao Li, Minjia Zhang, Chi Wang, Mingqin Li
Deep learning models are computationally intense, and implementations often have to be highly optimized by experts or hardware vendors to be usable in practice.
no code implementations • NeurIPS 2020 • Jie Ren, Minjia Zhang, Dong Li
The emergence of heterogeneous memory (HM) brings a solution to significantly increase memory capacity and break the above tradeoff: Using HM, billions of data points can be placed in the main memory on a single machine without using any data compression.
2 code implementations • NeurIPS 2020 • Minjia Zhang, Yuxiong He
Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains.
no code implementations • 4 Nov 2019 • Reza Yazdani, Olatunji Ruwase, Minjia Zhang, Yuxiong He, Jose-Maria Arnau, Antonio Gonzalez
To solve these issues, we propose an intelligent tile-based dispatching mechanism for increasing the adaptiveness of RNN computation, in order to efficiently handle the data dependencies.
no code implementations • 25 Sep 2019 • Minjia Zhang, Wenhan Wang, Yuxiong He
This paper studies similarity search, which is a crucial enabler of many feature-vector-based applications.
no code implementations • 11 Sep 2018 • Minjia Zhang, Yuxiong He
With the advancement of machine learning and deep learning, vector search has become instrumental to many information retrieval systems for finding the best matches to user queries based on semantic similarity. These online services require the search architecture to be both effective, with high accuracy, and efficient, with low latency and a small memory footprint, a combination that existing work fails to offer.
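As a point of reference for the accuracy/latency/memory trade-off, the example below runs graph-based approximate nearest neighbor search with the open-source hnswlib library; it illustrates the problem setting, not the system proposed in this paper.

```python
# Representative graph-based approximate nearest neighbor search with the
# open-source hnswlib library; shown to illustrate the accuracy/latency/memory
# trade-offs discussed above, not the paper's system.
import numpy as np
import hnswlib

dim, num_vectors = 128, 10_000
data = np.random.rand(num_vectors, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M and ef_construction trade index memory and build time against recall.
index.init_index(max_elements=num_vectors, M=16, ef_construction=200)
index.add_items(data, np.arange(num_vectors))

index.set_ef(50)  # higher ef -> better recall, higher query latency
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10)
```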
no code implementations • NeurIPS 2018 • Minjia Zhang, Xiaodong Liu, Wenhan Wang, Jianfeng Gao, Yuxiong He
Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks.
no code implementations • ICLR 2018 • Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li
This work aims to learn structurally-sparse Long Short-Term Memory (LSTM) by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states and outputs.
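Below is a minimal sketch of the kind of structured group-Lasso penalty that drives whole structures toward zero, grouping LSTM weights by hidden unit and gate; the paper's intrinsic-structure grouping is more refined, so the grouping here is only illustrative.

```python
# Sketch of a group-Lasso penalty over per-hidden-unit, per-gate weight groups
# of an LSTM; minimizing it alongside the task loss shrinks whole groups toward
# zero. The grouping here is illustrative, not the paper's exact structure.
import torch
import torch.nn as nn

def hidden_unit_group_lasso(lstm: nn.LSTM) -> torch.Tensor:
    penalty = lstm.weight_hh_l0.new_zeros(())
    hidden = lstm.hidden_size
    for w in (lstm.weight_ih_l0, lstm.weight_hh_l0):   # shape: (4 * hidden, in_dim)
        groups = w.view(4, hidden, -1)                  # one group per gate and hidden unit
        penalty = penalty + groups.norm(dim=2).sum()    # sum of group L2 norms
    return penalty

lstm = nn.LSTM(input_size=64, hidden_size=128)
print(hidden_unit_group_lasso(lstm).item())
```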