Search Results for author: Minjia Zhang

Found 26 papers, 8 papers with code

Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding

no code implementations30 Jun 2022 Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu

Furthermore, we present an inexpensive, heuristic-driven search algorithm that identifies promising heterogeneous compression configurations that meet a compression ratio constraint.

Natural Language Understanding Quantization
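For intuition, here is a minimal sketch of the kind of heuristic-driven configuration search described above: greedily push the least sensitive layers to more aggressive (bit-width, N:M-density) options until a target compression ratio is met. The option list, sensitivity scores, and the averaged ratio are hypothetical placeholders, not the paper's actual algorithm or cost model.

```python
# Hypothetical greedy search over per-layer compression choices under a
# global compression-ratio constraint (illustration only).
OPTIONS = [(8, 1.0), (8, 0.5), (4, 0.5)]  # (bits, density): INT8 dense, INT8 2:4, INT4 2:4

def compression_ratio(bits, density):
    return 32.0 / (bits * density)        # relative to dense FP32

def search(layer_sensitivity, target_ratio):
    """Assign aggressive options to the least sensitive layers first."""
    config = {name: OPTIONS[0] for name in layer_sensitivity}
    for name in sorted(layer_sensitivity, key=layer_sensitivity.get):
        for option in OPTIONS[1:]:
            config[name] = option
            # Crude average stands in for the true model-level ratio.
            avg = sum(compression_ratio(b, d) for b, d in config.values()) / len(config)
            if avg >= target_ratio:
                return config             # constraint satisfied, stop early
    return config

print(search({"layer0": 0.1, "layer1": 0.9, "layer2": 0.3}, target_ratio=6.0))
```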

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

1 code implementation30 Jun 2022 Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He

DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

1 code implementation4 Jun 2022 Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.

Knowledge Distillation Quantization

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

1 code implementation4 Jun 2022 Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.

Knowledge Distillation Quantization
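To make the binary/ternary quantization mentioned above concrete, the sketch below ternarizes a weight matrix to {-scale, 0, +scale} with a per-tensor threshold and scale. This is a generic ternarization scheme for illustration; the paper's actual pipeline (and its distillation setup) is more involved.

```python
import numpy as np

def ternarize(w, threshold_ratio=0.7):
    """Generic ternary quantization: weights become {-scale, 0, +scale}."""
    delta = threshold_ratio * np.mean(np.abs(w))       # magnitude threshold
    mask = np.abs(w) > delta                           # weights that stay nonzero
    scale = np.abs(w[mask]).mean() if mask.any() else 0.0
    return scale * np.sign(w) * mask

w = np.random.randn(4, 4).astype(np.float32)
print(ternarize(w))                                    # ternary version of w
```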

Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

no code implementations26 Apr 2022 John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu

DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales.

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

1 code implementation12 Feb 2022 Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD.
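For reference, 1-bit (sign) gradient compression with error feedback typically looks like the sketch below on a single worker. This is a generic illustration, not 0/1 Adam itself, which additionally exploits the Adam variance state and local steps to decide when compression is safe.

```python
import numpy as np

class OneBitCompressor:
    """Generic sign-based 1-bit gradient compression with error feedback."""
    def __init__(self, shape):
        self.error = np.zeros(shape)              # residual kept on the worker

    def compress(self, grad):
        corrected = grad + self.error             # re-inject what was lost last round
        scale = np.abs(corrected).mean()          # one scalar accompanies the sign bits
        signs = np.sign(corrected)                # 1 bit per element on the wire
        self.error = corrected - scale * signs    # remember the compression error
        return scale, signs

    @staticmethod
    def decompress(scale, signs):
        return scale * signs

comp = OneBitCompressor(shape=(8,))
print(OneBitCompressor.decompress(*comp.compress(np.random.randn(8))))
```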

ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise

no code implementations29 Jan 2022 Minjia Zhang, Niranjan Uma Naresh, Yuxiong He

In recent years, large pre-trained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks.

Natural Language Understanding

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

1 code implementation14 Jan 2022 Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

As the training of giant dense models hits the limits of the availability and capability of today's hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model.

Model Compression
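The cost advantage of MoE models comes from conditional computation: each token activates only a small subset of the experts, so parameter count grows without a proportional growth in per-token compute. The toy top-1 routing layer below illustrates only that idea; it is not the DeepSpeed-MoE architecture.

```python
import numpy as np

def moe_layer(x, gate_w, experts):
    """Toy top-1 MoE layer: each token is routed to exactly one expert."""
    choice = (x @ gate_w).argmax(axis=1)          # routing decision per token
    out = np.empty_like(x)
    for e, w in enumerate(experts):
        sel = choice == e
        out[sel] = x[sel] @ w                     # only routed tokens touch this expert
    return out

d, n_experts, tokens = 16, 4, 10
x = np.random.randn(tokens, d)
experts = [np.random.randn(d, d) for _ in range(n_experts)]
print(moe_layer(x, np.random.randn(d, n_experts), experts).shape)
```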

A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities

no code implementations28 Nov 2021 Fuxun Yu, Di Wang, Longfei Shangguan, Minjia Zhang, Xulong Tang, ChenChen Liu, Xiang Chen

With both scaling trends, new problems and challenges emerge in DL inference serving systems, which are gradually trending towards Large-scale Deep learning Serving systems (LDS).

NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM

no code implementations NeurIPS 2021 Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu

In particular, we propose to formulate the NxM sparsity as a constrained optimization problem and use Alternating Direction Method of Multipliers (ADMM) to optimize the downstream tasks while taking the underlying hardware constraints into consideration.

Knowledge Distillation Natural Language Processing +2
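The constraint set in this formulation is the family of N:M-sparse weight matrices, and the projection onto it (keep the N largest-magnitude entries in each group of M) is cheap to compute; ADMM then alternates between this projection and gradient-based training on the downstream task. The snippet below sketches only the projection step, not the full method.

```python
import numpy as np

def project_nm(w, n=2, m=4):
    """Project onto N:M sparsity: in every group of m consecutive weights,
    keep the n largest-magnitude entries and zero out the rest."""
    flat = w.reshape(-1, m)
    keep = np.argsort(np.abs(flat), axis=1)[:, -n:]      # indices kept per group
    mask = np.zeros_like(flat)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return (flat * mask).reshape(w.shape)

w = np.random.randn(4, 8)
print(project_nm(w))                                      # 2:4-sparse version of w
```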

Carousel Memory: Rethinking the Design of Episodic Memory for Continual Learning

no code implementations14 Oct 2021 Soobee Lee, Minindu Weerakoon, Jonghyun Choi, Minjia Zhang, Di Wang, Myeongjae Jeon

In particular, in mobile and IoT devices, real-time data can be stored not just in high-speed RAMs but in internal storage devices as well, which offer significantly larger capacity than the RAMs.

Continual Learning Management

Demystifying Hyperparameter Optimization in Federated Learning

no code implementations29 Sep 2021 Syed Zawad, Jun Yi, Minjia Zhang, Cheng Li, Feng Yan, Yuxiong He

Such data heterogeneity and privacy requirements bring unique challenges for hyperparameter optimization, as the training dynamics change across clients even within the same training round and are difficult to measure due to privacy constraints.

Federated Learning Hyperparameter Optimization +1

HoloFormer: Deep Compression of Pre-Trained Transformers via Unified Optimization of N:M Sparsity and Integer Quantization

no code implementations29 Sep 2021 Minjia Zhang, Connor Holmes, Yuxiong He, Bo Wu

In this work, we propose a unified, systematic approach to learning N:M sparsity and integer quantization for pre-trained Transformers using the Alternating Directions Method of Multipliers (ADMM).

Natural Language Processing Quantization
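Schematically, an ADMM formulation of this kind alternates three updates: a gradient step on the task loss plus an augmented penalty, a projection of the penalized copy onto the feasible set (here, sparse and integer-quantized weights), and a dual update. The toy loop below illustrates that alternation with a placeholder projection and loss; it is not the HoloFormer training procedure.

```python
import numpy as np

def project(z):
    """Placeholder projection onto the feasible set
    (stand-in for N:M-sparse, integer-quantized weights)."""
    return np.round(np.clip(z, -4, 4))

def admm(w, loss_grad, rho=0.1, lr=0.01, steps=200):
    z, u = project(w.copy()), np.zeros_like(w)
    for _ in range(steps):
        w -= lr * (loss_grad(w) + rho * (w - z + u))   # primal: task loss + penalty
        z = project(w + u)                             # projection onto constraints
        u += w - z                                     # dual: accumulate violation
    return z                                           # feasible weights

w0 = np.random.randn(8)
print(admm(w0, loss_grad=lambda w: 2.0 * (w - 1.5)))   # toy quadratic task loss
```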

ScaLA: Speeding-Up Fine-tuning of Pre-trained Transformer Networks via Efficient and Scalable Adversarial Perturbation

no code implementations29 Sep 2021 Minjia Zhang, Niranjan Uma Naresh, Yuxiong He

To address this challenge, we propose ScaLA, a scalable and robust method for large-batch optimization of transformer networks via adversarial perturbation.
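Adversarial perturbation here means taking a small loss-maximizing step on the continuous input representations before computing the weight gradient. The toy logistic-regression sketch below shows that ingredient in isolation; ScaLA's actual algorithm targets large-batch Transformer fine-tuning and differs in how the perturbation and updates are combined.

```python
import numpy as np

def grads(x, y, w):
    """Toy logistic-regression loss; gradients w.r.t. weights and inputs."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    err = (p - y) / len(y)
    return x.T @ err, np.outer(err, w)          # grad_w, grad_x

def adversarial_step(x, y, w, lr=0.1, eps=0.05):
    _, gx = grads(x, y, w)
    x_adv = x + eps * np.sign(gx)               # loss-maximizing input perturbation
    gw, _ = grads(x_adv, y, w)                  # weight gradient on the perturbed batch
    return w - lr * gw

w = np.zeros(4)
x, y = np.random.randn(64, 4), np.random.randint(0, 2, 64).astype(float)
for _ in range(50):
    w = adversarial_step(x, y, w)
print(w)
```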

Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training

1 code implementation13 Aug 2021 Conglong Li, Minjia Zhang, Yuxiong He

To reduce their expensive training cost, practitioners attempt to increase the batch sizes and learning rates.

LAMBADA Text Generation

Understanding and Generalizing Monotonic Proximity Graphs for Approximate Nearest Neighbor Search

no code implementations27 Jul 2021 Dantong Zhu, Minjia Zhang

Our experiments give guidance on how to approximate and generalize MRNG to build proximity graphs on a large scale.

Mathematical Proofs
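At query time, MRNG-style proximity graphs are used with simple greedy routing: move from the current vertex to whichever neighbor is closest to the query, and stop at a local minimum. The sketch below shows that generic routine on a toy k-NN graph; it is not the paper's construction or approximation algorithm.

```python
import numpy as np

def greedy_search(graph, points, query, start=0):
    """Greedy routing on a proximity graph."""
    cur = start
    cur_d = np.linalg.norm(points[cur] - query)
    while True:
        best, best_d = cur, cur_d
        for nb in graph[cur]:
            d = np.linalg.norm(points[nb] - query)
            if d < best_d:
                best, best_d = nb, d
        if best == cur:                         # no neighbor improves: local minimum
            return cur, cur_d
        cur, cur_d = best, best_d

pts = np.random.randn(100, 8)
dists = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
graph = {i: list(np.argsort(dists[i])[1:6]) for i in range(len(pts))}   # toy 5-NN graph
print(greedy_search(graph, pts, query=np.random.randn(8)))
```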

ZeRO-Offload: Democratizing Billion-Scale Model Training

3 code implementations18 Jan 2021 Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He

By combining compute and memory efficiency with ease-of-use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.
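The general recipe behind offload-based training is to keep the large optimizer state in cheap, plentiful CPU memory and ship only gradients and updates across the PCIe link each step. The sketch below illustrates that recipe with a bare-bones Adam step (bias correction omitted); it is a conceptual illustration only, not the ZeRO-Offload implementation, which partitions state and overlaps transfers with computation.

```python
import torch

gpu = "cuda" if torch.cuda.is_available() else "cpu"      # fallback so the sketch runs anywhere

param = torch.randn(1024, 1024, device=gpu, requires_grad=True)
m = torch.zeros(1024, 1024)                                # Adam moments stay in CPU memory
v = torch.zeros(1024, 1024)

def offloaded_adam_step(param, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    grad = param.grad.detach().to("cpu")                   # gradient shipped to CPU
    m.mul_(beta1).add_(grad, alpha=1 - beta1)              # moment updates happen on CPU
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    update = lr * m / (v.sqrt() + eps)
    param.data.add_(update.to(gpu), alpha=-1.0)            # only the update returns to GPU

(param ** 2).sum().backward()
offloaded_adam_step(param)
print(param.norm())
```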

DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation

no code implementations ICLR 2021 Minjia Zhang, Menghao Li, Chi Wang, Mingqin Li

Recently, the DL compiler, together with Learning to Compile, has proven to be a powerful technique for optimizing deep learning models.

Decision Making

AdaTune: Adaptive Tensor Program Compilation Made Efficient

no code implementations NeurIPS 2020 Menghao Li, Minjia Zhang, Chi Wang, Mingqin Li

Deep learning models are computationally intense, and implementations often have to be highly optimized by experts or hardware vendors to be usable in practice.

HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory

no code implementations NeurIPS 2020 Jie Ren, Minjia Zhang, Dong Li

The emergence of heterogeneous memory (HM) brings a solution to significantly increase memory capacity and break the above tradeoff: Using HM, billions of data points can be placed in the main memory on a single machine without using any data compression.

Data Compression Quantization

SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Network

no code implementations4 Nov 2019 Reza Yazdani, Olatunji Ruwase, Minjia Zhang, Yuxiong He, Jose-Maria Arnau, Antonio Gonzalez

To solve these issues, we propose an intelligent tile-based dispatching mechanism that increases the adaptiveness of RNN computation in order to efficiently handle the data dependencies.

Automatic Speech Recognition speech-recognition

Learning to Anneal and Prune Proximity Graphs for Similarity Search

no code implementations25 Sep 2019 Minjia Zhang, Wenhan Wang, Yuxiong He

This paper studies similarity search, which is a crucial enabler of many feature-vector-based applications.

Stochastic Optimization

Zoom: SSD-based Vector Search for Optimizing Accuracy, Latency and Memory

no code implementations11 Sep 2018 Minjia Zhang, Yuxiong He

With the advancement of machine learning and deep learning, vector search has become instrumental to many information retrieval systems that find the best matches to user queries based on semantic similarity. These online services require the search architecture to be both effective, with high accuracy, and efficient, with low latency and memory footprint, which existing work fails to offer.

Information Retrieval

Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models

no code implementations NeurIPS 2018 Minjia Zhang, Xiaodong Liu, Wenhan Wang, Jianfeng Gao, Yuxiong He

Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks.

Language Modelling Machine Translation +2

Learning Intrinsic Sparse Structures within Long Short-Term Memory

no code implementations ICLR 2018 Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li

This work aims to learn structurally-sparse Long Short-Term Memory (LSTM) by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states and outputs.

Language Modelling Model Compression +1
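Structured sparsity of this kind is typically induced with a group-wise regularizer: all weights tied to one hidden unit, across the four LSTM gates and the recurrent connections, form a single group, and a group-lasso penalty drives entire groups to zero so the hidden dimension itself can shrink. The sketch below shows such a generic penalty; the grouping is illustrative and not the exact ISS definition from the paper.

```python
import numpy as np

def group_lasso_penalty(w_ih, w_hh, hidden_size):
    """Sum of L2 norms over per-hidden-unit groups spanning all four LSTM gates.
    w_ih: (4*hidden, input), w_hh: (4*hidden, hidden), PyTorch-style layout."""
    penalty = 0.0
    for h in range(hidden_size):
        rows = [g * hidden_size + h for g in range(4)]    # this unit's row in each gate
        group = np.concatenate([w_ih[rows].ravel(),       # input weights feeding the unit
                                w_hh[rows].ravel(),       # recurrent weights feeding the unit
                                w_hh[:, h].ravel()])      # the unit's outgoing recurrent weights
        penalty += np.linalg.norm(group)                  # a zero group means the unit can be removed
    return penalty

hidden, inp = 8, 5
print(group_lasso_penalty(np.random.randn(4 * hidden, inp),
                          np.random.randn(4 * hidden, hidden), hidden))
```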
