Search Results for author: Yuxiong He

Found 50 papers, 30 papers with code

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

2 code implementations25 Jan 2024 Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference.

Quantization
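
As a rough illustration of the weight-only low-bit idea this paper builds on, the sketch below snaps weights onto a toy FP6-like grid (1 sign, 3 exponent, 2 mantissa bits) and dequantizes them before a dense matmul. It is a NumPy approximation under those assumptions, not the paper's Tensor Core kernel, and all names are illustrative.

```python
import numpy as np

def fp6_grid():
    # Positive values of a toy e3m2 float format (bias 3); illustrative only.
    vals = {0.0}
    for e in range(8):          # 3 exponent bits
        for m in range(4):      # 2 mantissa bits
            if e == 0:          # subnormal range
                vals.add((m / 4.0) * 2.0 ** (1 - 3))
            else:               # normal range
                vals.add((1.0 + m / 4.0) * 2.0 ** (e - 3))
    return np.array(sorted(vals))

def quantize_dequantize(w):
    # Snap each weight magnitude to the nearest grid point after per-tensor scaling.
    grid = fp6_grid()
    scale = np.abs(w).max() / grid.max()
    idx = np.abs(np.abs(w) / scale - grid[:, None]).argmin(axis=0)
    return np.sign(w) * grid[idx] * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 16)).astype(np.float32)
w_q = quantize_dequantize(w.ravel()).reshape(w.shape)
y = x @ w_q   # weights stored in ~6 bits, GEMM still runs in higher precision
```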

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

2 code implementations9 Jan 2024 Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems.

Benchmarking Text Generation

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies

no code implementations6 Oct 2023 Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov, Mohammed AlQuraishi, Gustaf Ahdritz, Christina Floristean, Cristina Negri, Rao Kotamarthi, Venkatram Vishwanath, Arvind Ramanathan, Sam Foreman, Kyle Hippe, Troy Arcomano, Romit Maulik, Maxim Zvyagin, Alexander Brace, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, J. Gregory Pauloski, Logan Ward, Valerie Hayot, Murali Emani, Zhen Xie, Diangen Lin, Maulik Shukla, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Prasanna Balaprakash, Gina Tourassi, John Gounley, Heidi Hanson, Thomas E Potok, Massimiliano Lupo Pasini, Kate Evans, Dan Lu, Dalton Lunga, Junqi Yin, Sajal Dash, Feiyi Wang, Mallikarjun Shankar, Isaac Lyngaas, Xiao Wang, Guojing Cong, Pei Zhang, Ming Fan, Siyan Liu, Adolfy Hoisie, Shinjae Yoo, Yihui Ren, William Tang, Kyle Felker, Alexey Svyatkovskiy, Hang Liu, Ashwin Aji, Angela Dalton, Michael Schulte, Karl Schulz, Yuntian Deng, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Anima Anandkumar, Rick Stevens

In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences.

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

2 code implementations25 Sep 2023 Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He

Most existing multi-modal models are hindered by their inability to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, and they face substantial constraints in training resources and data accessibility, limiting their adaptability and scalability across varied interaction settings.

Language Modelling

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

1 code implementation25 Sep 2023 Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length.

Language Modelling Large Language Model
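
A standard back-of-envelope estimate (not taken from the paper) makes this characterization concrete: per-layer compute has a term quadratic in the hidden dimension and a term quadratic in the sequence length, so very long sequences are dominated by attention. The constants below are the usual rough approximations.

```python
def per_layer_flops(batch, seq, hidden):
    # Rough forward-pass estimates: QKV/output projections + 4x MLP vs. attention scores.
    dense = 24 * batch * seq * hidden ** 2
    attention = 4 * batch * seq ** 2 * hidden
    return dense, attention

for seq in (2_048, 32_768, 262_144):
    dense, attn = per_layer_flops(batch=1, seq=seq, hidden=8_192)
    print(f"seq={seq:>7}: dense {dense:.2e} FLOPs, attention {attn:.2e} FLOPs")
```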

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

1 code implementation19 Jul 2023 Xiaoxia Wu, Zhewei Yao, Yuxiong He

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge.

Computational Efficiency Quantization

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

1 code implementation16 Jun 2023 Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability.

Quantization

Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important?

no code implementations16 May 2023 Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He

This study examines the impact of optimizing the Stable Diffusion (SD) guided inference pipeline.

Denoising

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

2 code implementations15 Mar 2023 Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He

Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).

Quantization
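
As a hedged sketch of the low-rank compensation idea named in the title (illustrative NumPy, not the paper's exact procedure): quantize weights with round-to-nearest INT4, then fit a small low-rank term to the residual quantization error.

```python
import numpy as np

def rtn_int4(w):
    # Symmetric round-to-nearest onto the 4-bit grid [-7, 7].
    scale = np.abs(w).max() / 7
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
w_q = rtn_int4(w)

u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
rank = 8
lorc = (u[:, :rank] * s[:rank]) @ vt[:rank]   # low-rank compensation of the error

print(np.linalg.norm(w - w_q), np.linalg.norm(w - (w_q + lorc)))
```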

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning

no code implementations15 Mar 2023 Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda

However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales.

Scaling Vision-Language Models with Sparse Mixture of Experts

no code implementations13 Mar 2023 Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs).

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

1 code implementation11 Mar 2023 Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele

Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs.
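
A minimal sketch (illustrative, not DeepSpeed's implementation) of why MoE adds parameters without adding per-token compute: each token is routed to a single expert, so only one expert's weights are used per token.

```python
import numpy as np

class Top1MoE:
    """Toy sparsely activated FFN: many experts, one used per token."""
    def __init__(self, d_model, d_ff, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.standard_normal((d_model, n_experts)) * 0.02
        self.w_in = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
        self.w_out = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02

    def __call__(self, x):                                 # x: [tokens, d_model]
        choice = (x @ self.router).argmax(axis=-1)         # top-1 routing
        y = np.zeros_like(x)
        for e in range(self.w_in.shape[0]):
            sel = choice == e
            hidden = np.maximum(x[sel] @ self.w_in[e], 0.0)
            y[sel] = hidden @ self.w_out[e]
        return y

moe = Top1MoE(d_model=64, d_ff=256, n_experts=8)
tokens = np.random.default_rng(1).standard_normal((32, 64))
out = moe(tokens)   # 8x the FFN parameters, but each token touches only one expert
```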

Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

1 code implementation27 Jan 2023 Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost.

Quantization

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

1 code implementation7 Dec 2022 Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He

Compared to the rapidly evolving model architectures, how to use training data efficiently (especially for expensive foundation model pretraining) is both less explored and harder to realize, due to the lack of a convenient framework focused on data efficiency capabilities.

Language Modelling Large Language Model

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

1 code implementation17 Nov 2022 Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He

Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP.

BiFeat: Supercharge GNN Training via Graph Feature Quantization

1 code implementation29 Jul 2022 Yuxin Ma, Ping Gong, Jun Yi, Zhewei Yao, Cheng Li, Yuxiong He, Feng Yan

We identify the main accuracy impact factors in graph feature quantization and theoretically prove that BiFeat training converges to a network where the loss is within $\epsilon$ of the optimal loss of the uncompressed network.

Quantization

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

2 code implementations30 Jun 2022 Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He

DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.

Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding

no code implementations30 Jun 2022 Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu

Furthermore, we present an inexpensive, heuristic-driven search algorithm that identifies promising heterogeneous compression configurations that meet a compression ratio constraint.

Natural Language Understanding Quantization

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

1 code implementation4 Jun 2022 Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.

Knowledge Distillation Quantization

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

3 code implementations4 Jun 2022 Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.

Knowledge Distillation Quantization

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

1 code implementation12 Feb 2022 Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He

1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD.

Open-Ended Question Answering
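
A hedged sketch of the first of those two techniques, sign-based 1-bit compression with error feedback, which is a standard building block in this line of work; it is not the exact 0/1 Adam update rule, and the names are illustrative.

```python
import numpy as np

def one_bit_compress(grad, error):
    corrected = grad + error                   # fold in previously lost signal
    scale = np.abs(corrected).mean()           # one floating-point scalar per tensor
    compressed = scale * np.sign(corrected)    # 1 bit per element on the wire
    return compressed, corrected - compressed  # new error to carry forward

rng = np.random.default_rng(0)
error = np.zeros(1_000)
for step in range(5):
    grad = rng.standard_normal(1_000)
    sent, error = one_bit_compress(grad, error)   # `sent` is what gets all-reduced
```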

ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise

no code implementations29 Jan 2022 Minjia Zhang, Niranjan Uma Naresh, Yuxiong He

In recent years, large pre-trained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks.

Natural Language Understanding

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

3 code implementations14 Jan 2022 Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model.

Model Compression

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement

1 code implementation NeurIPS 2021 Heyang Qin, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

In this paper, we propose a fully automated and lightweight adaptive batching methodology to enable fine-grained batch size adaptation (e.g., at a mini-batch level) that can achieve state-of-the-art performance with record-breaking batch sizes.

Computational Efficiency
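
A hedged sketch of the signal named in the title, gradient similarity between two halves of a batch; the threshold rule below is purely illustrative and not SimiGrad's actual controller.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def grad(w, x, y):
    # Gradient of mean squared error for a toy linear model.
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 10))
y = x @ rng.standard_normal(10) + 0.1 * rng.standard_normal(256)
w = np.zeros(10)

g_first, g_second = grad(w, x[:128], y[:128]), grad(w, x[128:], y[128:])
similarity = cosine(g_first, g_second)
batch_size = 512 if similarity > 0.9 else 256   # illustrative adaptation rule
print(similarity, batch_size)
```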

NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM

no code implementations NeurIPS 2021 Connor Holmes, Minjia Zhang, Yuxiong He, Bo Wu

In particular, we propose to formulate the NxM sparsity as a constrained optimization problem and use Alternating Direction Method of Multipliers (ADMM) to optimize the downstream tasks while taking the underlying hardware constraints into consideration.

Knowledge Distillation Natural Language Understanding
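
A hedged sketch of the N:M constraint itself, shown as the magnitude-based projection such ADMM loops alternate with (2:4 here); the full ADMM optimizer and the hardware considerations from the paper are omitted.

```python
import numpy as np

def project_nm(w, n=2, m=4):
    # Keep the n largest-magnitude weights in every group of m, zero the rest.
    groups = w.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.default_rng(0).standard_normal((8, 16))
w_24 = project_nm(w)
assert (w_24.reshape(-1, 4) != 0).sum(axis=1).max() <= 2   # 2:4 structure holds
```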

ScaLA: Speeding-Up Fine-tuning of Pre-trained Transformer Networks via Efficient and Scalable Adversarial Perturbation

no code implementations29 Sep 2021 Minjia Zhang, Niranjan Uma Naresh, Yuxiong He

To address this challenge, we propose ScaLA, a scalable and robust method for large-batch optimization of transformer networks via adversarial perturbation.

Demystifying Hyperparameter Optimization in Federated Learning

no code implementations29 Sep 2021 Syed Zawad, Jun Yi, Minjia Zhang, Cheng Li, Feng Yan, Yuxiong He

Such data heterogeneity and privacy requirements pose unique challenges for hyperparameter optimization in federated learning: training dynamics change across clients even within the same training round, and they are difficult to measure due to privacy constraints.

Federated Learning Hyperparameter Optimization +1

HoloFormer: Deep Compression of Pre-Trained Transformers via Unified Optimization of N:M Sparsity and Integer Quantization

no code implementations29 Sep 2021 Minjia Zhang, Connor Holmes, Yuxiong He, Bo Wu

In this work, we propose a unified, systematic approach to learning N:M sparsity and integer quantization for pre-trained Transformers using the Alternating Directions Method of Multipliers (ADMM).

Quantization

Scalable and Efficient MoE Training for Multitask Multilingual Models

1 code implementation22 Sep 2021 Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, Hany Hassan Awadalla

By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation which results in a great improvement in model accuracy.

Machine Translation Text Generation

The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

1 code implementation13 Aug 2021 Conglong Li, Minjia Zhang, Yuxiong He

To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate.

LAMBADA Text Generation
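
A hedged sketch of a sequence length warmup schedule of the kind named in the title (a simple linear ramp; the paper's exact schedule and hyperparameters may differ).

```python
def warmup_seq_len(step, warmup_steps=10_000, start_len=64, full_len=2_048):
    # Linearly grow the training sequence length, rounded to a multiple of 8.
    if step >= warmup_steps:
        return full_len
    length = start_len + (full_len - start_len) * step / warmup_steps
    return max(start_len, int(length) // 8 * 8)

for step in (0, 2_500, 5_000, 10_000):
    print(step, warmup_seq_len(step))
```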

LEAP: Learnable Pruning for Transformer-based Models

1 code implementation30 May 2021 Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He

Moreover, in order to reduce hyperparameter tuning, a novel adaptive regularization coefficient is deployed to control the regularization penalty adaptively.

QQP

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

no code implementations16 Apr 2021 Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He

It requires 800 NVIDIA V100 GPUs just to fit a trillion parameter model for training, and such clusters are simply out of reach for most data scientists.
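
The back-of-envelope accounting below (the standard mixed-precision Adam estimate, not figures taken from the paper) shows why hundreds of 32 GB GPUs are needed before activations and buffers are even counted.

```python
params = 1.0e12                       # one trillion parameters
bytes_per_param = 2 + 2 + 4 + 4 + 4   # fp16 weights + fp16 grads + fp32 weights/momentum/variance
model_states_tb = params * bytes_per_param / 1e12
gpus_for_states = params * bytes_per_param / 32e9   # 32 GB per V100
print(f"{model_states_tb:.0f} TB of model states, >= {gpus_for_states:.0f} GPUs just to hold them")
```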

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

1 code implementation13 Apr 2021 Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He

To this end, we design a new communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression.

8k

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

2 code implementations4 Feb 2021 Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He

One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression.

ZeRO-Offload: Democratizing Billion-Scale Model Training

3 code implementations18 Jan 2021 Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He

By combining compute and memory efficiency with ease of use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.

Computational Efficiency

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

no code implementations26 Aug 2020 Hanlin Tang, Shaoduo Gan, Samyam Rajbhandari, Xiangru Lian, Ji Liu, Yuxiong He, Ce Zhang

Adam is an important optimization algorithm for achieving both efficiency and accuracy when training many important workloads such as BERT and ImageNet.

SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Network

no code implementations4 Nov 2019 Reza Yazdani, Olatunji Ruwase, Minjia Zhang, Yuxiong He, Jose-Maria Arnau, Antonio Gonzalez

To solve these issues, we propose an intelligent tile-based dispatching mechanism for increasing the adaptiveness of RNN computation, in order to handle the data dependencies efficiently.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

10 code implementations4 Oct 2019 Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging.

Cross-Lingual Document Classification Image Generation +1
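
A hedged sketch of the memory accounting behind ZeRO's partitioning stages (standard mixed-precision Adam assumptions; exact numbers depend on configuration): sharding optimizer states, gradients, and parameters across the data-parallel group divides those terms by the group size.

```python
def per_gpu_gb(params, n_gpus, shard_os=True, shard_grad=False, shard_param=False):
    # Bytes per parameter: 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 optimizer states).
    p16, g16, os32 = 2.0 * params, 2.0 * params, 12.0 * params
    if shard_param: p16 /= n_gpus
    if shard_grad:  g16 /= n_gpus
    if shard_os:    os32 /= n_gpus
    return (p16 + g16 + os32) / 1e9

for name, kwargs in [("no sharding", dict(shard_os=False)),
                     ("stage 1",     dict()),
                     ("stage 2",     dict(shard_grad=True)),
                     ("stage 3",     dict(shard_grad=True, shard_param=True))]:
    print(name, round(per_gpu_gb(7.5e9, 64, **kwargs), 1), "GB per GPU")
```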

Learning to Anneal and Prune Proximity Graphs for Similarity Search

no code implementations25 Sep 2019 Minjia Zhang, Wenhan Wang, Yuxiong He

This paper studies similarity search, which is a crucial enabler of many feature vector-based applications.

Stochastic Optimization

Zoom: SSD-based Vector Search for Optimizing Accuracy, Latency and Memory

no code implementations11 Sep 2018 Minjia Zhang, Yuxiong He

With the advancement of machine learning and deep learning, vector search has become instrumental to many information retrieval systems for finding the best matches to user queries based on their semantic similarities. These online services require the search architecture to be both effective, with high accuracy, and efficient, with low latency and memory footprint, which existing work fails to offer.

Information Retrieval Retrieval
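
For context, a hedged sketch of the exact brute-force version of the operation being served (top-k by cosine similarity); systems like Zoom trade this exhaustive scan for approximate, SSD-resident indexes to hit latency and memory targets.

```python
import numpy as np

def top_k(query, corpus, k=5):
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                                  # cosine similarity to every vector
    best = np.argpartition(-scores, k)[:k]
    return best[np.argsort(-scores[best])]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
print(top_k(query, corpus))
```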

Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models

no code implementations NeurIPS 2018 Minjia Zhang, Xiaodong Liu, Wenhan Wang, Jianfeng Gao, Yuxiong He

Neural language models (NLMs) have recently gained a renewed interest by achieving state-of-the-art performance across many natural language processing (NLP) tasks.

Language Modelling Machine Translation +1

Learning Intrinsic Sparse Structures within Long Short-Term Memory

no code implementations ICLR 2018 Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li

This work aims to learn structurally-sparse Long Short-Term Memory (LSTM) by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states and outputs.

Language Modelling Model Compression +1
