Search Results for author: Zhewei Yao

Found 50 papers, 38 papers with code

I-BERT: Integer-only BERT Quantization

4 code implementations • 5 Jan 2021 • Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks.

Natural Language Inference Natural Language Understanding +1

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

3 code implementations • 14 Jan 2022 • Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model.

Model Compression
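The cost advantage comes from sparse activation: each token is routed to only one (or a few) expert feed-forward networks, so per-token compute stays roughly constant while the parameter count grows with the number of experts. Below is a minimal, hedged sketch of a top-1 gated MoE layer in PyTorch, illustrative only and not DeepSpeed-MoE's implementation:

```python
# Hedged sketch: a top-1 gated mixture-of-experts layer. Only one expert FFN
# runs per token, so compute per token is ~constant while total parameters
# scale with the number of experts. Not DeepSpeed-MoE's actual code.
import torch

class Top1MoE(torch.nn.Module):
    def __init__(self, dim, hidden, n_experts):
        super().__init__()
        self.gate = torch.nn.Linear(dim, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, hidden), torch.nn.GELU(),
                                torch.nn.Linear(hidden, dim))
            for _ in range(n_experts))

    def forward(self, x):                              # x: (tokens, dim)
        probs = torch.softmax(self.gate(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)               # route each token to one expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(1) * expert(x[sel])
        return out

layer = Top1MoE(dim=64, hidden=256, n_experts=8)       # ~8x the FFN parameters,
y = layer(torch.randn(16, 64))                         # but ~1x the FFN compute per token
```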

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

3 code implementations • 4 Jun 2022 • Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.

Knowledge Distillation Quantization

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

1 code implementation • 4 Jun 2022 • Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.

Knowledge Distillation Quantization

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

1 code implementation • 17 Nov 2022 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He

Large-scale transformer models have become the de facto architectures for various machine learning applications, e.g., CV and NLP.

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

1 code implementation • 7 Dec 2022 • Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He

Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities.

Language Modelling Large Language Model

Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

1 code implementation • 27 Jan 2023 • Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost.

Quantization

ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

2 code implementations • 15 Mar 2023 • Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He

Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).

Quantization
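The "low rank compensation" in the title can be illustrated generically: quantize a weight matrix, then approximate the remaining quantization error with a truncated SVD so that W ≈ Q(W)·s + U_r V_r. The sketch below is a hedged illustration of that idea with per-row symmetric INT8 quantization, not the paper's exact LoRC procedure:

```python
# Hedged sketch: INT8 weight quantization plus a low-rank correction of the
# quantization error. Illustrative only; not ZeroQuant-V2's exact algorithm.
import torch

def quantize_int8_per_row(w):
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0       # one scale per output row
    q = torch.clamp(torch.round(w / scale), -128, 127)
    return q, scale

def low_rank_compensation(w, rank=16):
    q, scale = quantize_int8_per_row(w)
    err = w - q * scale                                      # quantization error
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]                             # rank-r factors of the error
    v_r = vh[:rank, :]
    return q, scale, u_r, v_r                                # W ≈ q*scale + u_r @ v_r

w = torch.randn(1024, 1024)
q, scale, u_r, v_r = low_rank_compensation(w)
print("error without compensation:", (w - q * scale).norm().item())
print("error with compensation:   ", (w - (q * scale + u_r @ v_r)).norm().item())
```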

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

1 code implementation • 19 Jul 2023 • Xiaoxia Wu, Zhewei Yao, Yuxiong He

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge.

Computational Efficiency Quantization

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

2 code implementations • 25 Jan 2024 • Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference.

Quantization

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

2 code implementations • 25 Sep 2023 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He

Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms.

Language Modelling

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

6 code implementations • NeurIPS 2018 • Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, Michael W. Mahoney

Extensive experiments on multiple networks show that saddle points are not the cause of the generalization gap in large batch size training, and the results consistently show that large batch training converges to points with a noticeably higher Hessian spectrum.

PyHessian: Neural Networks Through the Lens of the Hessian

2 code implementations • 16 Dec 2019 • Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney

To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks.
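The workhorse behind this kind of analysis is the Hessian-vector product obtained by double backpropagation; with it, the top Hessian eigenvalue can be estimated by power iteration. The following is a hedged sketch on a toy model, not PyHessian's actual API (the library also provides the Hessian trace and the full eigenvalue spectral density):

```python
# Hedged sketch: top Hessian eigenvalue of the loss via power iteration on
# Hessian-vector products (Pearlmutter's trick). Toy model and data.
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
loss = torch.nn.functional.cross_entropy(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep graph for 2nd derivatives

v = [torch.randn_like(p) for p in params]                      # random unit start vector
norm = torch.sqrt(sum((vi * vi).sum() for vi in v))
v = [vi / norm for vi in v]

eig = None
for _ in range(20):                                            # power iteration
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    eig = sum((hi * vi).sum() for hi, vi in zip(hv, v)).item()  # Rayleigh quotient v^T H v
    norm = torch.sqrt(sum((hi * hi).sum() for hi in hv))
    v = [hi / norm for hi in hv]

print("estimated top Hessian eigenvalue:", eig)
```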

HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision

1 code implementation • ICCV 2019 • Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer

Another challenge is a similar factorial complexity for determining block-wise fine-tuning order when quantizing the model to a target precision.

Quantization
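The combinatorial search mentioned above is typically avoided by ranking blocks with a Hessian-based sensitivity score and assigning precision greedily. The sketch below is only an illustrative heuristic with made-up numbers (sensitivity per parameter, drop the least sensitive blocks to 4 bits until an average-bit budget is met); it is not HAWQ's exact criterion:

```python
# Hedged sketch: greedy mixed-precision assignment driven by a per-block
# Hessian sensitivity score. Illustrative numbers and rule, not HAWQ's.
blocks = [  # (name, top Hessian eigenvalue, parameter count)
    ("block1", 120.0, 1.2e6),
    ("block2",  15.0, 2.4e6),
    ("block3",   3.5, 4.8e6),
    ("block4",   0.9, 4.8e6),
]

budget_bits = 6.0 * sum(n for _, _, n in blocks)        # target: 6 bits/parameter on average
assignment = {name: 8 for name, _, _ in blocks}         # start everything at 8 bits

# Drop the least sensitive blocks (smallest eigenvalue per parameter) to 4 bits
# until the bit budget is satisfied.
for name, eig, n in sorted(blocks, key=lambda b: b[1] / b[2]):
    if sum(assignment[m] * n_m for m, _, n_m in blocks) <= budget_bits:
        break
    assignment[name] = 4

print(assignment)   # e.g. {'block1': 8, 'block2': 8, 'block3': 4, 'block4': 4}
```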

HAWQV3: Dyadic Neural Network Quantization

1 code implementation • 20 Nov 2020 • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer

Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.

Model Compression Quantization
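The "dyadic" in the title refers to integer-only inference: a floating-point requantization scale is approximated by a dyadic number m / 2^s, so rescaling an INT32 accumulator needs only an integer multiply and a bit shift, with no conversion back to floating point. A hedged sketch of that idea (illustrative, not the paper's kernels):

```python
# Hedged sketch: dyadic requantization. The real-valued scale (s_w * s_x / s_out)
# is approximated by m / 2**s so the rescale is integer-only.
def to_dyadic(scale, shift_bits=15):
    m = round(scale * (1 << shift_bits))        # integer multiplier
    return m, shift_bits                        # scale ≈ m / 2**shift_bits

def requantize(acc_int32, m, s):
    return (acc_int32 * m) >> s                 # integer multiply + right shift

scale = 0.000731                                # would normally be a float at runtime
m, s = to_dyadic(scale)
acc = 91_234                                    # INT32 accumulator from an INT8 matmul
print(requantize(acc, m, s), "vs float reference", acc * scale)
```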

How Much Can CLIP Benefit Vision-and-Language Tasks?

4 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

Ranked #4 on Vision and Language Navigation on RxR (using extra training data)

Question Answering Vision and Language Navigation +2

ZeroQ: A Novel Zero Shot Quantization Framework

3 code implementations • CVPR 2020 • Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5% of one epoch of training time of ResNet50 on ImageNet).

Ranked #1 on Data Free Quantization on CIFAR10 (CIFAR-10 W8A8 Top-1 Accuracy metric)

Data Free Quantization Neural Network Compression

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

3 code implementations • 1 Jun 2020 • Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney

We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN.

BIG-bench Machine Learning Second-order methods +1
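A hedged sketch of the core loop described above: the Hessian diagonal is estimated with Hutchinson's method, diag(H) ≈ E[z ⊙ (Hz)] for Rademacher z, and used in place of Adam's squared-gradient statistic. Bias correction and the paper's spatial averaging of the diagonal are omitted; this is not the released optimizer:

```python
# Hedged sketch of an ADAHESSIAN-style update on a toy least-squares problem.
import torch

model = torch.nn.Linear(10, 1)
data, target = torch.randn(256, 10), torch.randn(256, 1)
params = list(model.parameters())
m = [torch.zeros_like(p) for p in params]       # first moment (gradients)
v = [torch.zeros_like(p) for p in params]       # second moment (Hessian diagonal)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for step in range(100):
    loss = torch.nn.functional.mse_loss(model(data), target)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    z = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]   # Rademacher probes
    hz = torch.autograd.grad(grads, params, grad_outputs=z)         # Hessian-vector products
    with torch.no_grad():
        for i, p in enumerate(params):
            d = (z[i] * hz[i]).abs()              # Hutchinson estimate of |diag(H)|
            m[i] = b1 * m[i] + (1 - b1) * grads[i]
            v[i] = b2 * v[i] + (1 - b2) * d * d
            p -= lr * m[i] / (v[i].sqrt() + eps)  # Adam-style step (no bias correction)

print("final loss:", loss.item())
```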

Trust Region Based Adversarial Attack on Neural Networks

2 code implementations • CVPR 2019 • Zhewei Yao, Amir Gholami, Peng Xu, Kurt Keutzer, Michael Mahoney

To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently.

Adversarial Attack
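A hedged sketch of a trust-region-style attack in the generic sense: take the maximizer of the linearized loss inside an L2 ball, then grow or shrink the ball depending on how well the linear model predicted the actual loss change. This is illustrative and not the paper's exact formulation:

```python
# Hedged sketch: a first-order, trust-region-flavored adversarial attack.
import torch
import torch.nn.functional as F

def trust_region_attack(model, x, y, steps=20, radius=0.05, total_eps=1.0):
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            step = radius * grad / (grad.norm() + 1e-12)            # best step of the linear model
            predicted_gain = (grad * step).sum()
            actual_gain = F.cross_entropy(model(x_adv + step), y) - loss
            ratio = actual_gain / (predicted_gain + 1e-12)
            radius = radius * 2.0 if ratio > 0.5 else radius / 2.0  # adapt the trust region
            delta = x_adv + step - x
            if delta.norm() > total_eps:                            # keep the total perturbation small
                delta = delta * (total_eps / delta.norm())
            x_adv = (x + delta).detach()
    return x_adv

# Toy usage: an untrained linear classifier on a random "image".
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x, y = torch.randn(1, 1, 28, 28), torch.tensor([3])
x_adv = trust_region_attack(model, x, y)
```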

PowerNorm: Rethinking Batch Normalization in Transformers

1 code implementation • ICML 2020 • Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass.

Machine Translation
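A hedged sketch of the forward pass just described: no mean subtraction, normalization by a running quadratic mean, and a learnable scale/shift. The paper's approximate backward pass through the running statistic is not reproduced here:

```python
# Hedged sketch of a PowerNorm-style forward pass. Simplified; the official
# implementation also backpropagates approximately through the running statistic.
import torch

class PowerNormSketch(torch.nn.Module):
    def __init__(self, dim, alpha=0.9, eps=1e-5):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(dim))
        self.beta = torch.nn.Parameter(torch.zeros(dim))
        self.register_buffer("running_psi", torch.ones(dim))   # running quadratic mean
        self.alpha, self.eps = alpha, eps

    def forward(self, x):                                      # x: (batch, seq, dim)
        if self.training:
            psi = (x * x).mean(dim=(0, 1))                     # batch quadratic mean, no mean subtraction
            self.running_psi = self.alpha * self.running_psi + (1 - self.alpha) * psi.detach()
        x_hat = x / torch.sqrt(self.running_psi + self.eps)    # normalize with the running statistic
        return self.gamma * x_hat + self.beta

pn = PowerNormSketch(dim=512)
out = pn(torch.randn(8, 128, 512))
```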

ANODEV2: A Coupled Neural ODE Framework

1 code implementation • NeurIPS 2019 • Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, George Biros

It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).
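The observation is simply that a residual block is one explicit Euler step of an ODE with unit step size:

$$x_{k+1} = x_k + f(x_k, \theta_k) \quad\Longleftrightarrow\quad \frac{dx}{dt} = f(x(t), \theta(t)), \qquad x_{k+1} \approx x_k + \Delta t\, f(x_k, \theta_k), \;\; \Delta t = 1.$$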

Large batch size training of neural networks with adversarial training and second-order information

1 code implementation • ICLR 2019 • Zhewei Yao, Amir Gholami, Daiyaan Arfeen, Richard Liaw, Joseph Gonzalez, Kurt Keutzer, Michael Mahoney

Our method exceeds the performance of existing solutions in terms of both accuracy and the number of SGD iterations (up to 1% and 5×, respectively).

Second-order methods

Improving Semi-supervised Federated Learning by Reducing the Gradient Diversity of Models

1 code implementation • 26 Aug 2020 • Zhengming Zhang, Yaoqing Yang, Zhewei Yao, Yujun Yan, Joseph E. Gonzalez, Michael W. Mahoney

Replacing BN with the recently-proposed Group Normalization (GN) can reduce gradient diversity and improve test accuracy.

Federated Learning

Hessian-Aware Pruning and Optimal Neural Implant

1 code implementation • 22 Jan 2021 • Shixing Yu, Zhewei Yao, Amir Gholami, Zhen Dong, Sehoon Kim, Michael W. Mahoney, Kurt Keutzer

To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric for structured pruning.

LEAP: Learnable Pruning for Transformer-based Models

1 code implementation • 30 May 2021 • Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He

Moreover, in order to reduce hyperparameter tuning, a novel adaptive regularization coefficient is deployed to control the regularization penalty adaptively.

QQP

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

1 code implementation • 5 Mar 2024 • Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang Wang

To address this problem, this paper introduces Multi-scale Positional Encoding (Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle the relevant information located in the middle of the context, without fine-tuning or introducing any additional overhead.

Language Modelling

What's Hidden in a One-layer Randomly Weighted Transformer?

1 code implementation • 8 Sep 2021 • Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael W. Mahoney

Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14.

Machine Translation Translation
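A common recipe for finding such subnetworks (hedged: this is the generic "supermask" idea, not necessarily the paper's exact training setup) is to freeze the random weights and learn only a score per weight, keeping the top-scored fraction in the forward pass with a straight-through gradient:

```python
# Hedged sketch: a linear layer with frozen random weights and a learned
# binary "supermask" over them (straight-through gradient for the hard top-k).
import torch
import torch.nn.functional as F

class SupermaskLinear(torch.nn.Module):
    def __init__(self, in_f, out_f, keep_ratio=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_f, in_f) * 0.1, requires_grad=False)
        self.scores = torch.nn.Parameter(torch.randn(out_f, in_f) * 0.01)  # only these are trained
        self.keep_ratio = keep_ratio

    def forward(self, x):
        k = int(self.scores.numel() * self.keep_ratio)
        thresh = self.scores.flatten().kthvalue(self.scores.numel() - k + 1).values
        hard = (self.scores >= thresh).float()                  # binary mask over random weights
        mask = hard + self.scores - self.scores.detach()        # straight-through estimator
        return F.linear(x, self.weight * mask)

layer = SupermaskLinear(128, 64)
y = layer(torch.randn(4, 128))          # weights stay random; only the mask scores learn
```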

What’s Hidden in a One-layer Randomly Weighted Transformer?

1 code implementation • EMNLP 2021 • Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael Mahoney

Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14.

Machine Translation Translation

Shallow Neural Networks for Fluid Flow Reconstruction with Limited Sensors

1 code implementation • 20 Feb 2019 • N. Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L. Brunton, Michael W. Mahoney, J. Nathan Kutz

In many applications, it is important to reconstruct a fluid flow field, or some other high-dimensional state, from limited measurements and limited data.

JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks

1 code implementation • 7 Apr 2019 • N. Benjamin Erichson, Zhewei Yao, Michael W. Mahoney

To complement these approaches, we propose a very simple and inexpensive strategy which can be used to "retrofit" a previously-trained network to improve its resilience to adversarial attacks.
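JumpReLU itself is a thresholded ReLU, J_kappa(x) = x if x > kappa and 0 otherwise; the retrofit amounts to swapping it in for ReLU in an already-trained network and choosing the threshold. A minimal sketch (the threshold value here is arbitrary):

```python
# Hedged sketch: JumpReLU activation and a "retrofit" that swaps it in for the
# ReLUs of a pretrained model. The threshold kappa is an arbitrary example value.
import torch

class JumpReLU(torch.nn.Module):
    def __init__(self, kappa=0.5):
        super().__init__()
        self.kappa = kappa

    def forward(self, x):
        return x * (x > self.kappa).to(x.dtype)    # zero out activations below kappa

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
for name, module in model.named_children():
    if isinstance(module, torch.nn.ReLU):
        setattr(model, name, JumpReLU(kappa=0.5))  # drop-in replacement, no retraining
```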

BiFeat: Supercharge GNN Training via Graph Feature Quantization

1 code implementation • 29 Jul 2022 • Yuxin Ma, Ping Gong, Jun Yi, Zhewei Yao, Cheng Li, Yuxiong He, Feng Yan

We identify the main accuracy impact factors in graph feature quantization and theoretically prove that BiFeat training converges to a network where the loss is within $\epsilon$ of the optimal loss of the uncompressed network.

Quantization

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent

no code implementations • 30 Nov 2018 • Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, Joseph Gonzalez

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique.

Image Classification Image Segmentation +2

Parameter Re-Initialization through Cyclical Batch Size Schedules

no code implementations • 4 Dec 2018 • Norman Mu, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney

We demonstrate the ability of our method to improve language modeling performance by up to 7.91 perplexity and reduce training iterations by up to 61%, in addition to its flexibility in enabling snapshot ensembling and use with adversarial training.

General Classification Image Classification +2

Inefficiency of K-FAC for Large Batch Size Training

no code implementations • 14 Mar 2019 • Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney

In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch.

Stochastic Optimization

Residual Networks as Nonlinear Systems: Stability Analysis using Linearization

no code implementations • 31 May 2019 • Kai Rothauge, Zhewei Yao, Zixi Hu, Michael W. Mahoney

We regard pre-trained residual networks (ResNets) as nonlinear systems and use linearization, a common method used in the qualitative analysis of nonlinear systems, to understand the behavior of the networks under small perturbations of the input images.

ANODEV2: A Coupled Neural ODE Evolution Framework

no code implementations • 10 Jun 2019 • Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, Michael Mahoney

It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).

A Survey of Quantization Methods for Efficient Neural Network Inference

no code implementations • 25 Mar 2021 • Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer

Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks.

Efficient Neural Network Quantization

Scaling Vision-Language Models with Sparse Mixture of Experts

no code implementations • 13 Mar 2023 • Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs).

Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important?

no code implementations • 16 May 2023 • Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He

This study examines the impact of optimizing the Stable Diffusion (SD) guided inference pipeline.

Denoising
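The question in the title can be made concrete with a hedged sketch of classifier-free guidance in which the extra (unconditional) forward pass is only run on a chosen subset of denoising steps. The eps_model and scheduler_step callables and the step choice below are placeholders, not Stable Diffusion's real API:

```python
# Hedged sketch: apply classifier-free guidance only on selected denoising steps.
import torch

def selective_guided_denoise(eps_model, scheduler_step, x, cond, uncond,
                             timesteps, guidance_scale=7.5, guided_steps=()):
    guided_steps = set(guided_steps)
    for i, t in enumerate(timesteps):
        eps_cond = eps_model(x, t, cond)
        if i in guided_steps:
            eps_uncond = eps_model(x, t, uncond)                    # extra forward pass
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            eps = eps_cond                                          # skip guidance this step
        x = scheduler_step(eps, t, x)
    return x

# Dummy stand-ins so the sketch runs end to end; e.g. guide only the first 20 of 50 steps.
eps_model = lambda x, t, c: torch.zeros_like(x)
scheduler_step = lambda eps, t, x: x - 0.01 * eps
out = selective_guided_denoise(eps_model, scheduler_step, torch.randn(1, 4, 8, 8),
                               cond=None, uncond=None, timesteps=range(50),
                               guided_steps=range(20))
```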

AI and Memory Wall

no code implementations • 21 Mar 2024 • Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer

The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs.
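A back-of-the-envelope calculation makes the memory wall concrete (illustrative numbers, not figures quoted from the paper): a 175B-parameter model stored in 16-bit precision needs 175e9 × 2 bytes ≈ 350 GB for the weights alone, and at batch size 1 every generated token must stream those weights through the memory system, so bandwidth rather than FLOPs bounds latency:

```python
# Illustrative arithmetic only; parameter count and bandwidth are example values.
params = 175e9                      # GPT-3-scale parameter count
bytes_per_param = 2                 # FP16/BF16
weight_gb = params * bytes_per_param / 1e9
print(f"weights alone: {weight_gb:.0f} GB")                     # ~350 GB

hbm_bandwidth = 3.35e12             # ~3.35 TB/s, an H100-class accelerator (approximate)
t_token = params * bytes_per_param / hbm_bandwidth              # weights read once, batch = 1
print(f"bandwidth-bound lower bound per token: {t_token * 1e3:.0f} ms")
```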
