no code implementations • 16 May 2023 • Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He
This study examines the impact of optimizing the Stable Diffusion (SD) guided inference pipeline.
no code implementations • 15 Mar 2023 • Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
no code implementations • 13 Mar 2023 • Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He
The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs).
1 code implementation • 27 Jan 2023 • Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He
Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost.
1 code implementation • 7 Dec 2022 • Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, Yuxiong He
Recent advances on deep learning models come at the price of formidable training cost.
1 code implementation • 17 Nov 2022 • Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He
Large-scale transformer models have become the de facto architectures for various machine learning applications, e.g., CV and NLP.
1 code implementation • 29 Jul 2022 • Yuxin Ma, Ping Gong, Jun Yi, Zhewei Yao, Cheng Li, Yuxiong He, Feng Yan
We identify the main accuracy impact factors in graph feature quantization and theoretically prove that BiFeat training converges to a network whose loss is within $\epsilon$ of the optimal loss of the uncompressed network.
2 code implementations • 4 Jun 2022 • Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.
1 code implementation • 4 Jun 2022 • Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices.
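For context on what ultra-low-bit quantization does to a weight matrix, here is a minimal threshold-and-scale ternarization sketch; it is illustrative only and not the compression recipe studied in this paper (which also relies on distillation). The function name and the 0.7 threshold ratio are assumptions for the example.

```python
import torch

def ternarize(w: torch.Tensor, delta_ratio: float = 0.7):
    """Quantize a weight tensor to {-alpha, 0, +alpha} (illustrative only).

    Threshold and scale follow a common TWN-style heuristic; the paper's
    extreme-compression pipeline additionally uses distillation.
    """
    delta = delta_ratio * w.abs().mean()          # sparsity threshold
    mask = (w.abs() > delta).float()              # which weights survive
    alpha = (w.abs() * mask).sum() / mask.sum()   # per-tensor scale
    return alpha * torch.sign(w) * mask

w = torch.randn(768, 768)
w_t = ternarize(w)
print(w_t.unique().numel())  # typically 3 distinct values: -alpha, 0, +alpha
```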
1 code implementation • 14 Jan 2022 • Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model.
1 code implementation • 8 Sep 2021 • Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael W. Mahoney
Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14.
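As a rough illustration of how a subnetwork can be selected inside fixed random weights, the sketch below learns a score per weight and keeps the top fraction at each forward pass via a straight-through estimator (edge-popup-style). It is a hedged sketch of the general idea, not this paper's training setup; the class name and keep ratio are invented for the example.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights and a learned binary mask
    (edge-popup-style sketch; not the paper's exact training setup)."""
    def __init__(self, d_in, d_out, keep_ratio=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5,
                                   requires_grad=False)     # weights stay random
        self.scores = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.keep_ratio = keep_ratio

    def forward(self, x):
        k = int(self.scores.numel() * self.keep_ratio)
        # threshold = k-th largest score; weights above it are kept
        threshold = self.scores.flatten().kthvalue(self.scores.numel() - k + 1).values
        mask = (self.scores >= threshold).float()
        # straight-through estimator: gradients flow to the scores, not the mask
        mask = mask + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask)
```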
4 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
Ranked #4 on Vision and Language Navigation on RxR (using extra training data)
1 code implementation • 30 May 2021 • Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He
Moreover, in order to reduce hyperparameter tuning, a novel adaptive regularization coefficient is deployed to control the regularization penalty adaptively.
2 code implementations • 29 Apr 2021 • Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael W. Mahoney, Joseph E. Gonzalez
On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss.
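To make the "2 bits on average" claim concrete, here is a toy per-group min-max quantizer with stochastic rounding for saved activations; it sketches the general idea of activation compression rather than ActNN's actual implementation, and the group size and function names are assumptions.

```python
import torch

def compress_2bit(x: torch.Tensor, group_size: int = 256):
    """Per-group 2-bit min-max quantization with stochastic rounding
    (toy sketch of activation compression, not ActNN's code).
    Assumes x.numel() is divisible by group_size."""
    flat = x.reshape(-1, group_size)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / 3          # 2 bits -> 4 levels: 0..3
    q = ((flat - lo) / scale + torch.rand_like(flat)).floor().clamp(0, 3)
    return q.to(torch.uint8), lo, scale            # store q compactly, keep lo/scale

def decompress_2bit(q, lo, scale, shape):
    return (q.float() * scale + lo).reshape(shape)
```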
1 code implementation • 31 Mar 2021 • Sehoon Kim, Amir Gholami, Zhewei Yao, Nicholas Lee, Patrick Wang, Aniruddha Nrusimha, Bohan Zhai, Tianren Gao, Michael W. Mahoney, Kurt Keutzer
End-to-end neural network models achieve improved performance on various automatic speech recognition (ASR) tasks.
Automatic Speech Recognition (ASR)
no code implementations • 25 Mar 2021 • Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks.
1 code implementation • 22 Jan 2021 • Shixing Yu, Zhewei Yao, Amir Gholami, Zhen Dong, Sehoon Kim, Michael W. Mahoney, Kurt Keutzer
To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric for structured pruning.
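One standard way to obtain a second-order sensitivity signal per channel is a Hutchinson estimate of the Hessian trace restricted to that channel's parameters. The sketch below shows that primitive; it is not HAP's exact pruning criterion, and the function name and sample count are placeholders.

```python
import torch

def channel_hessian_trace(loss, weight, n_samples=8):
    """Hutchinson estimate of the Hessian trace per output channel
    (illustrative second-order sensitivity signal, not HAP's exact score)."""
    grad = torch.autograd.grad(loss, weight, create_graph=True)[0]
    traces = torch.zeros(weight.shape[0])
    for _ in range(n_samples):
        v = torch.randint_like(weight, high=2) * 2.0 - 1.0           # Rademacher +-1
        hv = torch.autograd.grad((grad * v).sum(), weight, retain_graph=True)[0]
        traces += (v * hv).reshape(weight.shape[0], -1).sum(dim=1)
    return traces / n_samples   # larger trace -> channel is more sensitive to pruning
```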
3 code implementations • 5 Jan 2021 • Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks.
Natural Language Inference • Natural Language Understanding
1 code implementation • 20 Nov 2020 • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
1 code implementation • NeurIPS 2020 • Jianfei Chen, Yu Gai, Zhewei Yao, Michael W. Mahoney, Joseph E. Gonzalez
We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance.
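The unbiasedness argument hinges on stochastic rounding, for which the expected dequantized value equals the input. A minimal illustration of that property (not the paper's full fully-quantized-training framework):

```python
import torch

def stochastic_round_quantize(x, num_bits=8):
    """Uniform quantizer with stochastic rounding, so E[dequantized] == x.
    Minimal illustration of the unbiasedness argument only."""
    qmax = 2 ** num_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp_min(1e-8) / qmax
    y = (x - lo) / scale
    q = torch.floor(y + torch.rand_like(y)).clamp(0, qmax)   # stochastic rounding
    return q * scale + lo

x = torch.randn(1000)
est = torch.stack([stochastic_round_quantize(x) for _ in range(500)]).mean(0)
print((est - x).abs().mean())   # shrinks as the number of rounding samples grows
```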
1 code implementation • EMNLP 2020 • Qinxin Wang, Hao Tan, Sheng Shen, Michael W. Mahoney, Zhewei Yao
Phrase localization is a task that studies the mapping from textual phrases to regions of an image.
1 code implementation • 26 Aug 2020 • Zhengming Zhang, Yaoqing Yang, Zhewei Yao, Yujun Yan, Joseph E. Gonzalez, Michael W. Mahoney
Replacing BN with the recently-proposed Group Normalization (GN) can reduce gradient diversity and improve test accuracy.
3 code implementations • 1 Jun 2020 • Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael W. Mahoney
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN.
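The core primitive here is a Hutchinson-style estimate of the Hessian diagonal obtained from Hessian-vector products, which then plays the role of Adam's second moment. A minimal sketch of that estimate follows; the released optimizer additionally applies spatial averaging and momentum, which are omitted, and the function name is an assumption.

```python
import torch

def hutchinson_hessian_diag(loss, params, n_samples=1):
    """Hutchinson estimate of the Hessian diagonal: E[v * (Hv)] with Rademacher v.
    Sketch of AdaHessian's core primitive; the optimizer itself adds spatial
    averaging of the estimate plus Adam-style momentum and bias correction."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diags = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]   # +-1
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        for d, v, hv in zip(diags, vs, hvs):
            d += v * hv / n_samples
    return diags   # used like Adam's second moment: p -= lr * g / (sqrt(|D|) + eps)
```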
1 code implementation • ICML 2020 • Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass.
Ranked #12 on Machine Translation on WMT2014 English-German
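A stripped-down sketch of the normalization itself, dividing by a running quadratic mean with no mean subtraction, is given below; the approximate backward pass through the running statistic described in the abstract is not reproduced, and the module name and momentum value are assumptions.

```python
import torch
import torch.nn as nn

class PowerNormSketch(nn.Module):
    """Normalize by a running quadratic mean instead of per-batch mean/variance.
    Stripped-down sketch of the PN idea; the paper's approximate backward pass
    through the running statistic is not reproduced here."""
    def __init__(self, dim, alpha=0.9, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.register_buffer("running_quad", torch.ones(dim))
        self.alpha, self.eps = alpha, eps

    def forward(self, x):                                    # x: (batch, ..., dim)
        if self.training:
            quad = (x ** 2).mean(dim=tuple(range(x.dim() - 1)))   # per-feature E[x^2]
            self.running_quad = (self.alpha * self.running_quad
                                 + (1 - self.alpha) * quad.detach())
        return self.gamma * x / torch.sqrt(self.running_quad + self.eps) + self.beta
```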
3 code implementations • CVPR 2020 • Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5% of the one-epoch training time of ResNet50 on ImageNet).
Ranked #1 on Data Free Quantization on CIFAR-10 (W8A8 Top-1 Accuracy metric)
2 code implementations • 16 Dec 2019 • Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney
To illustrate this, we analyze the effect of residual connections and Batch Normalization layers on the trainability of neural networks.
1 code implementation • NeurIPS 2019 • Tianjun Zhang, Zhewei Yao, Amir Gholami, Joseph E. Gonzalez, Kurt Keutzer, Michael W. Mahoney, George Biros
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).
2 code implementations • NeurIPS 2020 • Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
However, the search space for a mixed-precision quantization is exponential in the number of layers.
no code implementations • 12 Sep 2019 • Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further.
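To illustrate what group-wise quantization means, the toy function below assigns each row group of a weight matrix its own symmetric quantization scale; it is not Q-BERT's exact scheme, and the group count and bit width are placeholder choices.

```python
import torch

def groupwise_quantize(w, num_groups=12, num_bits=4):
    """Symmetric uniform quantization with a separate scale per row group
    (toy version of group-wise quantization; Q-BERT also picks per-layer
    bit widths with a Hessian-based criterion).
    Assumes w.shape[0] is divisible by num_groups."""
    rows_per_group = w.shape[0] // num_groups
    qmax = 2 ** (num_bits - 1) - 1
    out = torch.empty_like(w)
    for g in range(num_groups):
        sl = slice(g * rows_per_group, (g + 1) * rows_per_group)
        scale = w[sl].abs().max() / qmax                 # each group gets its own range
        out[sl] = torch.round(w[sl] / scale).clamp(-qmax, qmax) * scale
    return out
```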
no code implementations • 10 Jun 2019 • Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph Gonzalez, George Biros, Michael Mahoney
It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE).
no code implementations • 31 May 2019 • Kai Rothauge, Zhewei Yao, Zixi Hu, Michael W. Mahoney
We regard pre-trained residual networks (ResNets) as nonlinear systems and use linearization, a common method used in the qualitative analysis of nonlinear systems, to understand the behavior of the networks under small perturbations of the input images.
1 code implementation • ICCV 2019 • Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, Kurt Keutzer
Another challenge is the similarly factorial complexity of determining the block-wise fine-tuning order when quantizing the model to a target precision.
1 code implementation • 7 Apr 2019 • N. Benjamin Erichson, Zhewei Yao, Michael W. Mahoney
To complement these approaches, we propose a very simple and inexpensive strategy which can be used to "retrofit" a previously-trained network to improve its resilience to adversarial attacks.
no code implementations • 14 Mar 2019 • Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney
In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch.
1 code implementation • 20 Feb 2019 • N. Benjamin Erichson, Lionel Mathelin, Zhewei Yao, Steven L. Brunton, Michael W. Mahoney, J. Nathan Kutz
In many applications, it is important to reconstruct a fluid flow field, or some other high-dimensional state, from limited measurements and limited data.
2 code implementations • CVPR 2019 • Zhewei Yao, Amir Gholami, Peng Xu, Kurt Keutzer, Michael Mahoney
To address this problem, we present a new family of trust region based adversarial attacks, with the goal of computing adversarial perturbations efficiently.
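As a hedged first-order sketch of the trust-region idea, the step below maximizes the linearized loss inside an L2 ball and then grows or shrinks the radius depending on how well that linear model predicted the true loss change; the paper's actual attacks also consider second-order models, other norms, and their own adaptation rule, and every name and constant here is assumed.

```python
import torch

def trust_region_attack_step(model, x_adv, y, radius, loss_fn, eta=0.9):
    """One first-order trust-region step for an adversarial perturbation
    (illustrative sketch, not the paper's exact algorithm)."""
    x_adv = x_adv.clone().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    step = radius * grad / (grad.norm() + 1e-12)        # maximizer of g.d inside L2 ball
    candidate = x_adv.detach() + step                   # (clip to valid pixel range in practice)
    with torch.no_grad():
        actual = loss_fn(model(candidate), y) - loss    # true loss increase
    predicted = (grad * step).sum()                     # increase predicted by the linear model
    rho = actual / (predicted + 1e-12)
    radius = radius / 2 if rho < eta else radius * 2    # shrink or grow the trust region
    return candidate, radius
```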
no code implementations • 4 Dec 2018 • Norman Mu, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney
We demonstrate the ability of our method to improve language modeling performance by up to 7.91 perplexity points and reduce training iterations by up to $61\%$, in addition to its flexibility in enabling snapshot ensembling and use with adversarial training.
Ranked #51 on Natural Language Inference on SNLI
no code implementations • 30 Nov 2018 • Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, Joseph Gonzalez
Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique.
1 code implementation • ICLR 2019 • Zhewei Yao, Amir Gholami, Daiyaan Arfeen, Richard Liaw, Joseph Gonzalez, Kurt Keutzer, Michael Mahoney
Our method exceeds the performance of existing solutions in terms of both accuracy and the number of SGD iterations (up to 1\% and $5\times$, respectively).
6 code implementations • NeurIPS 2018 • Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, Michael W. Mahoney
Extensive experiments on multiple networks show that saddle points are not the cause of the generalization gap in large batch size training, and the results consistently show that large batch training converges to points with a noticeably higher Hessian spectrum.
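Statements about the Hessian spectrum of large networks typically rest on power iteration with Hessian-vector products, which never forms the Hessian explicitly; a minimal sketch of that primitive, with assumed names, is below.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iters=20):
    """Power iteration with Hessian-vector products (never forms H explicitly).
    Minimal sketch of the kind of tool used to compare Hessian spectra of
    small-batch vs. large-batch minima."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    vs = [torch.randn_like(p) for p in params]
    for _ in range(n_iters):
        norm = torch.sqrt(sum((v ** 2).sum() for v in vs))
        vs = [v / norm for v in vs]
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        eig = sum((v * hv).sum() for v, hv in zip(vs, hvs))   # Rayleigh quotient
        vs = [hv.detach() for hv in hvs]
    return eig.item()
```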