1 code implementation • 13 Jun 2022 • Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William J. Dally, Brucek Khailany
Data clipping is crucial in reducing noise in quantization operations and improving the achievable accuracy of quantization-aware training (QAT).
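For readers unfamiliar with clipping in quantization, below is a minimal sketch (not the authors' algorithm) of uniform quantization with a clipping threshold, showing how the clip value trades clipping error against rounding error; the grid search over thresholds is an illustrative assumption, not the method from the paper.

```python
import numpy as np

def quantize_clipped(x, clip, num_bits=4):
    """Uniformly quantize x into 2**num_bits levels after clipping to [-clip, clip]."""
    levels = 2 ** num_bits - 1
    step = 2 * clip / levels
    x_clipped = np.clip(x, -clip, clip)
    return np.round(x_clipped / step) * step

# Toy example: pick the clip value that minimizes mean-squared quantization error
# on a heavy-tailed weight tensor (simple grid scan over candidate thresholds).
rng = np.random.default_rng(0)
w = rng.laplace(scale=0.1, size=10000)          # heavy-tailed weights
candidates = np.linspace(0.05, np.abs(w).max(), 50)
errors = [np.mean((w - quantize_clipped(w, c)) ** 2) for c in candidates]
best = candidates[int(np.argmin(errors))]
print(f"best clip {best:.3f}  vs. max |w| {np.abs(w).max():.3f}")
```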
1 code implementation • 10 Mar 2021 • Huizi Mao, Sibo Zhu, Song Han, William J. Dally
Object recognition is a fundamental problem in many video processing tasks; accurately locating previously seen objects at low computation cost paves the way for on-device video recognition.
no code implementations • 8 Feb 2021 • Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William J. Dally, Brucek Khailany
4-bit weights and 8-bit activations achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
no code implementations • 20 Feb 2020 • Zhekai Zhang, Hanrui Wang, Song Han, William J. Dally
We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x.
Hardware Architecture • Distributed, Parallel, and Cluster Computing
1 code implementation • ICCV 2019 • Huizi Mao, Xiaodong Yang, William J. Dally
Average precision (AP) is a widely used metric to evaluate detection accuracy of image and video object detectors.
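As a refresher on the metric itself, here is a minimal sketch of computing average precision from scored detections; it assumes the true/false-positive matching has already been done (so IoU matching is simplified away to a boolean flag) and integrates the interpolated precision-recall curve, which is only one of several common AP conventions.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the interpolated precision-recall curve for one class.

    scores:           confidence of each detection
    is_true_positive: whether each detection matched an unmatched ground truth
    num_gt:           total number of ground-truth objects
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))

print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=4))
```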
1 code implementation • Design Automation Conference (DAC) 2019 • Angad S. Rekhi, Brian Zimmer, Nikola Nedovic, Ningxi Liu, Rangharajan Venkatesan, Miaorong Wang, Brucek Khailany, William J. Dally, C. Thomas Gray
We also introduce an energy model to predict the requirements of high-accuracy AMS hardware running large networks and use it to show that for ADC-dominated designs, there is a direct tradeoff between energy efficiency and network accuracy.
no code implementations • 30 Sep 2018 • Huizi Mao, Taeyoung Kong, William J. Dally
Experiments on the KITTI dataset show that CaTDet reduces operation count by 5.1-8.7x with the same mean Average Precision (mAP) as the single-model Faster R-CNN detector and incurs an additional delay of 0.3 frames.
1 code implementation • ICLR 2018 • Xingyu Liu, Jeff Pool, Song Han, William J. Dally
First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations.
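To make the idea of operating in the transformed domain concrete, here is a minimal 1D Winograd F(2,3) sketch in which the ReLU is applied to the transformed activations rather than the spatial ones; moving the nonlinearity there changes the numerics relative to a standard convolution (which is why the network is retrained), and this toy only illustrates where in the pipeline the ReLU moves.

```python
import numpy as np

# 1D Winograd F(2,3): computes 2 outputs of a 3-tap convolution from a 4-sample tile.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_relu_conv(tile, kernel):
    """Multiply in the Winograd domain, applying ReLU to the *transformed* activations."""
    v = B_T @ tile          # transformed input tile (length 4)
    u = G @ kernel          # transformed 3-tap kernel (length 4)
    v = np.maximum(v, 0.0)  # ReLU moved into the Winograd domain -> sparser transformed tile
    return A_T @ (u * v)    # inverse transform back to 2 spatial outputs

tile = np.array([0.2, -0.5, 0.7, -0.1])
kernel = np.array([0.3, -0.2, 0.5])
print(winograd_relu_conv(tile, kernel))
```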
3 code implementations • ICLR 2018 • Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally
The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections.
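As a rough illustration of the gradient sparsification at the heart of this line of work, here is a sketch of top-k gradient selection with local accumulation of the unsent residual; the full method additionally uses momentum correction, gradient clipping, and warm-up, all omitted here, and the keep ratio is an illustrative assumption.

```python
import numpy as np

def sparsify_gradient(grad, residual, keep_ratio=0.01):
    """Send only the largest-magnitude entries; accumulate the rest locally."""
    acc = residual + grad                        # add this step's gradient to the residual
    k = max(1, int(keep_ratio * acc.size))
    thresh = np.partition(np.abs(acc).ravel(), -k)[-k]
    mask = np.abs(acc) >= thresh
    sent = np.where(mask, acc, 0.0)              # sparse update communicated to other workers
    new_residual = np.where(mask, 0.0, acc)      # unsent values stay local for later steps
    return sent, new_residual

rng = np.random.default_rng(0)
residual = np.zeros((256, 256))
for step in range(3):
    grad = rng.normal(scale=1e-3, size=(256, 256))
    sent, residual = sparsify_gradient(grad, residual)
    print(step, "nonzeros sent:", int(np.count_nonzero(sent)))
```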
no code implementations • 24 May 2017 • Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally
Since memory reference is more than two orders of magnitude more expensive than arithmetic operations, the regularity of sparse structure leads to more efficient hardware design.
no code implementations • 23 May 2017 • Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, William J. Dally
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning.
6 code implementations • 4 Dec 2016 • Chenzhuo Zhu, Song Han, Huizi Mao, William J. Dally
To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values.
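A minimal sketch of the ternarization step follows, assuming a fixed magnitude threshold and two per-layer scaling factors; in TTQ proper those scales are trained by backpropagation, whereas here they are simply set to the mean magnitude of the weights they replace.

```python
import numpy as np

def ternarize(w, t=0.05):
    """Map full-precision weights to {-Wn, 0, +Wp} using a magnitude threshold."""
    delta = t * np.abs(w).max()                # threshold below which weights become zero
    pos, neg = w > delta, w < -delta
    wp = w[pos].mean() if pos.any() else 0.0   # stand-in for the trained positive scale
    wn = -w[neg].mean() if neg.any() else 0.0  # stand-in for the trained negative scale
    return wp * pos.astype(float) - wn * neg.astype(float), wp, wn

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64))
w_t, wp, wn = ternarize(w)
print("Wp =", round(wp, 4), "Wn =", round(wn, 4),
      "zero fraction =", round(float(np.mean(w_t == 0)), 3))
```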
no code implementations • 1 Dec 2016 • Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally
Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations.
2 code implementations • 15 Jul 2016 • Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally
We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance.
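To convey the control flow of the dense-sparse-dense schedule, here is a toy sketch on a linear-regression model trained with plain gradient descent: train dense, prune by magnitude and retrain under the mask, then lift the mask and retrain densely from the sparse solution. The pruning ratio and step sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
true_w = rng.normal(size=20) * (rng.random(20) < 0.5)   # sparse ground truth
y = X @ true_w + 0.01 * rng.normal(size=500)

def train(w, mask, steps=300, lr=0.05):
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask              # masked-out weights stay at zero
    return w

w = np.zeros(20)
w = train(w, np.ones(20))                           # Dense: learn an initial solution
mask = np.abs(w) >= np.quantile(np.abs(w), 0.5)     # Sparse: prune the smallest 50%
w = train(w * mask, mask.astype(float))             #         retrain surviving weights
w = train(w, np.ones(20))                           # Dense: free pruned weights, retrain all
print("final error:", float(np.mean((X @ w - y) ** 2)))
```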
58 code implementations • 24 Feb 2016 • Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer
(2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car.
Ranked #1 on Image Classification on ImageNet-P
4 code implementations • 4 Feb 2016 • Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally
EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW.
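The core computation EIE accelerates is a sparse matrix-vector product in which zero activations are skipped entirely and weights are stored compressed; the following is a software sketch over a CSC (compressed sparse column) matrix, not a model of the hardware's processing elements, relative indexing, or weight-sharing codebook.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def spmv_skip_zero_activations(W_csc, a):
    """y = W @ a, touching only columns where the activation a[j] is nonzero."""
    y = np.zeros(W_csc.shape[0])
    for j in np.nonzero(a)[0]:                   # skip zero activations entirely
        start, end = W_csc.indptr[j], W_csc.indptr[j + 1]
        rows = W_csc.indices[start:end]          # positions of nonzero weights in column j
        y[rows] += W_csc.data[start:end] * a[j]  # accumulate only the nonzero weights
    return y

rng = np.random.default_rng(0)
W = sparse_random(128, 256, density=0.1, random_state=0, format="csc")
a = np.where(rng.random(256) < 0.3, rng.normal(size=256), 0.0)   # mostly-zero activations
print(np.allclose(spmv_skip_zero_activations(W, a), W @ a))
```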
15 code implementations • 1 Oct 2015 • Song Han, Huizi Mao, William J. Dally
To address this limitation, we introduce "deep compression", a three-stage pipeline: pruning, trained quantization, and Huffman coding, which work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
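To make the three stages concrete, here is a toy sketch of the pipeline on a single weight matrix: magnitude pruning, weight sharing via a small k-means codebook over the surviving weights, and an entropy estimate of the coded indices as a stand-in for the Huffman stage. The pruning ratio and codebook size are illustrative assumptions, not the paper's per-layer settings.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(256, 256))

# Stage 1: magnitude pruning (keep the largest 10% of weights, an illustrative ratio).
thresh = np.quantile(np.abs(W), 0.9)
mask = np.abs(W) >= thresh
survivors = W[mask]

# Stage 2: weight sharing approximated by a 16-entry k-means codebook (4-bit indices).
codebook, labels = kmeans2(survivors.reshape(-1, 1), 16, minit="points")
W_shared = np.zeros_like(W)
W_shared[mask] = codebook[labels, 0]

# Stage 3: Huffman coding approximated by the entropy of the codebook indices.
freq = np.bincount(labels, minlength=16) / labels.size
entropy_bits = -np.sum(freq[freq > 0] * np.log2(freq[freq > 0]))
print(f"kept {mask.mean():.0%} of weights, ~{entropy_bits:.2f} bits/index after coding")
```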
8 code implementations • NeurIPS 2015 • Song Han, Jeff Pool, John Tran, William J. Dally
On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss.