no code implementations • ECCV 2020 • Zhijian Liu, Zhanghao Wu, Chuang Gan, Ligeng Zhu, Song Han
Third, our solution is efficient on the edge, since the majority of the workload is delegated to the cloud, and our mixing and de-mixing processes introduce very little extra computation.
1 code implementation • 3 May 2022 • Yihan Wang, Muyang Li, Han Cai, Wei-Ming Chen, Song Han
Inspired by this finding, we design LitePose, an efficient single-branch architecture for pose estimation, and introduce two simple approaches to enhance the capacity of LitePose, including Fusion Deconv Head and Large Kernel Convs.
no code implementations • 25 Apr 2022 • Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing and speech recognition.
no code implementations • 25 Apr 2022 • Zhijian Liu, Haotian Tang, Shengyu Zhao, Kevin Shao, Song Han
3D neural networks are widely used in real-world applications (e.g., AR/VR headsets, self-driving cars).
1 code implementation • 21 Apr 2022 • Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, Song Han
TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement.
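For readers unfamiliar with why sparse convolution is irregular, the NumPy sketch below shows the generic gather-GEMM-scatter pattern on a set of active voxels: a coordinate hash map is probed per kernel offset, matching features are gathered, multiplied by that offset's weight slice, and scatter-added into the outputs. This is a toy illustration of the computation TorchSparse optimizes, not the library's actual implementation; the submanifold assumption (outputs live on the same active coordinates) and all shapes are choices made here for brevity.

```python
import numpy as np

def sparse_conv3d(coords, feats, weight):
    # coords: (N, 3) integer voxel coordinates; feats: (N, C_in) features;
    # weight: (K, K, K, C_in, C_out). Outputs are computed on the same
    # active coordinates as the inputs ("submanifold" style).
    K, C_out = weight.shape[0], weight.shape[-1]
    center = K // 2
    table = {tuple(c): i for i, c in enumerate(coords)}   # coord -> row index
    out = np.zeros((len(coords), C_out), dtype=feats.dtype)
    for dx in range(K):
        for dy in range(K):
            for dz in range(K):
                in_idx, out_idx = [], []
                for o, c in enumerate(coords):
                    nb = (c[0] + dx - center, c[1] + dy - center, c[2] + dz - center)
                    if nb in table:                         # active neighbor?
                        in_idx.append(table[nb])
                        out_idx.append(o)
                if not in_idx:
                    continue
                gathered = feats[in_idx]                    # gather
                partial = gathered @ weight[dx, dy, dz]     # GEMM
                np.add.at(out, out_idx, partial)            # scatter-add
    return out

coords = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 1], [3, 3, 3]])
feats = np.random.randn(4, 8).astype(np.float32)
weight = np.random.randn(3, 3, 3, 8, 16).astype(np.float32)
print(sparse_conv3d(coords, feats, weight).shape)  # (4, 16)
```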
2 code implementations • 26 Feb 2022 • Hanrui Wang, Zirui Li, Jiaqi Gu, Yongshan Ding, David Z. Pan, Song Han
Nevertheless, we find that due to the significant quantum errors (noise) on real machines, gradients obtained from the naive parameter shift have low fidelity and thus degrade the training accuracy.
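As a point of reference for the parameter-shift rule discussed above, here is a minimal NumPy sketch on a toy single-parameter expectation value f(θ) = cos(θ), with Gaussian noise standing in for hardware noise and simple averaging over repeated evaluations. The noise model and shot count are illustrative assumptions, not the paper's on-chip training protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_expectation(theta, noise=0.05):
    # Toy expectation value <Z> = cos(theta) of an RY(theta) rotation,
    # with Gaussian noise standing in for hardware/shot noise.
    return np.cos(theta) + rng.normal(0.0, noise)

def parameter_shift_grad(theta, repeats=32):
    # For such rotation gates: df/dtheta = [f(theta + pi/2) - f(theta - pi/2)] / 2.
    plus = np.mean([noisy_expectation(theta + np.pi / 2) for _ in range(repeats)])
    minus = np.mean([noisy_expectation(theta - np.pi / 2) for _ in range(repeats)])
    return 0.5 * (plus - minus)

theta = 0.3
print("estimated:", parameter_shift_grad(theta))
print("analytic: ", -np.sin(theta))
```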
1 code implementation • 27 Dec 2021 • Nhuong Nguyen, Song Han
In this paper, we propose an Asynchronous Event-triggered Stochastic Gradient Descent (SGD) framework, called AET-SGD, to i) reduce the communication cost among the compute nodes, and ii) mitigate the impact of the delay.
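A generic event-triggered communication rule can be sketched as follows: a worker keeps taking local SGD steps and only synchronizes with the server when its accumulated local change exceeds a threshold. The single-process simulation, loss, and trigger condition below are illustrative assumptions, not the AET-SGD algorithm as published.

```python
import numpy as np

def loss_grad(w, x, y):
    # Gradient of 0.5 * (w.x - y)^2 for one linear-regression sample.
    return (w @ x - y) * x

rng = np.random.default_rng(0)
d, lr, threshold = 5, 0.05, 0.5
w_true = rng.normal(size=d)
w_server = np.zeros(d)
w_local = w_server.copy()

sent = 0
for step in range(200):
    x = rng.normal(size=d)
    y = w_true @ x
    w_local -= lr * loss_grad(w_local, x, y)              # local SGD step
    if np.linalg.norm(w_local - w_server) > threshold:    # event trigger
        w_server = w_local.copy()                         # "communicate"
        sent += 1

print(f"communicated {sent}/200 rounds, error {np.linalg.norm(w_server - w_true):.3f}")
```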
no code implementations • NeurIPS 2021 • Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, Song Han
Federated Learning is an emerging direction in distributed machine learning that enables jointly training a model without sharing the data.
no code implementations • NeurIPS 2021 • Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han
We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.
no code implementations • 23 Nov 2021 • Alexander Amini, Tsun-Hsuan Wang, Igor Gilitschenski, Wilko Schwarting, Zhijian Liu, Song Han, Sertac Karaman, Daniela Rus
Simulation has the potential to transform the development of robust algorithms for mobile agents deployed in safety-critical scenarios.
no code implementations • 28 Oct 2021 • Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han
We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.
2 code implementations • 21 Oct 2021 • Hanrui Wang, Jiaqi Gu, Yongshan Ding, Zirui Li, Frederic T. Chong, David Z. Pan, Song Han
Furthermore, to improve the robustness against noise, we propose noise injection to the training process by inserting quantum error gates to PQC according to realistic noise models of quantum hardware.
no code implementations • ICLR 2022 • Han Cai, Chuang Gan, Ji Lin, Song Han
We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks.
no code implementations • 29 Sep 2021 • Wei Shi, Hanrui Wang, Jiaqi Gu, Mingjie Liu, David Z. Pan, Song Han, Nan Sun
Specifically, circuit optimizations under different variations are considered as a set of tasks.
no code implementations • 29 Sep 2021 • Hanrui Wang, Zirui Li, Jiaqi Gu, Yongshan Ding, David Z. Pan, Song Han
The results demonstrate that our on-chip training achieves over 90% and 60% accuracy for 2-class and 4-class image classification tasks, respectively.
1 code implementation • 27 Sep 2021 • Ji Lin, Chuang Gan, Kuan Wang, Song Han
Secondly, TSM has high efficiency: it achieves frame rates of 74 fps on Jetson Nano and 29 fps on Galaxy Note8 for online video recognition.
no code implementations • ICCV 2021 • Zhijian Liu, Simon Stent, Jie Li, John Gideon, Song Han
Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets.
2 code implementations • 22 Jul 2021 • Hanrui Wang, Yongshan Ding, Jiaqi Gu, Zirui Li, Yujun Lin, David Z. Pan, Frederic T. Chong, Song Han
Extensively evaluated with 12 QML and VQE benchmarks on 14 quantum computers, QuantumNAS significantly outperforms baselines.
no code implementations • 27 May 2021 • Yujun Lin, Mengtian Yang, Song Han
Data-driven, automatic design space exploration of neural accelerator architecture is desirable for specialization and productivity.
no code implementations • 20 May 2021 • Zhijian Liu, Alexander Amini, Sibo Zhu, Sertac Karaman, Song Han, Daniela Rus
On the other hand, increasing the robustness of these systems is also critical; however, even estimating the model's uncertainty is very challenging due to the cost of sampling-based methods.
1 code implementation • 10 Mar 2021 • Huizi Mao, Sibo Zhu, Song Han, William J. Dally
Object recognition is a fundamental problem in many video processing tasks; accurately locating seen objects at low computation cost paves the way for on-device video recognition.
1 code implementation • CVPR 2021 • Ji Lin, Richard Zhang, Frieder Ganz, Song Han, Jun-Yan Zhu
Generative adversarial networks (GANs) have enabled photorealistic image synthesis and editing.
Ranked #1 on Image Generation on FFHQ 128 x 128
no code implementations • 17 Dec 2020 • Hanrui Wang, Zhekai Zhang, Song Han
Inspired by the high redundancy of human languages, we propose a novel cascade token pruning technique that prunes away unimportant tokens in the sentence.
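In the spirit of token pruning, the sketch below scores each token by the total attention it receives and keeps only the top-k tokens before the next layer. The cumulative (cascade) scoring across layers and the hardware scheduling in SpAtten are omitted, and the scoring rule here is an assumption for illustration.

```python
import numpy as np

def prune_tokens(hidden, attn_probs, keep_ratio=0.5):
    # hidden: (T, D) token embeddings; attn_probs: (H, T, T) attention maps.
    # Importance of a token = total attention it receives across heads/queries.
    importance = attn_probs.sum(axis=(0, 1))
    k = max(1, int(keep_ratio * hidden.shape[0]))
    keep = np.sort(np.argsort(importance)[-k:])   # top-k, kept in original order
    return hidden[keep], keep

T, D, H = 8, 16, 4
hidden = np.random.randn(T, D)
attn = np.random.rand(H, T, T)
attn /= attn.sum(-1, keepdims=True)               # each attention row sums to 1
pruned, kept = prune_tokens(hidden, attn)
print(kept, pruned.shape)
```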
1 code implementation • 2 Nov 2020 • Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, Song Han
To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.
no code implementations • 11 Aug 2020 • Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han
Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures.
2 code implementations • ECCV 2020 • Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, Song Han
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely.
Ranked #3 on 3D Semantic Segmentation on SemanticKITTI
1 code implementation • NeurIPS 2020 • Han Cai, Chuang Gan, Ligeng Zhu, Song Han
Furthermore, combined with feature extractor adaptation, TinyTL provides 7.3-12.9x memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3.
no code implementations • NeurIPS 2020 • Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, Song Han
Machine learning on tiny IoT devices based on microcontroller units (MCUs) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than that of mobile phones.
10 code implementations • NeurIPS 2020 • Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, Song Han
Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100.
Ranked #1 on Image Generation on CIFAR-10 (20% data)
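The core idea of differentiable augmentation can be sketched in a few lines of PyTorch: apply the same random, differentiable transforms to both real and generated images inside the discriminator and generator losses, so gradients still reach the generator. The two transforms below (brightness jitter and random cutout) and the usage comments are simplified stand-ins, not the paper's exact augmentation policy.

```python
import torch

def diff_augment(x):
    # x: (B, C, H, W) images in [-1, 1]. All ops stay differentiable w.r.t. x.
    b = x.size(0)
    # Random per-image brightness shift.
    x = x + (torch.rand(b, 1, 1, 1, device=x.device) - 0.5)
    # Random square cutout per image (mask multiplication keeps gradients flowing).
    h, w = x.shape[2], x.shape[3]
    ch, cw = h // 2, w // 2
    mask = torch.ones_like(x)
    for i in range(b):
        top = torch.randint(0, h - ch + 1, (1,)).item()
        left = torch.randint(0, w - cw + 1, (1,)).item()
        mask[i, :, top:top + ch, left:left + cw] = 0
    return x * mask

# Usage inside a training step (D, G, criteria assumed defined elsewhere):
#   d_loss = d_criterion(D(diff_augment(real)), D(diff_augment(G(z))))
#   g_loss = g_criterion(D(diff_augment(G(z))))
```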
1 code implementation • CVPR 2020 • Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Song Han
However, training this quantization-aware accuracy predictor requires collecting a large number of quantized <model, accuracy> pairs, which involves quantization-aware finetuning and thus is highly time-consuming.
4 code implementations • ACL 2020 • Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han
To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search.
1 code implementation • 16 May 2020 • Zhongxia Yan, Hanrui Wang, Demi Guo, Song Han
In this paper, we provide the winning solution to the NeurIPS 2019 MicroNet Challenge in the language modeling track.
1 code implementation • ICLR 2020 • Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
Most of the traditional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally expensive and unscalable.
no code implementations • 30 Apr 2020 • Hanrui Wang, Kuan Wang, Jiacheng Yang, Linxiao Shen, Nan Sun, Hae-Seung Lee, Song Han
Our transferable optimization method makes transistor sizing and design porting more effective and efficient.
2 code implementations • ICLR 2020 • Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han
For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs.
1 code implementation • CVPR 2020 • Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, Song Han
Directly applying existing compression methods yields poor performance due to the difficulty of GAN training and the differences in generator architectures.
no code implementations • 20 Feb 2020 • Zhekai Zhang, Hanrui Wang, Song Han, William J. Dally
We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x.
Hardware Architecture • Distributed, Parallel, and Cluster Computing
1 code implementation • NeurIPS 2019 • Hongzi Mao, Parimarjan Negi, Akshay Narayan, Hanrui Wang, Jiacheng Yang, Haonan Wang, Ryan Marcus, Ravichandra Addanki, Mehrdad Khani Shirkoohi, Songtao He, Vikram Nathan, Frank Cangialosi, Shaileshh Venkatakrishnan, Wei-Hung Weng, Song Han, Tim Kraska, Mohammad Alizadeh
We present Park, a platform for researchers to experiment with Reinforcement Learning (RL) for computer systems.
no code implementations • 1 Oct 2019 • Ji Lin, Chuang Gan, Song Han
With such hardware-aware model design, we are able to scale up the training on the Summit supercomputer and reduce the training time on the Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6x and 2.9x faster than previous 3D video models with higher accuracy.
no code implementations • 25 Sep 2019 • Ligeng Zhu, Yao Lu, Yujun Lin, Song Han
Traditional synchronous distributed training is performed inside a cluster, since it requires a high-bandwidth, low-latency network (e.g., 25Gb Ethernet or InfiniBand).
8 code implementations • 26 Aug 2019 • Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han
On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top-1 accuracy improvement over MobileNetV3, or the same accuracy but 1.5x faster than MobileNetV3 and 2.6x faster than EfficientNet w.r.t. measured latency) while reducing GPU hours and $CO_2$ emission by many orders of magnitude.
Ranked #48 on Neural Architecture Search on ImageNet
3 code implementations • NeurIPS 2019 • Zhijian Liu, Haotian Tang, Yujun Lin, Song Han
The computation cost and memory footprints of the voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution.
Ranked #1 on 3D Object Detection on KITTI Pedestrian Easy val
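To make the cubic growth claim above concrete, a quick back-of-the-envelope calculation (the channel count and dtype are arbitrary assumptions for illustration):

```python
# Dense voxel grids: doubling the resolution multiplies the number of voxels
# (and hence dense activation memory) by 8.
channels, bytes_per_val = 64, 4
for r in (32, 64, 128, 256):
    voxels = r ** 3
    mb = voxels * channels * bytes_per_val / 2 ** 20
    print(f"resolution {r:>3}^3 -> {voxels:>11,} voxels, ~{mb:,.0f} MB per feature map")
```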
5 code implementations • NeurIPS 2019 • Ligeng Zhu, Zhijian Liu, Song Han
Exchanging gradients is a widely used method in modern multi-node machine learning systems (e.g., distributed training, collaborative learning).
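A simplified sketch of the gradient-matching attack studied in this paper: optimize a dummy input and a soft label so that their gradients match the shared ones. The tiny model, input shape, and optimizer settings below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))

# "Victim" example and the gradients that would be shared.
x_true = torch.randn(1, 1, 8, 8)
y_true = torch.tensor([3])
loss = F.cross_entropy(model(x_true), y_true)
true_grads = torch.autograd.grad(loss, model.parameters())

# Attacker's dummy input and soft label, optimized to match the shared gradients.
x_dummy = torch.randn(1, 1, 8, 8, requires_grad=True)
y_dummy = torch.randn(1, 10, requires_grad=True)
opt = torch.optim.LBFGS([x_dummy, y_dummy], lr=1.0)

for _ in range(10):
    def closure():
        opt.zero_grad()
        dummy_loss = torch.sum(-F.log_softmax(model(x_dummy), dim=-1)
                               * F.softmax(y_dummy, dim=-1))
        dummy_grads = torch.autograd.grad(dummy_loss, model.parameters(),
                                          create_graph=True)
        match = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
        match.backward()
        return match
    opt.step(closure)

print("reconstruction error:", (x_dummy.detach() - x_true).norm().item())
```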
no code implementations • 24 Apr 2019 • Song Han, Han Cai, Ligeng Zhu, Ji Lin, Kuan Wang, Zhijian Liu, Yujun Lin
Moreover, we shorten the design cycle by 200x compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.
no code implementations • ICLR 2019 • Ji Lin, Chuang Gan, Song Han
This paper aims to raise awareness of the security of quantized models, and we design a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models.
no code implementations • 29 Mar 2019 • Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael. I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek Murray, Kunle Olukotun, Dimitris Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Siddhartha Sen, Virginia Smith, Alex Smola, Dawn Song, Evan Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar
Machine learning (ML) techniques are enjoying rapidly increasing adoption.
no code implementations • 5 Dec 2018 • Hanrui Wang, Jiacheng Yang, Hae-Seung Lee, Song Han
We propose Learning to Design Circuits (L2DC), leveraging reinforcement learning to efficiently generate new circuit data and to optimize circuits.
16 code implementations • ICLR 2019 • Han Cai, Ligeng Zhu, Song Han
We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level of regular training while still allowing a large candidate set.
Ranked #6 on Neural Architecture Search on CIFAR-10 Image Classification (using extra training data)
10 code implementations • CVPR 2019 • Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han
Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures.
9 code implementations • ICCV 2019 • Ji Lin, Chuang Gan, Song Han
The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.
Ranked #13 on Video Object Detection on ImageNet VID
no code implementations • NIPS Workshop CDNNRIA 2018 • Ting Chen, Ji Lin, Tian Lin, Song Han, Chong Wang, Denny Zhou
Modern deep neural networks have a large number of weights, which makes them difficult to deploy on computation-constrained devices such as mobile phones.
3 code implementations • ICML 2018 • Han Cai, Jiacheng Yang, Wei-Nan Zhang, Song Han, Yong Yu
We introduce a new function-preserving transformation for efficient neural architecture search.
2 code implementations • 16 Apr 2018 • Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, Benjamin Kreis, Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan Tran, Zhenbin Wu
For our example jet substructure model, we fit well within the available resources of modern FPGAs with a latency on the scale of 100 ns.
1 code implementation • ICLR 2018 • Xingyu Liu, Jeff Pool, Song Han, William J. Dally
First, we move the ReLU operation into the Winograd domain to increase the sparsity of the transformed activations.
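A minimal 1-D Winograd F(2,3) sketch of this idea, using the standard transform matrices: ReLU is applied to the transformed activations B^T d rather than to the spatial-domain input, so the sparsity appears where the elementwise multiplications happen. The 1-D, single-tile setting is a simplification of the paper's 2-D convolutions.

```python
import numpy as np

# Standard F(2,3) transform matrices (Lavin & Gray).
Bt = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_relu_conv(d, g):
    # d: 4 input values, g: 3 filter taps -> 2 outputs of a "valid" convolution tile.
    U = G @ g                       # filter transform
    V = np.maximum(Bt @ d, 0.0)     # ReLU applied in the Winograd domain
    return At @ (U * V)             # elementwise product + output transform

d = np.array([1.0, -2.0, 3.0, -1.0])
g = np.array([0.5, 1.0, -0.5])
print(winograd_relu_conv(d, g))
```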
11 code implementations • ECCV 2018 • Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han
Model compression is a critical technique to efficiently deploy neural network models on mobile devices which have limited computation resources and tight power budgets.
1 code implementation • The International Conference on Learning Representations 2017 • Yujun Lin, Song Han, Huizi Mao, Yu Wang, W. Dally
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure.
2 code implementations • ICLR 2018 • Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally
The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections.
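The basic sparsification step behind deep gradient compression can be sketched as top-k selection with local residual accumulation: only the largest entries are transmitted each round, and the rest are kept locally for later. Momentum correction, warm-up, and the other techniques in the paper are omitted, and the sparsity ratio below is an arbitrary choice.

```python
import numpy as np

def sparsify(grad, residual, ratio=0.01):
    # Accumulate the new gradient into the local residual, send only the
    # largest entries by magnitude, and keep the rest for future rounds.
    residual += grad
    k = max(1, int(ratio * residual.size))
    idx = np.argpartition(np.abs(residual), -k)[-k:]
    values = residual[idx].copy()
    residual[idx] = 0.0            # transmitted entries are cleared locally
    return idx, values

rng = np.random.default_rng(0)
residual = np.zeros(10_000)
for step in range(3):
    grad = rng.normal(size=10_000)
    idx, values = sparsify(grad, residual)
    print(f"step {step}: sent {idx.size} of {grad.size} values")
```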
2 code implementations • 31 May 2017 • Morteza Mardani, Enhao Gong, Joseph Y. Cheng, Shreyas Vasanawala, Greg Zaharchuk, Marcus Alley, Neil Thakur, Song Han, William Dally, John M. Pauly, Lei Xing
A multilayer convolutional neural network is then jointly trained based on diagnostic quality images to discriminate the projection quality.
no code implementations • 24 May 2017 • Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally
Since memory reference is more than two orders of magnitude more expensive than arithmetic operations, the regularity of sparse structure leads to more efficient hardware design.
no code implementations • 8 Dec 2016 • Ioannis Papavasileiou, Wenlong Zhang, Xin Wang, Jinbo Bi, Li Zhang, Song Han
An advanced machine learning method, multi-task feature learning (MTFL), is used to jointly train classification models of a subject's gait in three classes: post-stroke, PD, and healthy gait.
4 code implementations • 4 Dec 2016 • Chenzhuo Zhu, Song Han, Huizi Mao, William J. Dally
To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values.
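A minimal sketch of ternary weight quantization in the spirit of TTQ: weights are mapped to {-Wn, 0, +Wp} using a magnitude threshold, with separate positive and negative scales. In the paper the scales are learned by backpropagation; setting them to the mean magnitude of the weights they replace, as below, is an assumption made for illustration.

```python
import numpy as np

def ternarize(w, t=0.05):
    # Threshold small weights to zero, replace the rest by per-sign scales.
    thresh = t * np.max(np.abs(w))
    pos, neg = w > thresh, w < -thresh
    wp = np.abs(w[pos]).mean() if pos.any() else 0.0
    wn = np.abs(w[neg]).mean() if neg.any() else 0.0
    q = np.zeros_like(w)
    q[pos], q[neg] = wp, -wn
    return q

w = np.random.randn(4, 4)
print(ternarize(w))
```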
no code implementations • 1 Dec 2016 • Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally
Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations, respectively.
2 code implementations • 15 Jul 2016 • Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally
We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance.
49 code implementations • 24 Feb 2016 • Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer
(2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car.
Ranked #14 on Network Pruning on ImageNet
no code implementations • 5 Feb 2016 • Shijian Tang, Song Han
Generating natural language descriptions for images is a challenging task.
4 code implementations • 4 Feb 2016 • Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, William J. Dally
EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW.
no code implementations • NeurIPS 2015 • Song Han, Jeff Pool, John Tran, William Dally
On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9×, from 61 million to 6.7 million, without incurring accuracy loss.
14 code implementations • 1 Oct 2015 • Song Han, Huizi Mao, William J. Dally
To address this limitation, we introduce "deep compression", a three-stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
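The first two stages of the pipeline can be sketched as magnitude pruning followed by k-means weight sharing; Huffman coding of the cluster indices (the third stage) is omitted, and the pruning ratio and cluster count below are arbitrary choices for illustration.

```python
import numpy as np

def prune_and_quantize(w, prune_ratio=0.7, n_clusters=16, iters=20):
    flat = w.flatten()
    # Stage 1: magnitude pruning — zero out the smallest weights.
    thresh = np.quantile(np.abs(flat), prune_ratio)
    mask = np.abs(flat) > thresh
    survivors = flat[mask]
    # Stage 2: weight sharing — simple k-means on the surviving weights.
    centers = np.linspace(survivors.min(), survivors.max(), n_clusters)
    for _ in range(iters):
        assign = np.argmin(np.abs(survivors[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = survivors[assign == c].mean()
    quantized = np.zeros_like(flat)
    quantized[mask] = centers[assign]
    return quantized.reshape(w.shape), mask.mean()

w = np.random.randn(64, 64)
q, density = prune_and_quantize(w)
print(f"kept {density:.0%} of weights, {np.unique(q).size} distinct values")
```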
7 code implementations • NeurIPS 2015 • Song Han, Jeff Pool, John Tran, William J. Dally
On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss.
no code implementations • 11 Dec 2012 • Song Han, Jinsong Kim, Cholhun Kim, Jongchol Jo, Sunam Han
Face recognition systems must be robust to the variation of various factors such as facial expression, illumination, head pose and aging.