Search Results for author: Shuaiwen Leon Song

Found 14 papers, 8 papers with code

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

1 code implementation · 19 Sep 2023 · Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

We propose Flash-LLM to enable low-cost and highly efficient large generative model inference, with sophisticated support for unstructured sparsity on high-performance but highly restrictive Tensor Cores.
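
The released kernels are CUDA Tensor Core code; as a language-agnostic illustration of the paper's "load-as-sparse, compute-as-dense" idea, the NumPy/SciPy sketch below stores an unstructured-sparse weight matrix compactly and densifies it only at compute time. Shapes and the sparsity ratio are illustrative, not taken from the paper.

    # Sketch of "load as sparse, compute as dense" (NumPy/SciPy stand-in for
    # the paper's CUDA Tensor Core kernels; shapes/sparsity are illustrative).
    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)

    # Unstructured sparsity: ~80% of weights zeroed at arbitrary positions.
    W = rng.standard_normal((4096, 4096)).astype(np.float32)
    W[rng.random(W.shape) < 0.8] = 0.0

    # Load as sparse: CSR stores only nonzeros plus indices, cutting the
    # memory footprint and bandwidth needed to fetch the weights.
    W_csr = sparse.csr_matrix(W)
    x = rng.standard_normal((4096, 16)).astype(np.float32)

    # Compute as dense: densify and use the dense math path. On GPU this
    # reconstruction happens tile-by-tile on chip before the Tensor Core MMA.
    y = W_csr.toarray() @ x
    assert np.allclose(y, W @ x)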

COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression

1 code implementation · 18 Nov 2021 · Sian Jin, Chengming Zhang, Xintong Jiang, Yunhe Feng, Hui Guan, Guanpeng Li, Shuaiwen Leon Song, Dingwen Tao

In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement of training, allowing larger models to be trained or training to be accelerated.

Data Compression
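
COMET's key mechanism is error-bounded lossy compression of activation data. The toy uniform quantizer below illustrates only the error-bound guarantee that such compressors provide; real compressors such as SZ add prediction and entropy coding, and the bound used here is an illustrative value.

    # Toy error-bounded lossy compression of activations: quantizing with a
    # step of 2*eb guarantees every reconstructed value is within eb of the
    # original. Real compressors (e.g., SZ) add prediction + entropy coding.
    import numpy as np

    def compress(a, eb):
        return np.round(a / (2.0 * eb)).astype(np.int16)   # 2 bytes/value

    def decompress(q, eb):
        return q.astype(np.float32) * (2.0 * eb)

    eb = 1e-2                                              # illustrative bound
    acts = np.random.default_rng(0).standard_normal((64, 256)).astype(np.float32)
    recon = decompress(compress(acts, eb), eb)
    print("max abs error:", np.abs(recon - acts).max())    # <= eb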

Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving

no code implementations · 7 Oct 2021 · Qiyu Wan, Haojun Xia, Xingyao Zhang, Lening Wang, Shuaiwen Leon Song, Xin Fu

Bayesian Neural Networks (BNNs), which provide uncertainty estimates, have been increasingly adopted in a wide range of safety-critical AI applications that demand reliable and robust decision making, e.g., self-driving, rescue robots, and medical image diagnosis.

Decision Making

Dr. Top-k: Delegate-Centric Top-k on GPUs

1 code implementation · 16 Sep 2021 · Anil Gaihre, Da Zheng, Scott Weitze, Lingda Li, Shuaiwen Leon Song, Caiwen Ding, Xiaoye S. Li, Hang Liu

Recent top-$k$ computation efforts explore the possibility of revising various sorting algorithms to answer top-$k$ queries on GPUs.
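
The paper's implementation is CUDA; the plain-NumPy sketch below illustrates the delegate idea suggested by the title: each partition nominates a small delegate set (its local top-k), and the global top-k is answered over the delegates only. The function name and partition count are illustrative.

    # Delegate-style top-k in NumPy: each partition nominates its local
    # top-k as delegates; the global answer is computed over delegates only.
    import numpy as np

    def delegate_topk(data, k, num_partitions=64):
        parts = np.array_split(data, num_partitions)
        delegates = np.concatenate(
            [np.partition(p, -min(k, p.size))[-min(k, p.size):] for p in parts]
        )
        # Correct because every global top-k element is necessarily among
        # its own partition's local top-k.
        return np.sort(np.partition(delegates, -k)[-k:])[::-1]

    data = np.random.default_rng(0).random(1_000_000)
    assert np.array_equal(delegate_topk(data, 10), np.sort(data)[-10:][::-1])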

Randomness In Neural Network Training: Characterizing The Impact of Tooling

1 code implementation · 22 Jun 2021 · Donglin Zhuang, Xingyao Zhang, Shuaiwen Leon Song, Sara Hooker

However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overheads of up to $746\%$, $241\%$, and $196\%$ on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training.
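
For context, "ensuring determinism" in a PyTorch setting typically means controls like the following; these are real PyTorch/CUDA knobs, though the paper's exact experimental setup may differ.

    # Typical knobs for deterministic PyTorch training; their cost is what
    # the paper quantifies. Set the cuBLAS workspace before CUDA starts.
    import os, random
    import numpy as np
    import torch

    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    random.seed(0); np.random.seed(0); torch.manual_seed(0)

    torch.use_deterministic_algorithms(True)  # error on nondeterministic ops
    torch.backends.cudnn.benchmark = False    # no autotuning across runs
    torch.backends.cudnn.deterministic = True # deterministic cuDNN kernels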

ClickTrain: Efficient and Accurate End-to-End Deep Learning Training via Fine-Grained Architecture-Preserving Pruning

no code implementations · 20 Nov 2020 · Chengming Zhang, Geng Yuan, Wei Niu, Jiannan Tian, Sian Jin, Donglin Zhuang, Zhe Jiang, Yanzhi Wang, Bin Ren, Shuaiwen Leon Song, Dingwen Tao

Moreover, compared with the state-of-the-art pruning-during-training approach, ClickTrain provides significant improvements in both accuracy and compression ratio on the tested CNN models and datasets, under a comparable, limited training time.
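
ClickTrain's fine-grained, architecture-preserving pruning is not reproduced here, but a generic sketch of the pruning-during-training pattern it is compared against looks like this in PyTorch (magnitude-based masking; all hyperparameters are illustrative).

    # Generic pruning-during-training via magnitude masking (illustrative
    # hyperparameters; not ClickTrain's pattern-based method).
    import torch
    import torch.nn as nn

    layer = nn.Linear(512, 512)
    opt = torch.optim.SGD(layer.parameters(), lr=0.1)

    for step in range(100):
        x = torch.randn(32, 512)
        loss = layer(x).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 10 == 0:                      # periodically re-derive mask
            with torch.no_grad():
                w = layer.weight
                k = int(0.8 * w.numel())        # target 80% sparsity
                thr = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > thr).float()) # train under pruned weights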

A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

no code implementations · 18 Nov 2020 · Sian Jin, Guanpeng Li, Shuaiwen Leon Song, Dingwen Tao

In this paper, we propose a novel memory-driven, high-performance DNN training framework that leverages error-bounded lossy compression to significantly reduce the memory requirement of training, in order to allow training larger networks.

ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs

2 code implementations · 9 Feb 2020 · Cody Rivera, Jieyang Chen, Nan Xiong, Shuaiwen Leon Song, Dingwen Tao

Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped inputs.

Distributed, Parallel, and Cluster Computing

Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design

no code implementations · 7 Nov 2019 · Xingyao Zhang, Shuaiwen Leon Song, Chenhao Xie, Jing Wang, Weigong Zhang, Xin Fu

In recent years, CNNs have achieved great success in image processing tasks, e.g., image recognition and object detection.

Image Segmentation · Object Detection +2

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

1 code implementation · 11 Mar 2019 · Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan Tallent, Kevin Barker

High-performance multi-GPU computing is becoming an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale simulations.

Hardware Architecture · Distributed, Parallel, and Cluster Computing · Networking and Internet Architecture · Performance
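
As a minimal illustration of interconnect measurement (not the paper's microbenchmark suite), a host-to-device bandwidth probe can be written in PyTorch; the reported number reflects whatever link, e.g., PCIe, connects host and GPU on your system.

    # Host-to-device bandwidth probe (requires a CUDA GPU; measures the
    # link between host and device, e.g., PCIe).
    import time
    import torch

    assert torch.cuda.is_available()
    x = torch.empty(256 * 1024 * 1024, dtype=torch.uint8).pin_memory()  # 256 MiB

    x.to("cuda", non_blocking=True)             # warm-up copy
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        x.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    print(f"H2D bandwidth: {10 * x.numel() / dt / 1e9:.1f} GB/s")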

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

no code implementations · 13 Jan 2018 · Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska

Given limited GPU DRAM, SuperNeurons not only provisions the necessary memory for training, but also dynamically allocates memory for convolution workspaces to achieve high performance.

Management · Scheduling
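
SuperNeurons' own runtime combines liveness analysis, tensor offloading, and recomputation; a readily available cousin of its recomputation idea is activation checkpointing, sketched here with PyTorch's built-in utility (the model and segment count are illustrative).

    # Activation checkpointing: keep only segment-boundary activations and
    # recompute the rest during backward, trading compute for memory.
    import torch
    from torch.utils.checkpoint import checkpoint_sequential

    model = torch.nn.Sequential(
        *[torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
          for _ in range(8)]
    )
    x = torch.randn(64, 1024, requires_grad=True)

    y = checkpoint_sequential(model, 4, x, use_reentrant=False)
    y.sum().backward()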
