Search Results for author: Guohao Dai

Found 24 papers, 7 papers with code

DiTFastAttn: Attention Compression for Diffusion Transformer Models

no code implementations12 Jun 2024 Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

We identify three key redundancies in the attention computation during DiT inference: 1. spatial redundancy, where many attention heads focus on local information; 2. temporal redundancy, with high similarity between neighboring steps' attention outputs; 3. conditional redundancy, where conditional and unconditional inferences exhibit significant similarity.

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

no code implementations4 Jun 2024 Tianchen Zhao, Tongcheng Fang, Enshu Liu, Wan Rui, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions.

Quantization Video Generation

HetHub: A Heterogeneous distributed hybrid training system for large-scale models

no code implementations25 May 2024 Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, Yu Wang

To address the problem, this paper proposes a distributed training system with hybrid parallelism support on heterogeneous GPU-accelerators for large-scale models.

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

1 code implementation23 May 2024 Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu

In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate ``weak-to-strong'' training strategy that pretrains DiM on low-resolution images ($256\times 256$) and then finetune it on high-resolution images ($512 \times 512$).

Image Generation

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

1 code implementation2 Apr 2024 Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

For example, LCSC achieves better performance using 1 number of function evaluation (NFE) than the base model with 2 NFE on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality on CIFAR-10.

Evaluating Quantized Large Language Models

1 code implementation28 Feb 2024 Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs.


LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

1 code implementation6 Feb 2024 Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang

In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation.


FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

no code implementations8 Jan 2024 Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang

However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads.

Computational Efficiency Language Modelling +2

Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization

no code implementations28 Nov 2023 Jinhao Li, Shiyao Li, Jiaming Xu, Shan Huang, Yaoxiu Lian, Jun Liu, Yu Wang, Guohao Dai

Weights are quantized by groups, while the ranges of weights are large in some groups, resulting in large quantization errors and nonnegligible accuracy loss (e. g. >3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit).


FlashDecoding++: Faster Large Language Model Inference on GPUs

no code implementations2 Nov 2023 Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

A single and static dataflow may lead to a 50. 25% performance loss for GEMMs of different shapes in LLM inference.

Language Modelling Large Language Model

TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs

1 code implementation25 Oct 2023 Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, Song Han

On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads.

Autonomous Driving Recommendation Systems

Seeking the Yield Barrier: High-Dimensional SRAM Evaluation Through Optimal Manifold

no code implementations28 Jul 2023 Yanfang Liu, Guohao Dai, Wei W. Xing

We then generalize it with infinite components and derive the novel optimal manifold concept, which bridges the surrogate-based and importance sampling (IS) yield estimation methods.

Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection

no code implementations ICCV 2023 Tianchen Zhao, Xuefei Ning, Ke Hong, Zhongyuan Qiu, Pu Lu, Yali Zhao, Linfeng Zhang, Lipu Zhou, Guohao Dai, Huazhong Yang, Yu Wang

One reason for this high resource consumption is the presence of a large number of redundant background points in Lidar point clouds, resulting in spatial redundancy in both 3D voxel and dense BEV map representations.

3D Object Detection Autonomous Driving +1

Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training

no code implementations31 May 2023 Yijia Zhang, Yibo Han, Shijie Cao, Guohao Dai, Youshan Miao, Ting Cao, Fan Yang, Ningyi Xu

We find that previous gradient accumulation reduces activation memory but fails to be compatible with gradient memory reduction due to a contradiction between preserving gradients and releasing gradients.

High-Dimensional Yield Estimation using Shrinkage Deep Features and Maximization of Integral Entropy Reduction

no code implementations5 Dec 2022 Shuo Yin, Guohao Dai, Wei W. Xing

Despite the fast advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the curse of dimensionality, which is inevitable when dealing with modern large-scale circuits, remains unsolved.

Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective

no code implementations18 Oct 2021 Hengrui Zhang, Zhongming Yu, Guohao Dai, Guyue Huang, Yufei Ding, Yuan Xie, Yu Wang

The same data are propagated through the graph structure to perform the same neural operation multiple times in GNNs, leading to redundant computation which accounts for 92. 4% of total operators.

CogDL: A Comprehensive Library for Graph Deep Learning

1 code implementation1 Mar 2021 Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Zhongming Yu, Hengrui Zhang, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Yuxiao Dong, Yang Yang, Peng Zhang, Guohao Dai, Yu Wang, Chang Zhou, Hongxia Yang, Jie Tang

In CogDL, we propose a unified design for the training and evaluation of GNN models for various graph tasks, making it unique among existing graph learning libraries.

Graph Classification Graph Embedding +5

Explore the Potential of CNN Low Bit Training

no code implementations1 Jan 2021 Kai Zhong, Xuefei Ning, Tianchen Zhao, Zhenhua Zhu, Shulin Zeng, Guohao Dai, Yu Wang, Huazhong Yang

Through this dynamic precision framework, we can reduce the bit-width of convolution, which is the most computational cost, while keeping the training process close to the full precision floating-point training.


GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks

2 code implementations7 Jul 2020 Guyue Huang, Guohao Dai, Yu Wang, Huazhong Yang

GE-SpMM performs SpMM-like operation on sparse matrices represented in the most common Compressed Sparse Row (CSR) format, so it can be embedded in GNN frameworks with no preprocessing overheads and support general GNN algorithms.

Distributed, Parallel, and Cluster Computing

Exploring the Potential of Low-bit Training of Convolutional Neural Networks

no code implementations4 Jun 2020 Kai Zhong, Xuefei Ning, Guohao Dai, Zhenhua Zhu, Tianchen Zhao, Shulin Zeng, Yu Wang, Huazhong Yang

For training a variety of models on CIFAR-10, using 1-bit mantissa and 2-bit exponent is adequate to keep the accuracy loss within $1\%$.


Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud

no code implementations26 Mar 2020 Shulin Zeng, Guohao Dai, Hanbo Sun, Kai Zhong, Guangjun Ge, Kaiyuan Guo, Yu Wang, Huazhong Yang

Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing way for multiple users sharing a single FPGA, and require re-compilation with $\sim$100 s overhead.

Cannot find the paper you are looking for? You can Submit a new open access paper.