Search Results for author: Jiarui Fang

Found 14 papers, 8 papers with code

PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models

1 code implementation • 23 May 2024 • Jiannan Wang, Jiarui Fang, Aoyu Li, Pengcheng Yang

This paper introduces PipeFusion, a novel approach that harnesses multi-GPU parallelism to address the high computational cost and latency of generating high-resolution images with diffusion transformer (DiT) models.
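As a rough illustration of the patch-pipeline idea named in the title (a hypothetical, much simpler schedule than PipeFusion's displaced-patch design): the image is split into patches, and each pipeline stage (device) works on a different patch at each step, so all stages stay busy once the pipeline fills.

```python
# Hypothetical sketch of patch-level pipeline parallelism; the function name
# and schedule are illustrative assumptions, not PipeFusion's actual algorithm.

def pipeline_schedule(num_patches, num_stages):
    """Return (step, stage, patch) triples for a simple patch pipeline."""
    schedule = []
    for step in range(num_patches + num_stages - 1):
        for stage in range(num_stages):
            patch = step - stage  # stage k lags the input by k steps
            if 0 <= patch < num_patches:
                schedule.append((step, stage, patch))
    return schedule

# With 4 patches and 2 stages, stage 1 starts on patch 0 at step 1,
# one step behind stage 0; every patch visits every stage exactly once.
sched = pipeline_schedule(4, 2)
assert (0, 0, 0) in sched and (1, 1, 0) in sched
```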

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

2 code implementations • 13 May 2024 • Jiarui Fang, Shangchun Zhao

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models.
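The core idea of sequence parallelism can be sketched in a few lines (hypothetical helpers, not the USP API): shard the sequence dimension of the input across devices, process each shard locally, then gather the results along the same dimension.

```python
# Minimal sketch of sequence-dimension sharding; split_sequence and
# gather_sequence are illustrative names assumed for this example.

def split_sequence(tokens, num_devices):
    """Partition a token sequence into near-equal contiguous shards."""
    base, extra = divmod(len(tokens), num_devices)
    shards, start = [], 0
    for rank in range(num_devices):
        size = base + (1 if rank < extra else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards

def gather_sequence(shards):
    """Reassemble per-device outputs along the sequence dimension."""
    out = []
    for shard in shards:
        out.extend(shard)
    return out

tokens = list(range(10))
shards = split_sequence(tokens, 4)   # [[0,1,2], [3,4,5], [6,7], [8,9]]
assert gather_sequence(shards) == tokens
```

In a real system each shard would live on a different GPU and attention would require cross-device communication; the sketch only shows the partitioning that makes the per-device memory footprint scale with the shard length rather than the full context.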

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference

no code implementations • 19 Jan 2024 • Xuanlei Zhao, Shenggan Cheng, Guangyang Lu, Jiarui Fang, Haotian Zhou, Bin Jia, Ziming Liu, Yang You

The experiments demonstrate that AutoChunk can reduce over 80% of activation memory while keeping speed loss within 10%, extend the maximum sequence length by 3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.
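The memory saving comes from chunked execution. A minimal sketch of the principle (an assumption for illustration, not AutoChunk's automated search): instead of materializing the activation for the whole sequence at once, apply the computation chunk by chunk so the peak intermediate footprint is one chunk rather than the full sequence.

```python
# Illustrative sketch of activation chunking; chunked_apply is a hypothetical
# helper, not part of AutoChunk's actual interface.

def chunked_apply(fn, xs, chunk_size):
    """Apply fn over xs in fixed-size chunks, bounding peak intermediate size."""
    out = []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        out.extend(fn(chunk))  # only one chunk's activations live at a time
    return out

square = lambda chunk: [x * x for x in chunk]
assert chunked_apply(square, list(range(6)), 2) == [0, 1, 4, 9, 16, 25]
```

The speed loss the snippet mentions comes from launching the computation once per chunk instead of once per sequence; choosing chunk boundaries to balance that overhead against memory is the part AutoChunk automates.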


Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

1 code implementation • 6 Feb 2023 • Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, Yang You

To address these challenges, we introduce a system that can jointly optimize distributed execution and gradient checkpointing plans.


EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

no code implementations • 6 Sep 2022 • Jiangsu Du, Ziming Liu, Jiarui Fang, Shenggui Li, Yongbin Li, Yutong Lu, Yang You

Although the AI community has expanded the model scale to the trillion-parameter level, the practical deployment of 10-100 billion parameter models remains uncertain due to latency, throughput, and memory constraints.


A Frequency-aware Software Cache for Large Recommendation System Embeddings

1 code implementation • 8 Aug 2022 • Jiarui Fang, Geng Zhang, Jiatong Han, Shenggui Li, Zhengda Bian, Yongbin Li, Jin Liu, Yang You

Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
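The paper's title names the technique: a frequency-aware software cache for embedding rows. A heavily simplified sketch of that idea (hypothetical class and eviction policy, not the paper's actual design): keep hot rows in a small device-resident cache and evict the least-frequently-used row on overflow.

```python
# Illustrative frequency-aware cache; FreqCache and its LFU eviction are
# assumptions for this sketch, far simpler than the paper's system.

class FreqCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = {}   # row id -> cached embedding vector
        self.freq = {}   # row id -> access count (kept across evictions)

    def get(self, row_id, fetch):
        """Return the cached row, fetching from host memory on a miss."""
        if row_id not in self.rows:
            if len(self.rows) >= self.capacity:
                # Evict the least-frequently-accessed resident row.
                victim = min(self.rows, key=lambda r: self.freq[r])
                del self.rows[victim]
            self.rows[row_id] = fetch(row_id)
        self.freq[row_id] = self.freq.get(row_id, 0) + 1
        return self.rows[row_id]

cache = FreqCache(capacity=2)
fetch = lambda r: [float(r)] * 4          # stand-in for a host-side lookup
for r in [0, 0, 1, 2]:                    # row 0 is hot; cold row 1 is evicted
    cache.get(r, fetch)
assert set(cache.rows) == {0, 2}
```

The motivation is the skewed access pattern of DLRM embeddings: a small fraction of rows accounts for most lookups, so a frequency-aware cache keeps the hot set on the GPU while the full table stays in host memory.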

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

1 code implementation • 28 Oct 2021 • Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You

The success of Transformer models has pushed the deep learning model scale to billions of parameters.

TurboTransformers: An Efficient GPU Serving System For Transformer Models

no code implementations • 9 Oct 2020 • Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou

This paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that together address these challenges.


swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

no code implementations • 16 Mar 2019 • Jiarui Fang, Liandeng Li, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, Guangwen Yang

Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe.
