Search Results for author: Jiarui Fang

Found 12 papers, 6 papers with code

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

no code implementations • 16 Mar 2019 • Jiarui Fang, Liandeng Li, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, Guangwen Yang

Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe.

TurboTransformers: An Efficient GPU Serving System For Transformer Models

no code implementations • 9 Oct 2020 • Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou

This paper designed a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework to solve the above challenges.

Management

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

1 code implementation • 28 Oct 2021 • Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You

The success of Transformer models has pushed the deep learning model scale to billions of parameters.

A Frequency-aware Software Cache for Large Recommendation System Embeddings

1 code implementation • 8 Aug 2022 • Jiarui Fang, Geng Zhang, Jiatong Han, Shenggui Li, Zhengda Bian, Yongbin Li, Jin Liu, Yang You

Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

no code implementations • 6 Sep 2022 • Jiangsu Du, Ziming Liu, Jiarui Fang, Shenggui Li, Yongbin Li, Yutong Lu, Yang You

Although the AI community has expanded the model scale to the trillion parameter level, the practical deployment of 10-100 billion parameter models is still uncertain due to the latency, throughput, and memory constraints.

Blocking

Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

1 code implementation • 6 Feb 2023 • Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, Yang You

To address these challenges, we introduce a system that can jointly optimize distributed execution and gradient checkpointing plans.

Scheduling

AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference

no code implementations • 19 Jan 2024 • Xuanlei Zhao, Shenggan Cheng, Guangyang Lu, Jiarui Fang, Haotian Zhou, Bin Jia, Ziming Liu, Yang You

The experiments demonstrate that AutoChunk can reduce over 80% of activation memory while maintaining speed loss within 10%, extend max sequence length by 3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.

Code Generation
