Search Results for author: Jiarui Fang

Found 11 papers, 5 papers with code

Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models

1 code implementation • 6 Feb 2023 • Yuliang Liu, Shenggui Li, Jiarui Fang, Yanjun Shao, Boyuan Yao, Yang You

To address these challenges, we introduce a system that can jointly optimize distributed execution and gradient checkpointing plans.
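Colossal-Auto's own planning interface is not shown in this listing; as a minimal sketch of the activation (gradient) checkpointing technique the system plans automatically, here is a plain PyTorch example using torch.utils.checkpoint. The `Block` module and all sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Hypothetical transformer-style block used only for illustration."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedNet(nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            # Discard this block's intermediate activations and recompute
            # them during backward: extra compute for lower peak memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedNet(dim=256, depth=8)
out = model(torch.randn(4, 256, requires_grad=True))
out.sum().backward()
```

Deciding *which* blocks to checkpoint, jointly with how to parallelize them, is the search problem the paper automates.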


Elixir: Train a Large Language Model on a Small GPU Cluster

no code implementations • 10 Dec 2022 • Haichen Huang, Jiarui Fang, Hongxin Liu, Shenggui Li, Yang You

People who lack access to large numbers of GPUs resort to heterogeneous training systems that store model parameters in CPU memory.

Language Modelling
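Elixir's API is not reproduced in this snippet; the following is a forward-only sketch of the heterogeneous-training idea the abstract refers to: keep parameters resident in CPU memory and stage each layer onto the GPU only while it is in use. All names are hypothetical, and a real system such as Elixir would also overlap transfers with compute and handle the backward pass.

```python
import torch
import torch.nn as nn

def offloaded_forward(layers, x, device="cuda"):
    """Forward pass with parameters kept in CPU memory (sketch only).

    Each layer is copied to the GPU just before use and evicted right
    after, so peak GPU memory holds roughly one layer, not the model.
    """
    x = x.to(device)
    for layer in layers:
        layer.to(device)   # stage this layer's parameters onto the GPU
        x = layer(x)
        layer.to("cpu")    # evict to free GPU memory for the next layer
    return x

layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(12)])  # lives on CPU
if torch.cuda.is_available():
    y = offloaded_forward(layers, torch.randn(8, 1024))
```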

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

no code implementations • 6 Sep 2022 • Jiangsu Du, Ziming Liu, Jiarui Fang, Shenggui Li, Yongbin Li, Yutong Lu, Yang You

Although the AI community has pushed model scale to the trillion-parameter level, the practical deployment of 10-100 billion parameter models is still hampered by latency, throughput, and memory constraints.

A Frequency-aware Software Cache for Large Recommendation System Embeddings

1 code implementation • 8 Aug 2022 • Jiarui Fang, Geng Zhang, Jiatong Han, Shenggui Li, Zhengda Bian, Yongbin Li, Jin Liu, Yang You

Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
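The paper's CUDA-managed cache is not shown in this snippet; below is a toy, hypothetical sketch of the frequency-aware idea it describes: keep the most frequently accessed embedding rows in fast memory and serve cold rows from the full table in host memory. All class and parameter names are invented for illustration.

```python
from collections import Counter
import torch

class FreqAwareEmbeddingCache:
    """Toy frequency-aware cache: hot rows on the fast device, cold on CPU.

    Sketch only; the paper's system manages a GPU cache over a
    host-memory embedding table with batched, pipelined transfers.
    """
    def __init__(self, cpu_weight: torch.Tensor, capacity: int, device="cuda"):
        self.cpu_weight = cpu_weight   # full embedding table in host memory
        self.capacity = capacity       # number of rows kept in fast memory
        self.freq = Counter()          # access frequency per row id
        self.cache = {}                # row id -> tensor on the fast device
        self.device = device

    def lookup(self, ids):
        rows = []
        for i in ids.tolist():
            self.freq[i] += 1
            if i not in self.cache:
                if len(self.cache) >= self.capacity:
                    # Evict the least frequently used cached row.
                    victim = min(self.cache, key=lambda k: self.freq[k])
                    del self.cache[victim]
                self.cache[i] = self.cpu_weight[i].to(self.device)
            rows.append(self.cache[i])
        return torch.stack(rows)

table = torch.randn(100_000, 64)   # embedding table resident on CPU
cache = FreqAwareEmbeddingCache(table, capacity=4096, device="cpu")
vecs = cache.lookup(torch.randint(0, 100_000, (256,)))
```

Because recommendation IDs are heavily skewed, a small frequency-ranked cache can serve the vast majority of lookups from fast memory.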

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

1 code implementation • 28 Oct 2021 • Shenggui Li, Jiarui Fang, Zhengda Bian, Hongxin Liu, Yuliang Liu, Haichen Huang, Boxiang Wang, Yang You

The success of Transformer models has pushed the deep learning model scale to billions of parameters.

TurboTransformers: An Efficient GPU Serving System For Transformer Models

no code implementations • 9 Oct 2020 • Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou

This paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework that address these challenges.
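The TurboTransformers runtime itself is not shown here; as a hypothetical sketch of one serving-side concern the paper targets, variable-length inputs, this groups requests of similar length before padding, which reduces wasted computation on pad tokens. The function name and batching policy are illustrative; the real system uses a dedicated batch scheduler and a memory allocator aware of variable-length requests.

```python
import torch

def batch_by_length(requests, max_batch=8):
    """Toy batcher: group variable-length requests so padding stays small."""
    requests = sorted(requests, key=len)        # similar lengths batch together
    batches = []
    for i in range(0, len(requests), max_batch):
        chunk = requests[i:i + max_batch]
        max_len = max(len(r) for r in chunk)
        padded = torch.zeros(len(chunk), max_len, dtype=torch.long)
        for j, r in enumerate(chunk):
            padded[j, : len(r)] = torch.tensor(r)
        batches.append(padded)                  # less padding -> less wasted compute
    return batches

reqs = [[5, 3, 9], [1, 2], [7, 7, 7, 7, 7], [4]]
for b in batch_by_length(reqs):
    print(b.shape)
```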


swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

no code implementations • 16 Mar 2019 • Jiarui Fang, Liandeng Li, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, Guangwen Yang

We propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe.
