3 code implementations • 2 Sep 2024 • Keer Lu, Xiaonan Nie, Zheng Liang, Da Pan, Shusen Zhang, Keshi Zhao, WeiPeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang
Through extensive experimental analysis, we identified three key challenges in designing effective data management strategies that enable the model to achieve long-context capability without sacrificing performance in other tasks: (1) a shortage of long documents across multiple domains, (2) effective construction of context windows, and (3) efficient organization of large-scale datasets.
no code implementations • 27 Aug 2024 • Guosheng Dong, Da Pan, Yiding Sun, Shusen Zhang, Zheng Liang, Xin Wu, Yanjun Shen, Fan Yang, Haoze Sun, Tianpeng Li, MingAn Lin, Jianhua Xu, Yufan Zhang, Xiaonan Nie, Lei Su, Bingning Wang, Wentao Zhang, Jiaxin Mao, Zenan Zhou, WeiPeng Chen
The general capabilities of Large Language Models (LLMs) rely heavily on the composition and selection of extensive pretraining datasets, which several institutions treat as commercial secrets.
no code implementations • 16 Jul 2024 • Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui
By leveraging fine-grained activation memory management, MEMO enables efficient training of a 7B LLM with a sequence length of 1 million tokens on just 8 A800 GPUs, achieving an MFU of 52.30%.
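For context on the reported MFU figure, here is a hedged back-of-the-envelope of how model FLOPs utilization is commonly estimated; the 6N-plus-attention accounting follows the widely used PaLM-style approximation, and the 312 TFLOPS peak for an A800 is an assumed figure, not something stated in the paper.

```python
def model_flops_utilization(tokens_per_sec, num_params, num_layers,
                            hidden_dim, seq_len, num_gpus,
                            peak_tflops_per_gpu=312.0):
    """Back-of-the-envelope MFU estimate (illustrative assumptions).

    Per-token training FLOPs are approximated as 6*N for the dense matmuls
    plus 12*L*H*S for self-attention over a sequence of length S
    (PaLM-style accounting). peak_tflops_per_gpu defaults to an assumed
    312 TFLOPS of dense BF16 throughput for an A800.
    """
    flops_per_token = 6 * num_params + 12 * num_layers * hidden_dim * seq_len
    achieved_flops_per_sec = tokens_per_sec * flops_per_token
    peak_flops_per_sec = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops_per_sec / peak_flops_per_sec
```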
no code implementations • 1 Jul 2024 • Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, WeiPeng Chen, Bin Cui
During the prefilling phase, we apply product quantization (PQ) to the tokens' keys for each LLM layer and head.
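A minimal sketch of what product quantization over per-head keys can look like, using k-means-trained sub-codebooks and a lookup-table approximation of the attention scores; shapes, parameter names, and the helper functions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def pq_encode_keys(keys, num_subspaces=8, codebook_bits=8):
    """Encode key vectors with product quantization (illustrative only).

    keys: (num_tokens, head_dim) key vectors for one layer and head.
    Each key is split into `num_subspaces` sub-vectors; each sub-vector is
    replaced by the index of its nearest centroid in a per-subspace
    codebook learned by k-means (assumes num_tokens >= 2**codebook_bits).
    """
    num_tokens, head_dim = keys.shape
    assert head_dim % num_subspaces == 0
    sub_dim = head_dim // num_subspaces
    k = 2 ** codebook_bits

    codebooks = np.empty((num_subspaces, k, sub_dim), dtype=np.float32)
    codes = np.empty((num_tokens, num_subspaces), dtype=np.int32)
    for s in range(num_subspaces):
        sub = keys[:, s * sub_dim:(s + 1) * sub_dim].astype(np.float32)
        centroids, labels = kmeans2(sub, k, minit="points")
        codebooks[s], codes[:, s] = centroids, labels
    return codebooks, codes

def pq_approx_scores(query, codebooks, codes):
    """Approximate q·k scores from PQ codes via per-subspace lookup tables."""
    num_subspaces, k, sub_dim = codebooks.shape
    q_sub = query.reshape(num_subspaces, sub_dim)
    # table[s, c] = q_sub[s] · codebooks[s, c]
    table = np.einsum("sd,scd->sc", q_sub, codebooks)
    # score[t] = sum over subspaces of the looked-up partial dot products
    return table[np.arange(num_subspaces), codes].sum(axis=1)
```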
1 code implementation • 1 May 2024 • Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, WeiPeng Chen, Bin Cui
Consequently, the GPU spends most of its time on memory transfer instead of computation.
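To make the memory-bound claim concrete, a hedged arithmetic-intensity sketch for single-stream autoregressive decoding; the byte and FLOP counts are the usual rough approximations, and the GPU figures mentioned in the comment are assumptions.

```python
def decode_arithmetic_intensity(num_params, batch_size=1, bytes_per_param=2):
    """Why small-batch decoding is memory-bound (illustrative numbers).

    Each decoding step streams essentially all weights from HBM
    (num_params * bytes_per_param bytes) while performing only about
    2 * num_params FLOPs per generated token, so the FLOP/byte ratio is
    roughly equal to the batch size.
    """
    flops = 2 * num_params * batch_size          # matmul FLOPs per step
    bytes_moved = num_params * bytes_per_param   # weight traffic per step
    return flops / bytes_moved

# e.g. a 7B-parameter fp16 model at batch size 1 lands at ~1 FLOP/byte,
# whereas a modern datacenter GPU needs well over a hundred FLOPs/byte
# to be compute-bound (assumed hardware figures).
```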
1 code implementation • 5 Jul 2023 • Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models.
no code implementations • 8 Apr 2023 • Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, Bin Cui
We first present an empirical analysis of the problems and opportunities in training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems through a dynamic expert management and device placement mechanism.
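To illustrate the flavor of dynamic expert management and placement, a toy heuristic that gives hot experts extra replicas and spreads them across devices; the function, its greedy policy, and its parameters are simplified assumptions for illustration, not the paper's actual algorithm.

```python
import heapq

def plan_expert_placement(expert_load, num_devices, total_replica_slots):
    """Toy re-placement heuristic for imbalanced MoE routing (illustrative only).

    Hot experts receive extra replicas (greedy on per-replica load), and each
    replica is then assigned to the currently least-loaded device.
    expert_load: dict mapping expert id -> number of routed tokens.
    """
    assert total_replica_slots >= len(expert_load), "need one slot per expert"

    # 1) Replica counts: repeatedly grant a slot to the expert whose
    #    per-replica token load is currently the highest.
    replicas = {e: 1 for e in expert_load}
    for _ in range(total_replica_slots - len(expert_load)):
        hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
        replicas[hottest] += 1

    # 2) Placement: each replica carries load/replicas tokens; put it on the
    #    device with the smallest accumulated load (min-heap over devices).
    heap = [(0.0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    for e in sorted(expert_load, key=expert_load.get, reverse=True):
        per_replica = expert_load[e] / replicas[e]
        for _ in range(replicas[e]):
            load, d = heapq.heappop(heap)
            placement[d].append(e)
            heapq.heappush(heap, (load + per_replica, d))
    return replicas, placement
```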
no code implementations • 6 Mar 2023 • Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui
Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially the Transformer models.
2 code implementations • 25 Nov 2022 • Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui
Transformer models have achieved state-of-the-art performance across various application domains and have gradually become the foundation of advanced large-scale deep learning (DL) models.
2 code implementations • 29 Dec 2021 • Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang, Bin Cui
Then it diversifies the experts and continues to train the MoE with a novel Dense-to-Sparse gate (DTS-Gate).
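The dense-to-sparse idea is to start by routing tokens to many experts and gradually concentrate routing onto a few; below is a hedged PyTorch-style sketch of one way to realize that with a temperature-annealed softmax gate. The linear temperature schedule, the top-k cutoff, and all parameter names are assumptions, not the paper's exact DTS-Gate formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseToSparseGate(nn.Module):
    """Toy gate that anneals from dense (all experts) to sparse (top-k) routing.

    Early in training the softmax temperature is high, so every expert
    receives a non-trivial share of each token; as the temperature decays,
    the routing distribution concentrates and only the top-k experts are kept.
    """

    def __init__(self, hidden_dim, num_experts, top_k=2,
                 start_temp=2.0, end_temp=0.1, anneal_steps=10_000):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.start_temp, self.end_temp = start_temp, end_temp
        self.anneal_steps = anneal_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def temperature(self):
        frac = min(self.step.item() / self.anneal_steps, 1.0)
        return self.start_temp + frac * (self.end_temp - self.start_temp)

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        temp = self.temperature()
        probs = F.softmax(self.router(x) / temp, dim=-1)  # (tokens, experts)
        if temp > self.end_temp:
            # Dense phase: route each token to all experts.
            weights, experts = probs, None
        else:
            # Sparse phase: keep only the top-k experts per token.
            weights, experts = probs.topk(self.top_k, dim=-1)
            weights = weights / weights.sum(dim=-1, keepdim=True)
        if self.training:
            self.step += 1
        return weights, experts
```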
3 code implementations • 14 Dec 2021 • Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, Bin Cui
Embedding models have been an effective learning paradigm for high-dimensional data.