4 code implementations • 27 Feb 2024 • Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs).
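This line of work constrains weights to the ternary set {-1, 0, +1} (roughly 1.58 bits per weight) via an absmean scale. Below is a minimal NumPy sketch of that quantization step; the function name and epsilon are illustrative, not from the paper's code.

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-6):
    """Quantize a weight matrix to ternary values {-1, 0, +1}.

    A minimal sketch of absmean quantization: scale by the mean
    absolute weight, then round and clip to [-1, 1].
    """
    gamma = np.abs(W).mean()                 # absmean scale
    W_scaled = W / (gamma + eps)             # normalize by the scale
    W_ternary = np.clip(np.round(W_scaled), -1, 1)
    return W_ternary, gamma                  # gamma rescales activations later

# Example: quantize a random weight matrix
W = np.random.randn(4, 4)
Wq, g = absmean_ternary_quantize(W)
print(Wq)  # entries are in {-1.0, 0.0, 1.0}
```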
8 code implementations • 17 Jul 2023 • Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei
In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance.
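Retention admits an equivalent recurrent form, S_n = γ·S_{n-1} + k_n^T v_n with o_n = q_n S_n, which is what enables the low-cost, O(1)-per-token inference. A minimal single-head sketch of that recurrence (shapes and the decay value are illustrative):

```python
import numpy as np

def recurrent_retention(Q, K, V, gamma=0.9):
    """Recurrent form of retention for a single head.

    Follows the paper's recurrence: S_n = gamma * S_{n-1} + k_n^T v_n,
    o_n = q_n @ S_n. Shapes: Q, K are (T, d_k); V is (T, d_v).
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))                   # recurrent state
    out = np.zeros((T, d_v))
    for n in range(T):
        S = gamma * S + np.outer(K[n], V[n])   # decay, then add new key-value
        out[n] = Q[n] @ S                      # read out for token n
    return out

T, d = 8, 16
Q, K, V = (np.random.randn(T, d) for _ in range(3))
print(recurrent_retention(Q, K, V).shape)  # (8, 16)
```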
no code implementations • 8 Apr 2023 • Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, Bin Cui
We first present an empirical analysis of the problems and opportunities in training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems with a dynamic expert management and device placement mechanism.
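As a rough illustration of load-aware expert placement (the policy and names below are hypothetical, not the paper's algorithm), one can count per-expert token load over a window and replicate overloaded experts across devices:

```python
import numpy as np

def rebalance_experts(expert_ids, num_experts, num_devices, threshold=1.5):
    """Toy sketch of load-aware expert placement.

    Counts how many tokens each expert received this window; experts whose
    load exceeds `threshold` times the mean get an extra replica, and
    replicas are assigned to devices round-robin.
    """
    load = np.bincount(expert_ids, minlength=num_experts)
    mean_load = load.mean()
    replicas = np.ones(num_experts, dtype=int)
    replicas[load > threshold * mean_load] += 1   # replicate hot experts
    placement, dev = {}, 0
    for e in range(num_experts):
        placement[e] = [(dev + r) % num_devices for r in range(replicas[e])]
        dev = (dev + replicas[e]) % num_devices
    return placement

# 1000 routed tokens across 4 experts, skewed toward expert 0
ids = np.random.choice(4, size=1000, p=[0.6, 0.2, 0.1, 0.1])
print(rebalance_experts(ids, num_experts=4, num_devices=2))
```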
2 code implementations • 29 Dec 2021 • Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang, Bin Cui
It then diversifies the experts and continues to train the MoE with a novel Dense-to-Sparse gate (DTS-Gate).
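The core idea of a dense-to-sparse gate can be sketched as a softmax over experts whose temperature is annealed during training: early on, every expert receives gradient signal; as the temperature cools, routing concentrates on a few experts. The schedule and top-k cutoff below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def dts_gate(logits, temperature, k=2):
    """Sketch of a dense-to-sparse gate via temperature annealing.

    High temperature spreads routing weight across (almost) all experts;
    low temperature concentrates it, at which point we truncate to top-k.
    """
    probs = np.exp(logits / temperature)
    probs /= probs.sum(axis=-1, keepdims=True)
    if temperature < 0.5:                       # sparse regime: keep top-k
        topk = np.argsort(probs, axis=-1)[:, -k:]
        mask = np.zeros_like(probs)
        np.put_along_axis(mask, topk, 1.0, axis=-1)
        probs = probs * mask
        probs /= probs.sum(axis=-1, keepdims=True)
    return probs

logits = np.random.randn(4, 8)             # 4 tokens, 8 experts
print(dts_gate(logits, temperature=2.0))   # early training: near-dense
print(dts_gate(logits, temperature=0.1))   # late training: sparse top-2
```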
no code implementations • 19 Oct 2018 • Lingxiao Ma, Zhi Yang, Youshan Miao, Jilong Xue, Ming Wu, Lidong Zhou, Yafei Dai
This evolution has led to large graph-based irregular and sparse models that go beyond what existing deep learning frameworks are designed for.
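The irregular, sparse pattern such systems target shows up in a single graph-neural-network layer, where feature aggregation is a matrix product over the graph's adjacency structure. A generic sketch of this gather-then-apply pattern (not the paper's specific programming model):

```python
import numpy as np

def gcn_layer(adj, H, W):
    """One graph layer: gather neighbor features, then apply a transform.

    `adj` is the graph adjacency matrix (dense here for simplicity; in
    practice it is sparse and irregular), H the node features, W the weights.
    """
    agg = adj @ H                    # gather: sum features over neighbors
    return np.maximum(agg @ W, 0)    # apply: linear transform + ReLU

# tiny 4-node ring graph
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 8)            # node features
W = np.random.randn(8, 8)            # layer weights
print(gcn_layer(adj, H, W).shape)    # (4, 8)
```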
no code implementations • 22 May 2018 • Jilong Xue, Youshan Miao, Cheng Chen, Ming Wu, Lintao Zhang, Lidong Zhou
Its computation is typically characterized by a simple tensor data abstraction for multi-dimensional matrices, a data-flow graph that models the computation, and iterative execution with relatively frequent synchronization, making it substantially different from MapReduce-style distributed big-data computation.
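A toy sketch of that iterative, frequently synchronized pattern: each training step produces per-worker gradients that must be synchronized (here, simply averaged) before the next iteration can proceed, unlike a one-shot batch job:

```python
import numpy as np

def sync_gradients(grads_per_worker):
    """Average gradients across workers, standing in for the per-iteration
    synchronization step (e.g., an allreduce) in distributed training."""
    return np.mean(grads_per_worker, axis=0)

w = np.zeros(4)                                      # model parameters
for step in range(3):                                # iterative execution
    grads = [np.random.randn(4) for _ in range(2)]   # two workers' gradients
    w -= 0.1 * sync_gradients(np.stack(grads))       # sync, then update
print(w)
```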