Search Results for author: Baodong Wu

Found 1 paper, 1 paper with code

TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

1 code implementation • 16 Oct 2023 • Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, YuHeng Chen, Shigang Li

As a result, a substantial amount of training time is devoted to task checkpoint saving and loading, task rescheduling and restart, and manual task anomaly checks, which greatly harms overall training efficiency.

Anomaly Detection
