1 code implementation • 2 Nov 2023 • Jianghao Chen, Pu Jian, Tengxiao Xi, Dongyi Yi, Qianlong Du, Chenglin Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, Jiajun Zhang
Using our proposed approach, we release ChineseWebText, the largest and most recent large-scale, high-quality Chinese web text dataset. It comprises 1.42 TB of text, and each text is associated with a quality score, allowing LLM researchers to select data according to their desired quality threshold.
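As an illustration of threshold-based selection, the sketch below filters records by their quality score. It assumes, hypothetically, that the data is stored as JSON Lines with a `text` field and a numeric `score` field; the actual file format and field names of the release are not stated here and may differ. The function name `filter_by_quality` and the sample records are likewise invented for this example.

```python
import json
from typing import Iterable, Iterator


def filter_by_quality(
    lines: Iterable[str],
    threshold: float,
    score_key: str = "score",  # assumed field name, not confirmed by the source
) -> Iterator[dict]:
    """Yield JSON records whose quality score is at least `threshold`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without a score are treated as score 0.0 and dropped
        # for any positive threshold.
        if record.get(score_key, 0.0) >= threshold:
            yield record


# Made-up sample records, for illustration only.
sample = [
    '{"text": "example of a high-quality passage", "score": 0.93}',
    '{"text": "example of a low-quality passage", "score": 0.41}',
]
kept = list(filter_by_quality(sample, threshold=0.9))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample`, so a large shard is streamed rather than loaded into memory at once.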