no code implementations • 19 Apr 2025 • Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance.
no code implementations • 24 Feb 2025 • Jiahui Peng, Xinlin Zhuang, Qiu Jiantao, Ren Ma, Jing Yu, Tianyi Bai, Conghui He
The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various domains, sources, and topics.
1 code implementation • 10 Oct 2024 • Tianyi Bai, Ling Yang, Zhen Hao Wong, Jiahui Peng, Xinlin Zhuang, Chi Zhang, Lijun Wu, Jiantao Qiu, Wentao Zhang, Binhang Yuan, Conghui He
Efficient data selection is crucial to accelerate the pretraining of large language models (LLMs).
1 code implementation • 28 May 2024 • Bin Wang, Linke Ouyang, Fan Wu, Wenchang Ning, Xiao Han, Zhiyuan Zhao, Jiahui Peng, Yiying Jiang, Dahua Lin, Conghui He
In the era of artificial intelligence, the diversity of data modalities and annotation formats often renders data unusable directly, requiring understanding and format conversion before it can be used by researchers or developers with different needs.
no code implementations • 29 Apr 2024 • Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He
In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture.
1 code implementation • 4 Jan 2024 • Jiahui Peng, Jing Gao, Xin Tong, Jing Guo, Hang Yang, Jianchuan Qi, Ruiqiao Li, Nan Li, Ming Xu
In the evolving field of corporate sustainability, analyzing unstructured Environmental, Social, and Governance (ESG) reports is a complex challenge due to their varied formats and intricate content.
2 code implementations • 24 Aug 2023 • Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He
A practical solution to this problem would be to utilize the available multimodal large language models (MLLMs) to generate instruction data for vision-language tasks.