Search Results for author: Rajhans Samdani

Found 6 papers, 0 papers with code

Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

no code implementations • 3 Sep 2024 • Yuxiang Wei, Hojae Han, Rajhans Samdani

Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining.

Code Generation • HumanEval
