no code implementations • 27 Feb 2025 • Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He
Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings.
no code implementations • 3 Sep 2024 • Yuxiang Wei, Hojae Han, Rajhans Samdani
Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination; (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files along with instruction data from Magicoder and StarCoder2-Instruct; and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining.
no code implementations • 15 Jun 2019 • Rajhans Samdani, Pierre Rappolt, Ankit Goyal, Pratyus Patnaik
We present a system, Spoke, for creating and searching internal knowledge base (KB) articles for organizations.