Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

7 Mar 2024 · Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference using LLMs. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.
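To make the FIM task concrete, here is a minimal sketch of how an infilling prompt is typically assembled and how a completion might be post-processed. The sentinel strings below follow the common prefix-suffix-middle (PSM) convention but are placeholders: each model family defines its own sentinel vocabulary, and SAFIM's actual post-processing is syntax-aware (it uses program structure rather than this plain string cut).

```python
def build_psm_prompt(prefix: str, suffix: str,
                     pre: str = "<fim_prefix>",
                     suf: str = "<fim_suffix>",
                     mid: str = "<fim_middle>") -> str:
    """Assemble a PSM-style FIM prompt.

    The model sees the code before and after the gap and is asked to
    generate the missing middle after the final sentinel. The sentinel
    tokens here are illustrative defaults, not any specific model's.
    """
    return f"{pre}{prefix}{suf}{suffix}{mid}"


def extract_middle(generation: str, eot: str = "<|endoftext|>") -> str:
    """Naive baseline post-processing: truncate at the end-of-text marker.

    SAFIM's evaluation instead applies syntax-aware truncation (e.g.
    closing the completed code block at the right structural boundary);
    this string split is only the syntax-unaware baseline.
    """
    return generation.split(eot, 1)[0]
```

For example, to ask a model to fill in the body of a function, one would pass the code before the gap as `prefix` and the code after it as `suffix`, then truncate whatever the model generates:

```python
prompt = build_psm_prompt("def add(a, b):\n    ", "\n")
completion = extract_middle("return a + b<|endoftext|>extra text")
```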


Datasets


Introduced in the Paper:

SAFIM

Results from the Paper


Task: Code Completion  ·  Dataset: SAFIM  ·  Cells show Metric Value (Global Rank)

Model                    | Algorithmic | Average     | Control     | API
-------------------------|-------------|-------------|-------------|------------
deepseek-coder-33b-base  | 60.78 (#1)  | 69.01 (#1)  | 71.10 (#1)  | 75.16 (#1)
gpt-3.5-turbo-0301       | 31.24 (#9)  | 40.86 (#9)  | 37.48 (#9)  | 53.87 (#8)
deepseek-coder-6.7b-base | 54.74 (#2)  | 63.40 (#2)  | 65.79 (#2)  | 69.68 (#2)
deepseek-coder-1.3b-base | 41.20 (#6)  | 52.63 (#6)  | 54.10 (#6)  | 62.58 (#4)
starcoderbase            | 44.11 (#3)  | 55.54 (#3)  | 54.46 (#5)  | 68.06 (#3)
CodeLlama-34b-hf         | 38.55 (#7)  | 49.66 (#7)  | 53.98 (#7)  | 56.45 (#7)
CodeLlama-13b-hf         | 41.41 (#5)  | 52.78 (#5)  | 57.25 (#3)  | 59.68 (#6)
CodeLlama-7b-hf          | 34.68 (#8)  | 45.00 (#8)  | 53.56 (#8)  | 46.77 (#10)
incoder-6B               | 25.16 (#11) | 33.79 (#10) | 28.16 (#13) | 48.06 (#9)
incoder-1B               | 21.06 (#14) | 29.27 (#13) | 22.89 (#15) | 43.87 (#11)
codegen-16B-multi        | 25.94 (#10) | 30.99 (#11) | 35.74 (#10) | 31.29 (#13)
codegen-6B-multi         | 23.60 (#12) | 28.71 (#14) | 34.80 (#11) | 27.74 (#14)
codegen-2B-multi         | 23.49 (#13) | 29.55 (#12) | 32.89 (#12) | 32.26 (#12)
codegen-350M-multi       | 16.30 (#15) | 22.94 (#15) | 26.06 (#14) | 26.45 (#15)
gpt-4-1106-preview       | 42.11 (#4)  | 53.28 (#4)  | 55.15 (#4)  | 62.58 (#4)

Methods


No methods listed for this paper.