Measuring Coding Challenge Competence With APPS

While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to assess code generation performance in a way that is both accurate and rigorous. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors decreases exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.
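
The abstract describes evaluating models by running their generated code against test cases. As a rough sketch of what such a check can look like, the snippet below runs a candidate Python program on each input and compares its output to the expected answer. The helper name, file path, timeout, and test-case format are illustrative assumptions, not the benchmark's official harness.

import subprocess

def passes_test_cases(solution_path, test_cases, timeout=4):
    # test_cases: list of (stdin_text, expected_stdout) pairs
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                ["python3", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # timed out: count as a failure
        if result.returncode != 0:
            return False  # crashed or raised an exception
        if result.stdout.strip() != expected_stdout.strip():
            return False  # wrong answer
    return True

# Hypothetical usage: a problem asking for the sum of two integers.
# passes_test_cases("candidate.py", [("1 2\n", "3\n")])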

Datasets

Introduced in the Paper: APPS
Used in the Paper: test

Results from the Paper


Task: Code Generation   Dataset: APPS   Model: GPT-Neo 2.7B

Metric                  Value    Global Rank
Introductory Pass@1     3.90%    #9
Interview Pass@1        0.57%    #9
Competition Pass@1      0.00%    #9
Introductory Pass@5     5.50%    #5
Interview Pass@5        0.80%    #5
Competition Pass@5      0.00%    #5
Introductory Pass@any   5.50%    #8
Interview Pass@any      0.80%    #8
Competition Pass@any    0.00%    #8
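
For reference, the Pass@k and Pass@any figures above count a problem as solved when at least one of the model's generated samples passes every test case for that problem. The sketch below shows the naive first-k form of that bookkeeping; the variable names and per-problem boolean format are assumptions, and leaderboards may compute the metric with an unbiased estimator instead.

def pass_at_k(per_problem_results, k):
    # per_problem_results: for each problem, a list of booleans, one per
    # generated sample, True if that sample passed all of the problem's tests.
    solved = sum(1 for samples in per_problem_results if any(samples[:k]))
    return solved / len(per_problem_results)

def pass_at_any(per_problem_results):
    # A problem counts as solved if any sample, regardless of how many were
    # drawn, passes all of its test cases.
    solved = sum(1 for samples in per_problem_results if any(samples))
    return solved / len(per_problem_results)

# Hypothetical example: 3 problems, 5 samples each.
results = [
    [False, False, True, False, False],
    [False, False, False, False, False],
    [True, False, False, True, False],
]
print(pass_at_k(results, 1))   # Pass@1   -> approx. 0.33
print(pass_at_k(results, 5))   # Pass@5   -> approx. 0.67
print(pass_at_any(results))    # Pass@any -> approx. 0.67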

Methods