TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Bug fixing	SWE-bench	SWE-Llama 13b	Resolved (unassisted)	0.70%	# 2
Bug fixing	SWE-bench	SWE-Llama 13b	Resolved (assisted)	4%	# 2
Bug fixing	SWE-bench	SWE-Llama 7b	Resolved (unassisted)	0.70%	# 2
Bug fixing	SWE-bench	SWE-Llama 7b	Resolved (assisted)	3%	# 3

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/swe-bench-can-language-models-resolve-real/bug-fixing-on-swe-bench)](https://paperswithcode.com/sota/bug-fixing-on-swe-bench?p=swe-bench-can-language-models-resolve-real)`

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

10 Oct 2023 · Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
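To put the percentages in the abstract in perspective, the following minimal sketch converts a reported resolve rate into an approximate number of issues. It assumes the rate is computed over all 2,294 problems; the function names are hypothetical and are not part of any SWE-bench tooling.

```python
TOTAL_PROBLEMS = 2294  # benchmark size stated in the abstract


def resolved_rate(resolved: int, total: int = TOTAL_PROBLEMS) -> float:
    """Return the percentage of problems counted as resolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


def approx_resolved_count(rate_percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Invert a reported percentage into an approximate issue count.

    Assumption: the percentage was measured over `total` problems.
    """
    return round(rate_percent * total / 100.0)


if __name__ == "__main__":
    count = approx_resolved_count(1.96)  # rate reported for Claude 2
    print(count, f"{resolved_rate(count):.2f}%")
```

Under that assumption, a 1.96% resolve rate corresponds to roughly 45 of the 2,294 issues.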

Code

No code implementations yet.

Tasks

Bug fixing

Code Generation

Language Modelling

Llama

Datasets

Introduced in the Paper:

SWE-bench

Used in the Paper:

HumanEval

Results from the Paper

Ranked #2 on Bug fixing on SWE-bench

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • GPT-4 • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
