8 dataset results for Program Repair

DeepFix

DeepFix is a program repair dataset for fixing compiler errors in C programs. It supports research on automatically fixing programming errors using deep learning.

37 PAPERS • 1 BENCHMARK

ManySStuBs4J

The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques. We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug-fix changes. These are single-statement fixes, classified where possible into one of 16 syntactic templates which we call SStuBs. The dataset contains simple statement bugs mined from open-source Java projects hosted on GitHub. There are two variants of the dataset: one mined from 100 Java Maven projects and one mined from the top 1000 Java projects.

22 PAPERS • NO BENCHMARKS YET

CodRep

Five curated datasets of one-liner commits from open-source projects. In total, they are composed of 58069 one-liner commits.

6 PAPERS • NO BENCHMARKS YET

xCodeEval

xCodeEval is one of the largest executable multilingual multitask benchmarks, covering 17 programming languages with execution-level parallelism. It features a total of seven tasks involving code understanding, generation, translation, and retrieval, and it employs execution-based evaluation instead of traditional lexical approaches. It also provides a test-case-based multilingual code execution engine, ExecEval, which supports all the programming languages in xCodeEval.

6 PAPERS • NO BENCHMARKS YET

GitHub-Python

A dataset for repairing AST parse (syntax) errors in Python code.
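To make "AST parse error" concrete, here is a minimal sketch of how one might check whether a Python snippet parses, using the standard-library `ast` module. The helper name `has_parse_error` and the two sample strings are illustrative assumptions, not part of the dataset or its tooling.

```python
import ast


def has_parse_error(source: str) -> bool:
    """Return True if `source` cannot be parsed into a Python AST."""
    try:
        ast.parse(source)
    except SyntaxError:
        # IndentationError and TabError are subclasses of SyntaxError.
        return True
    return False


# Hypothetical broken/fixed pair: the broken version lacks a closing parenthesis.
broken = "def f(x:\n    return x + 1\n"
fixed = "def f(x):\n    return x + 1\n"

print(has_parse_error(broken))
print(has_parse_error(fixed))
```

A repair model for this task would be expected to turn an input like `broken` into an output like `fixed`, which a check of this kind can verify at the syntax level only.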

4 PAPERS • 1 BENCHMARK

ETH Py150 Open

A massive, deduplicated corpus of 7.4M Python files from GitHub.

3 PAPERS • NO BENCHMARKS YET

Defects4J

Defects4J is a collection of reproducible bugs and a supporting infrastructure with the goal of advancing software engineering research.

2 PAPERS • NO BENCHMARKS YET

TFix's Code Patches Data

The dataset contains more than 100k code patch pairs extracted from open-source projects on GitHub. Each pair consists of the erroneous and the fixed version of the corresponding code snippet. Rather than the whole file, only a snippet is extracted, to focus on the problematic region (the error line plus surrounding lines). For each sample, the repository name, the commit ID, and the file names are provided so that the complete files can be retrieved if needed.
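The idea of keeping only the error line and its neighbours can be sketched as follows. This is an illustrative assumption, not the dataset's actual extraction code: the function name `extract_snippet` and the window size `context` are hypothetical, since the description does not say how many surrounding lines are kept.

```python
def extract_snippet(source: str, error_line: int, context: int = 2) -> str:
    """Return the 1-indexed `error_line` plus up to `context` lines on each side.

    `context` is an assumed window size chosen for illustration.
    """
    lines = source.splitlines()
    if not 1 <= error_line <= len(lines):
        raise ValueError("error_line is outside the file")
    start = max(0, error_line - 1 - context)   # clamp at the top of the file
    end = min(len(lines), error_line + context)  # clamp at the bottom
    return "\n".join(lines[start:end])


# Hypothetical six-line file with the reported error on line 3.
example_file = "a\nb\nc\nd\ne\nf"
print(extract_snippet(example_file, error_line=3, context=1))
```

The same window would be applied to the erroneous and the fixed version of a file to obtain one patch pair.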

1 PAPER • 1 BENCHMARK
