SWE-bench is a dataset that tests systems’ ability to resolve GitHub issues automatically. It collects 2,294 issue-pull request pairs from 12 popular Python repositories. A proposed fix is evaluated by running the repository's unit tests, with the post-PR behavior serving as the reference solution.
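The unit-test check described above can be sketched roughly as follows. This is a minimal illustration, not the official evaluation harness: the function name `is_resolved`, the test names, and the dictionary of test outcomes are all hypothetical. It assumes each instance lists tests that the reference PR turns from failing to passing, plus tests that must keep passing.

```python
from typing import Dict, List


def is_resolved(
    test_status: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True if a candidate patch matches post-PR test behavior.

    test_status maps a test name to True (passed) or False (failed)
    after the candidate patch has been applied (hypothetical input).
    fail_to_pass lists tests that the reference PR fixes.
    pass_to_pass lists tests that must not regress.
    A test missing from test_status is treated as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name, False) for name in required)


# Hypothetical outcome of running the test suite on a patched repository.
status = {
    "test_parse_empty_header": True,   # previously failing, now fixed
    "test_parse_basic": True,          # unrelated test, still passing
}
print(is_resolved(status, ["test_parse_empty_header"], ["test_parse_basic"]))
```

Under these assumptions, an instance counts as solved only when every listed test passes, so a patch that fixes the issue but breaks an existing test is still scored as a failure.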
SWE-bench Lite is a subset of SWE-bench, curated to make evaluation less costly and more accessible. It comprises 300 instances sampled to be more self-contained, with a focus on functional bug fixes.