SWE-bench

Introduced by Jimenez et al. in SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

SWE-bench is a dataset for testing whether systems can resolve GitHub issues automatically. It collects 2,294 issue and pull request pairs from 12 popular Python repositories. Evaluation is by unit test verification: a proposed fix is run against the repository's tests, with the post-PR behavior serving as the reference solution.
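
As a rough illustration of what unit test verification against post-PR behavior could look like, the sketch below decides whether one task instance counts as resolved. It is a minimal sketch under stated assumptions, not the benchmark's actual harness: the function name `is_resolved`, the test names, and the split into tests that the reference PR fixes versus tests that must keep passing are all hypothetical choices made for this example.

```python
from typing import Dict, Iterable


def is_resolved(
    results: Dict[str, bool],
    fixed_by_pr: Iterable[str],
    must_keep_passing: Iterable[str],
) -> bool:
    """Return True if a candidate patch matches the post-PR test behavior.

    results: test name -> True if the test passed with the candidate patch applied.
    fixed_by_pr: tests that pass only after the reference PR (hypothetical grouping).
    must_keep_passing: tests that passed before the PR and should still pass.
    """
    required = list(fixed_by_pr) + list(must_keep_passing)
    # A test that is missing from the results is treated as a failure.
    return all(results.get(name, False) for name in required)


# Hypothetical test outcomes after applying a model-generated patch.
candidate_results = {
    "test_issue_regression": True,
    "test_existing_feature": True,
    "test_unrelated_edge_case": False,
}

resolved = is_resolved(
    candidate_results,
    fixed_by_pr=["test_issue_regression"],
    must_keep_passing=["test_existing_feature"],
)
print(resolved)
```

Treating a missing test result as a failure is a deliberately conservative choice here, so that a patch which prevents the test suite from running is not counted as a fix.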
