HumanEval

Introduced by Chen et al. in Evaluating Large Language Models Trained on Code

This is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code". It is used to measure functional correctness for synthesizing programs from docstrings. The dataset consists of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, some of which are comparable to simple software interview questions.
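
As a rough illustration, the following sketch shows how the official openai/human-eval harness is typically driven: sample completions for each problem's prompt, write them to a JSONL file, and score them with the harness's evaluation command. The `generate_one_completion` function is a placeholder for whatever model is being evaluated, and `num_samples_per_task` is an illustrative setting.

```python
from human_eval.data import read_problems, write_jsonl


def generate_one_completion(prompt: str) -> str:
    # Placeholder: call the model under evaluation with the problem's prompt
    # (function signature + docstring) and return the generated function body.
    raise NotImplementedError


# Each of the 164 problems provides a task_id, a prompt, a canonical solution,
# and hidden unit tests used to check functional correctness.
problems = read_problems()

num_samples_per_task = 1  # increase (e.g. to 200) to estimate pass@k for k > 1
samples = [
    dict(
        task_id=task_id,
        completion=generate_one_completion(problems[task_id]["prompt"]),
    )
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

# Then score the samples by executing the hidden tests:
#   $ evaluate_functional_correctness samples.jsonl
```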

Source: Evaluating Large Language Models Trained on Code

License


  • Unknown
