MBPP
64 papers with code • 1 benchmark • 0 datasets
Benchmarks
These leaderboards are used to track progress on MBPP.
| Trend | Dataset | Best Model | Paper | Code | Compare |
|---|---|---|---|---|---|
Most implemented papers
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+.
ReCode: Robustness Evaluation of Code Generation Models
Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation.
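A minimal illustration of the kind of prompt perturbation such a robustness benchmark relies on: apply a small, meaning-preserving edit to the problem description and compare pass rates on original versus perturbed prompts. The transformation below is a toy character swap, not one of the benchmark's actual perturbation families.

```python
import random

def perturb_prompt(prompt: str, seed: int = 0) -> str:
    """Toy robustness perturbation: swap two adjacent characters in one word
    of the prompt. Real robustness suites use many richer, curated
    transformations of docstrings, function names, and code syntax."""
    rng = random.Random(seed)
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return prompt
    idx = rng.choice(candidates)
    w = words[idx]
    i = rng.randrange(len(w) - 1)
    words[idx] = w[:i] + w[i + 1] + w[i] + w[i + 2:]
    return " ".join(words)
```

Robustness can then be summarized as the drop in pass rate between original and perturbed prompts.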
EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization
To address this issue, we propose EffiLearner, a self-optimization framework that utilizes execution overhead profiles to improve the efficiency of LLM-generated code.
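A rough sketch of the profile-then-revise idea under stated assumptions: `candidate_code` is the model's generated solution, `test_call` is a single invocation string, and the measured overhead is folded into a follow-up prompt asking the model to optimize its own code. These names are illustrative and do not correspond to the framework's actual API.

```python
import time
import tracemalloc

def profile_overhead(candidate_code: str, test_call: str) -> dict:
    """Measure runtime and peak memory of a generated solution on one call,
    e.g. test_call = "solve([1, 2, 3])" (hypothetical example input)."""
    namespace: dict = {}
    exec(candidate_code, namespace)      # define the generated function
    tracemalloc.start()
    start = time.perf_counter()
    exec(test_call, namespace)           # run it once on the sample input
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak}

def build_optimization_prompt(candidate_code: str, profile: dict) -> str:
    """Fold the overhead profile into a follow-up prompt asking the model
    to rewrite its own code more efficiently (the self-optimization step)."""
    return (
        f"The following solution took {profile['seconds']:.4f}s and "
        f"{profile['peak_bytes']} bytes peak memory.\n"
        "Rewrite it to reduce runtime and memory while preserving behavior:\n\n"
        f"{candidate_code}"
    )
```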
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL).
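A hedged sketch of how execution feedback can be turned into a scalar reward for RL fine-tuning of a code-generating policy; the reward magnitudes and the `run_tests` callable below are illustrative assumptions, not a verbatim reproduction of the paper's scheme.

```python
def unit_test_reward(code: str, run_tests) -> float:
    """Scalar reward from execution feedback for RL fine-tuning of a code LM.
    `run_tests` is a hypothetical callable that returns True when every unit
    test passes; the reward values below are illustrative."""
    try:
        compile(code, "<generated>", "exec")   # reject non-compilable programs
    except SyntaxError:
        return -1.0
    try:
        passed = run_tests(code)
    except Exception:
        return -0.6                            # runtime error during execution
    return 1.0 if passed else -0.3             # failing tests vs. full pass
```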
Underwater Object Tracker: UOSTrack for Marine Organism Grasping of Underwater Vehicles
The UOHT training paradigm is designed to train the sample-imbalanced underwater tracker so that the tracker is exposed to a large number of underwater-domain training samples and learns their feature representations.
Teaching Large Language Models to Self-Debug
In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language.
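A minimal sketch of such a self-debugging loop, assuming a hypothetical `model` callable that maps a prompt string to Python source: run the candidate together with its unit tests, and on failure feed the execution output back with a request to explain the code and fix the bug.

```python
import subprocess
import sys
import tempfile

def run_with_tests(solution: str, tests: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Execute the solution plus its unit tests in a subprocess and return
    (passed, captured output). Both arguments are plain Python source."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, proc.stdout + proc.stderr

def self_debug(model, problem: str, tests: str, max_turns: int = 3) -> str:
    """Ask the model to explain and revise its own code using only execution
    feedback, with no human hints (`model` is a hypothetical prompt -> code
    callable standing in for the LLM)."""
    code = model(problem)
    for _ in range(max_turns):
        passed, feedback = run_with_tests(code, tests)
        if passed:
            break
        code = model(
            f"{problem}\n\nYour previous solution:\n{code}\n\n"
            f"Execution feedback:\n{feedback}\n\n"
            "Explain line by line what the code does, identify the bug, "
            "and return a corrected solution."
        )
    return code
```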
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback
Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution, and is compatible out-of-the-box with traditional seq2seq coding methods, while enabling the development of new methods for interactive code generation.
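A small sketch of sandboxed execution in that spirit: launch each snippet in a throwaway Docker container with the network disabled and memory capped, so the host stays untouched and runs are reproducible. The image tag and resource limits are illustrative choices, not the framework's actual configuration.

```python
import subprocess

def run_in_docker(code: str, image: str = "python:3.11-slim", timeout: int = 30) -> str:
    """Execute a code snippet inside a disposable Docker container and return
    its output (stdout on success, stderr on failure)."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",        # no outbound network access
            "--memory", "512m",         # cap memory use
            image,
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr
```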
Code Llama: Open Foundation Models for Code
We release Code Llama, a family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks.
Clover: Closed-Loop Verifiable Code Generation
In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code.
Unsupervised Evaluation of Code LLMs with Round-Trip Correctness
To evaluate code large language models (LLMs), research has relied on a few small manually curated benchmarks, such as HumanEval and MBPP, which represent a narrow part of the real-world software domains.
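A compact sketch of the round-trip idea, assuming hypothetical `model_describe` (code to description) and `model_code` (description to code) callables and a `tests` predicate: a sample counts as round-trip correct when the regenerated program still passes the same checks as the original.

```python
def round_trip_correct(model_describe, model_code, original_code: str, tests) -> bool:
    """Round-trip check: summarize code into natural language, regenerate code
    from that summary, and test whether both versions pass the same unit tests.
    All callables here are hypothetical stand-ins for the forward and backward
    model calls and a test runner returning True on success."""
    description = model_describe(original_code)   # code -> NL summary
    regenerated = model_code(description)         # NL summary -> code
    return tests(original_code) and tests(regenerated)
```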