RewardBench is a benchmark designed to evaluate the capabilities and safety of reward models, including models trained with Direct Preference Optimization (DPO). It is presented as the first dedicated evaluation tool for reward models and gives a standardized picture of their performance and reliability¹.
Here are the key components of RewardBench:
Common Inference Code: The repository includes common inference code for various reward models, such as Starling, PairRM, OpenAssistant, and more. These models can be evaluated using the provided tools¹.
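As an illustration of what such inference looks like, here is a minimal sketch of scoring one response pair with a classifier-style reward model. The model ID (`OpenAssistant/reward-model-deberta-v3-large-v2`) and its question/answer input format are assumptions for the example, not code taken from the repository.

```python
# Minimal sketch: score a chosen vs. rejected response with a classifier-style
# reward model. The model ID and its (question, answer) input format are
# assumptions; substitute any sequence-classification reward model you use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

prompt = "How do I boil an egg?"
chosen = "Place the egg in boiling water for 7-9 minutes, then cool it under cold water."
rejected = "Eggs cannot be boiled."

def score(question: str, answer: str) -> float:
    """Return the scalar reward the model assigns to (question, answer)."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

# A good reward model should rank the chosen response above the rejected one.
print(score(prompt, chosen) > score(prompt, rejected))
```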
Dataset and Evaluation: The RewardBench dataset consists of prompt-win-lose trios spanning chat, reasoning, and safety scenarios. It allows benchmarking reward models on challenging, structured, and out-of-distribution queries. The goal is to enhance scientific understanding of reward models and their behavior².
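For a sense of the data layout, here is a sketch of loading and inspecting the trios. The dataset ID (`allenai/reward-bench`), the split name, and the column names are assumptions to verify against the Hugging Face Hub and the repository README.

```python
# Sketch: inspect the prompt-win-lose trios. Dataset ID, split name, and
# column names below are assumptions, not confirmed by the source text.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("allenai/reward-bench", split="filtered")  # split name is an assumption
print(ds.column_names)        # expected to include prompt / chosen / rejected / subset
print(Counter(ds["subset"]))  # number of trios per subset (chat, reasoning, safety, ...)

example = ds[0]
print(example["prompt"])
print("chosen:  ", example["chosen"][:80])
print("rejected:", example["rejected"][:80])
```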
Scripts for Evaluation (a conceptual sketch of such an evaluation loop follows this list):
- `scripts/run_rm.py`: used to evaluate individual reward models.
- `scripts/run_dpo.py`: used to evaluate Direct Preference Optimization (DPO) models.
- `scripts/train_rm.py`: a basic reward model training script built on TRL (Transformer Reinforcement Learning)¹.
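The sketch below shows, conceptually, what a `run_rm.py`-style evaluation computes: the fraction of trios in each subset where the reward model scores the chosen response above the rejected one. It is not the repository's implementation; it reuses the assumed `score()` helper and dataset from the earlier sketches.

```python
# Conceptual sketch of a reward-model evaluation loop (not the repo's code):
# per-subset accuracy = how often the model prefers the chosen response.
from collections import defaultdict

def evaluate(dataset, score_fn):
    """Return per-subset accuracy of the reward model on prompt-win-lose trios."""
    wins, totals = defaultdict(int), defaultdict(int)
    for row in dataset:
        subset = row["subset"]
        totals[subset] += 1
        if score_fn(row["prompt"], row["chosen"]) > score_fn(row["prompt"], row["rejected"]):
            wins[subset] += 1
    return {subset: wins[subset] / totals[subset] for subset in totals}

# Example (using ds and score from the sketches above):
# print(evaluate(ds, score))
```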
Installation and Usage: Clone the repository, install it with `pip install -e .`, and set the `HF_TOKEN` environment variable with your Hugging Face token (a setup sketch follows below).

Remember that RewardBench provides a standardized way to assess reward models, ensuring transparency and comparability across different approaches. 🌟🔍
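A minimal setup sketch, assuming you have already run `pip install -e .` in a clone of the repository. Reading the token from the `HF_TOKEN` environment variable and passing it to `huggingface_hub.login()` is one common pattern, not necessarily the repository's documented procedure.

```python
# Minimal setup sketch (assumed pattern): authenticate with the Hugging Face Hub
# using the HF_TOKEN environment variable so gated models/datasets can download.
import os
from huggingface_hub import login

token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
else:
    print("Set HF_TOKEN to your Hugging Face access token before running the scripts.")
```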
(1) GitHub - allenai/reward-bench: RewardBench: the first evaluation tool .... https://github.com/allenai/reward-bench
(2) RewardBench: Evaluating Reward Models for Language Modeling. https://arxiv.org/abs/2403.13787
(3) RewardBench: Evaluating Reward Models for Language Modeling. https://paperswithcode.com/paper/rewardbench-evaluating-reward-models-for