Instruction-Following Evaluation for Large Language Models

14 Nov 2023 · Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
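
Each verifiable instruction can be checked by a deterministic program rather than a human or LLM judge, which is what makes the benchmark cheap and reproducible. As a minimal illustration, the two example instructions from the abstract could be verified in Python roughly as follows; the function names and the whitespace-based word counting are illustrative assumptions, not the actual implementation in the linked repository.

    import re

    def check_min_word_count(response: str, min_words: int = 400) -> bool:
        # Illustrative check for "write in more than 400 words".
        # Splitting on whitespace is an assumption; the official code
        # may count words differently.
        return len(response.split()) > min_words

    def check_keyword_frequency(response: str, keyword: str = "AI",
                                min_count: int = 3) -> bool:
        # Illustrative check for "mention the keyword of AI at least 3 times".
        # Word-boundary matching keeps "AI" from matching inside longer words.
        return len(re.findall(rf"\b{re.escape(keyword)}\b", response)) >= min_count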

Datasets


Introduced in the Paper:

IFEval

Results from the Paper


Task: Instruction Following    Dataset: IFEval

Model      Metric                        Value   Global Rank
GPT-4      Prompt-level strict-accuracy  76.89   #1
GPT-4      Inst-level strict-accuracy    83.57   #1
GPT-4      Prompt-level loose-accuracy   79.30   #1
GPT-4      Inst-level loose-accuracy     85.37   #1
PaLM 2 S   Prompt-level strict-accuracy  43.07   #2
PaLM 2 S   Inst-level strict-accuracy    55.76   #2
PaLM 2 S   Prompt-level loose-accuracy   46.95   #2
PaLM 2 S   Inst-level loose-accuracy     59.11   #2
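
The four metrics vary along two axes described in the paper. Prompt-level accuracy credits a prompt only if every verifiable instruction in it is followed, whereas Inst-level (instruction-level) accuracy scores each instruction independently across all prompts; strict accuracy checks the raw response, while loose accuracy also accepts a response that passes after simple transformations (e.g., stripping markdown markup or the first/last line) intended to reduce false negatives. A minimal sketch of the two aggregation levels, assuming per-instruction boolean verification results are already available:

    def prompt_level_accuracy(results: list[list[bool]]) -> float:
        # results[i][j] is True if instruction j of prompt i was followed.
        # A prompt counts only if all of its instructions are followed.
        return sum(all(r) for r in results) / len(results)

    def instruction_level_accuracy(results: list[list[bool]]) -> float:
        # Every instruction is scored independently, pooled across prompts.
        flat = [ok for r in results for ok in r]
        return sum(flat) / len(flat)

    # Hypothetical example: two prompts, the first containing two instructions.
    results = [[True, False], [True]]
    print(prompt_level_accuracy(results))       # 0.5      (1 of 2 prompts)
    print(instruction_level_accuracy(results))  # 0.666... (2 of 3 instructions)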

Methods


No methods listed for this paper.