In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
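
As a minimal usage sketch (not from the paper itself): the released checkpoints are commonly loaded through the Hugging Face transformers library, assuming access to the gated meta-llama/Llama-2-7b-chat-hf repository has been granted under Meta's license.

```python
# Minimal sketch: loading Llama 2-Chat with Hugging Face transformers.
# Assumes `transformers`, `torch`, and `accelerate` are installed and that
# access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint was granted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # 13B/70B chat variants follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the 7B model on one GPU
    device_map="auto",
)

# Llama 2-Chat was fine-tuned on an instruction format: [INST] ... [/INST]
prompt = "[INST] What is the capital of France? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```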

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | BoolQ | LLaMA 2 70B (zero-shot) | Accuracy | 85 | # 12 |
| Question Answering | BoolQ | LLaMA 2 13B (zero-shot) | Accuracy | 81.7 | # 18 |
| Question Answering | BoolQ | LLaMA 2 7B (zero-shot) | Accuracy | 77.4 | # 22 |
| Question Answering | BoolQ | LLaMA 2 34B (zero-shot) | Accuracy | 83.7 | # 14 |
| Arithmetic Reasoning | GSM8K | LLaMA 2 70B (one-shot) | Accuracy | 56.8 | # 56 |
| | | | Parameters (Billions) | 70 | # 35 |
| Sentence Completion | HellaSwag | LLaMA 2 7B (zero-shot) | Accuracy | 77.2 | # 27 |
| Sentence Completion | HellaSwag | LLaMA 2 70B (zero-shot) | Accuracy | 85.3 | # 9 |
| Sentence Completion | HellaSwag | LLaMA 2 13B (zero-shot) | Accuracy | 80.7 | # 21 |
| Sentence Completion | HellaSwag | LLaMA 2 34B (zero-shot) | Accuracy | 83.3 | # 15 |
| Code Generation | HumanEval | LLaMA 2 (zero-shot) | Pass@1 | 29.9 | # 33 |
| Math Word Problem Solving | MAWPS | LLaMA 2-Chat | Accuracy (%) | 82.4 | # 14 |
| Multi-task Language Understanding | MMLU | LLaMA 2 34B (few-shot, k=5) | Average (%) | 62.6 | # 24 |
| Multi-task Language Understanding | MMLU | LLaMA 2 13B (few-shot, k=5) | Average (%) | 54.8 | # 32 |
| Multi-task Language Understanding | MMLU | LLaMA 2 7B (few-shot, k=5) | Average (%) | 45.3 | # 42 |
| Multi-task Language Understanding | MMLU | LLaMA 2 70B (few-shot, k=5) | Average (%) | 68.9 | # 18 |
| | | | Parameters (Billions) | 70 | # 33 |
| Multiple Choice Question Answering (MCQA) | MMLU (Professional medicine) | Llama2-7B-chat | Accuracy | 40.07 | # 6 |
| Multiple Choice Question Answering (MCQA) | MMLU (Professional medicine) | Llama2-7B | Accuracy | 43.38 | # 5 |
| Question Answering | Natural Questions | LLaMA 2 70B (one-shot) | EM | 33.0 | # 22 |
| Question Answering | PIQA | LLaMA 2 7B (zero-shot) | Accuracy | 78.8 | # 21 |
| Question Answering | PIQA | LLaMA 2 34B (zero-shot) | Accuracy | 81.9 | # 9 |
| Question Answering | PIQA | LLaMA 2 70B (zero-shot) | Accuracy | 82.8 | # 4 |
| Question Answering | PIQA | LLaMA 2 13B (zero-shot) | Accuracy | 80.5 | # 15 |
| Question Answering | PubChemQA | Llama2-7B-chat | BLEU-2 | 0.075 | # 2 |
| | | | BLEU-4 | 0.009 | # 2 |
| | | | ROUGE-1 | 0.184 | # 2 |
| | | | ROUGE-2 | 0.043 | # 2 |
| | | | ROUGE-L | 0.142 | # 2 |
| | | | METEOR | 0.149 | # 2 |
| Math Word Problem Solving | SVAMP | LLaMA 2-Chat | Execution Accuracy | 69.2 | # 4 |
| Question Answering | TriviaQA | LLaMA 2 70B (one-shot) | EM | 85 | # 2 |
| Question Answering | UniProtQA | Llama2-7B-chat | BLEU-2 | 0.019 | # 2 |
| | | | BLEU-4 | 0.002 | # 2 |
| | | | ROUGE-1 | 0.103 | # 2 |
| | | | ROUGE-2 | 0.060 | # 2 |
| | | | ROUGE-L | 0.009 | # 2 |
| | | | METEOR | 0.052 | # 2 |
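
For readers unfamiliar with the Pass@1 metric reported for HumanEval above: pass@k is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021), pass@k = 1 − C(n−c, k)/C(n, k), where n code samples are generated per problem and c of them pass the unit tests. A minimal sketch (the function name is illustrative, not from the Llama 2 codebase):

```python
# Sketch of the unbiased pass@k estimator from the HumanEval paper
# (Chen et al., 2021); the function name here is illustrative.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k given n samples per problem, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 10 samples per problem, 3 of which pass the unit tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3 -- reduces to c/n when k=1
```

For k=1 the estimator reduces to the fraction of samples that pass, averaged over problems, which is what the 29.9 Pass@1 figure above reports.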

Methods