We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

PDF | Abstract | arXiv 2023

Results from the Paper


Columns: Task, Dataset, Model, Metric, Value, Global Rank (shown as # n)
Common Sense Reasoning ARC (Challenge) LLaMA 13B (zero-shot) Accuracy 52.7 # 8
Common Sense Reasoning ARC (Challenge) LLaMA 65B (zero-shot) Accuracy 56.0 # 6
Common Sense Reasoning ARC (Challenge) LLaMA 33B (zero-shot) Accuracy 57.8 # 5
Common Sense Reasoning ARC (Challenge) LLaMA 7B (zero-shot) Accuracy 47.6 # 13
Common Sense Reasoning ARC (Easy) LLaMA 13B (zero-shot) Accuracy 74.8 # 5
Common Sense Reasoning ARC (Easy) LLaMA 33B (zero-shot) Accuracy 80.0 # 3
Common Sense Reasoning ARC (Easy) LLaMA 65B (zero-shot) Accuracy 78.9 # 4
Common Sense Reasoning ARC (Easy) LLaMA 7B (zero-shot) Accuracy 72.8 # 7
Question Answering BoolQ LLaMA 65B (zero-shot) Accuracy 85.3 # 8
Question Answering BoolQ LLaMA 7B (zero-shot) Accuracy 76.5 # 16
Question Answering BoolQ LLaMA 13B (zero-shot) Accuracy 78.1 # 15
Question Answering BoolQ LLaMA 33B (zero-shot) Accuracy 83.1 # 11
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Gender 70.6 # 4
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Religion 79.0 # 4
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Race/Color 57.0 # 1
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Sexual Orientation 81.0 # 4
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Age 70.1 # 4
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Nationality 64.2 # 4
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Disability 66.7 # 1
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Physical Appearance 77.8 # 4
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Socioeconomic Status 71.5 # 2
Stereotypical Bias Analysis CrowS-Pairs LLaMA 65B Overall 66.6 # 3
Arithmetic Reasoning GSM8K LLaMA 33B Accuracy 35.6 # 29
Arithmetic Reasoning GSM8K LLaMA 33B Parameters (Billions) 33 # 29
Arithmetic Reasoning GSM8K LLaMA 13B-maj1@k Accuracy 29.3 # 32
Arithmetic Reasoning GSM8K LLaMA 13B-maj1@k Parameters (Billions) 13 # 31
Arithmetic Reasoning GSM8K LLaMA 13B Accuracy 17.8 # 37
Arithmetic Reasoning GSM8K LLaMA 13B Parameters (Billions) 13 # 31
Arithmetic Reasoning GSM8K LLaMA 33B-maj1@k Accuracy 53.1 # 23
Arithmetic Reasoning GSM8K LLaMA 33B-maj1@k Parameters (Billions) 33 # 29
Arithmetic Reasoning GSM8K LLaMA 7B-maj1@k Accuracy 18.1 # 34
Arithmetic Reasoning GSM8K LLaMA 7B-maj1@k Parameters (Billions) 7 # 37
Arithmetic Reasoning GSM8K LLaMA 7B Accuracy 11.0 # 39
Arithmetic Reasoning GSM8K LLaMA 7B Parameters (Billions) 7 # 37
Arithmetic Reasoning GSM8K LLaMA 65B-maj1@k Accuracy 69.7 # 12
Arithmetic Reasoning GSM8K LLaMA 65B-maj1@k Parameters (Billions) 65 # 24
Arithmetic Reasoning GSM8K LLaMA 65B Accuracy 50.9 # 26
Arithmetic Reasoning GSM8K LLaMA 65B Parameters (Billions) 65 # 24
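The maj1@k rows above report majority voting: k reasoning samples are drawn per problem and the most frequent final answer is kept. A minimal sketch of that aggregation step (the sampled answers below are hypothetical, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among k samples (maj1@k)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical: 8 sampled final answers for one GSM8K problem.
samples = ["42", "42", "41", "42", "40", "42", "41", "42"]
print(majority_vote(samples))  # -> 42
```

With a clear majority the vote recovers the consensus answer even when individual samples disagree, which is why the maj1@k accuracies above are consistently higher than the single-sample ones.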
Sentence Completion HellaSwag LLaMA 13B (zero-shot) Accuracy 79.2 # 13
Sentence Completion HellaSwag LLaMA 33B (zero-shot) Accuracy 82.8 # 8
Sentence Completion HellaSwag LLaMA 65B (zero-shot) Accuracy 84.2 # 4
Sentence Completion HellaSwag LLaMA 7B (zero-shot) Accuracy 76.1 # 16
Code Generation HumanEval LLaMA 33B (zero-shot) Pass@1 21.7 # 12
Code Generation HumanEval LLaMA 33B (zero-shot) Pass@100 70.7 # 3
Code Generation HumanEval LLaMA 7B (zero-shot) Pass@1 10.5 # 18
Code Generation HumanEval LLaMA 7B (zero-shot) Pass@100 36.5 # 10
Code Generation HumanEval LLaMA 65B (zero-shot) Pass@1 23.7 # 9
Code Generation HumanEval LLaMA 65B (zero-shot) Pass@100 79.3 # 1
Code Generation HumanEval LLaMA 13B (zero-shot) Pass@1 15.8 # 15
Code Generation HumanEval LLaMA 13B (zero-shot) Pass@100 52.5 # 6
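Pass@1 and Pass@100 on HumanEval are conventionally computed with the unbiased estimator from the Codex evaluation protocol: with n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. A sketch (the sample counts below are illustrative, not the paper's settings):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n (of which c are correct), passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: some correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 200 samples for one problem, 10 of them pass the tests.
print(pass_at_k(200, 10, 1))    # -> 0.05 (reduces to c/n for k=1)
print(pass_at_k(200, 10, 100))  # much higher: 100 draws rarely all miss
```

The gap between Pass@1 and Pass@100 in the table reflects exactly this effect: many problems the model solves only occasionally still count toward Pass@100.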
Math Word Problem Solving MATH LLaMA 65B (maj1@k) Accuracy 20.5 # 7
Math Word Problem Solving MATH LLaMA 65B (maj1@k) Parameters (Billions) 65 # 10
Math Word Problem Solving MATH LLaMA 7B Accuracy 2.9 # 31
Math Word Problem Solving MATH LLaMA 7B Parameters (Billions) 7 # 22
Math Word Problem Solving MATH LLaMA 7B-maj1@k Accuracy 6.9 # 20
Math Word Problem Solving MATH LLaMA 7B-maj1@k Parameters (Billions) 7 # 22
Math Word Problem Solving MATH LLaMA 13B Accuracy 3.9 # 29
Math Word Problem Solving MATH LLaMA 13B Parameters (Billions) 13 # 17
Math Word Problem Solving MATH LLaMA 13B-maj1@k Accuracy 8.8 # 16
Math Word Problem Solving MATH LLaMA 13B-maj1@k Parameters (Billions) 13 # 17
Math Word Problem Solving MATH LLaMA 33B Accuracy 7.1 # 19
Math Word Problem Solving MATH LLaMA 33B Parameters (Billions) 33 # 13
Math Word Problem Solving MATH LLaMA 33B-maj1@k Accuracy 15.2 # 11
Math Word Problem Solving MATH LLaMA 33B-maj1@k Parameters (Billions) 33 # 13
Math Word Problem Solving MATH LLaMA 65B Accuracy 10.6 # 15
Math Word Problem Solving MATH LLaMA 65B Parameters (Billions) 65 # 10
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) Humanities 55.8 # 8
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) Average (%) 57.8 # 21
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) Parameters (Billions) 33 # 24
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) STEM 46.0 # 13
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) Social Sciences 66.7 # 8
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) Other 63.4 # 8
Multi-task Language Understanding MMLU LLaMA 33B (few-shot, k=5) Tokens (Billions) 1400 # 1
Multi-task Language Understanding MMLU LLaMA 13B (few-shot, k=5) Humanities 45.0 # 12
Multi-task Language Understanding MMLU LLaMA 13B (few-shot, k=5) Average (%) 46.9 # 31
Multi-task Language Understanding MMLU LLaMA 13B (few-shot, k=5) Parameters (Billions) 13 # 20
Multi-task Language Understanding MMLU LLaMA 13B (few-shot, k=5) STEM 35.8 # 20
Multi-task Language Understanding MMLU LLaMA 13B (few-shot, k=5) Social Sciences 53.8 # 12
Multi-task Language Understanding MMLU LLaMA 13B (few-shot, k=5) Other 53.3 # 11
Multi-task Language Understanding MMLU LLaMA 7B (few-shot, k=5) Humanities 34.0 # 15
Multi-task Language Understanding MMLU LLaMA 7B (few-shot, k=5) Average (%) 35.1 # 40
Multi-task Language Understanding MMLU LLaMA 7B (few-shot, k=5) Parameters (Billions) 7 # 11
Multi-task Language Understanding MMLU LLaMA 7B (few-shot, k=5) STEM 30.5 # 24
Multi-task Language Understanding MMLU LLaMA 7B (few-shot, k=5) Social Sciences 38.3 # 15
Multi-task Language Understanding MMLU LLaMA 7B (few-shot, k=5) Other 38.1 # 15
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) Humanities 61.8 # 7
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) Average (%) 63.4 # 17
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) Parameters (Billions) 65 # 30
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) STEM 51.7 # 10
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) Social Sciences 72.9 # 6
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) Other 67.4 # 6
Multi-task Language Understanding MMLU LLaMA 65B (few-shot, k=5) Tokens (Billions) 1400 # 1
Multi-task Language Understanding MMLU LLaMA 65B (fine-tuned) Average (%) 68.9 # 13
Multi-task Language Understanding MMLU LLaMA 65B (fine-tuned) Parameters (Billions) 65 # 30
Multi-task Language Understanding MMLU LLaMA 65B (fine-tuned) Tokens (Billions) 1400 # 1
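"Few-shot, k=5" in the MMLU rows above means the model sees five worked examples in its context before the test question (in-context learning, with no weight updates). A minimal sketch of how such a prompt is assembled; the Q:/A: template and the arithmetic examples are illustrative assumptions, not the paper's exact format:

```python
def build_kshot_prompt(examples, question, k=5):
    """Concatenate up to k worked (question, answer) pairs ahead of the
    test question, leaving the final answer for the model to complete."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    return f"{shots}\n\nQ: {question}\nA:"

# Hypothetical demonstration pairs.
demos = [("What is 2+2?", "4"), ("What is 3+5?", "8")]
print(build_kshot_prompt(demos, "What is 7+6?"))
```

Zero-shot rows elsewhere in the table correspond to the same setup with k=0, i.e. the bare question.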
Question Answering Natural Questions LLaMA 65B (one-shot) EM 31.0 # 22
Question Answering Natural Questions LLaMA 65B (few-shot, k=5) EM 35.0 # 20
Question Answering Natural Questions LLaMA 65B (few-shot, k=64) EM 39.9 # 17
Question Answering Natural Questions LLaMA 33B (zero-shot) EM 24.9 # 27
Question Answering OBQA LLaMA 33B (zero-shot) Accuracy 58.6 # 3
Question Answering OBQA LLaMA 13B (zero-shot) Accuracy 56.4 # 6
Question Answering OBQA LLaMA 7B (zero-shot) Accuracy 57.2 # 5
Question Answering OBQA LLaMA 65B (zero-shot) Accuracy 60.2 # 2
Question Answering PIQA LLaMA 65B (zero-shot) Accuracy 82.8 # 1
Question Answering PIQA LLaMA 13B (zero-shot) Accuracy 80.1 # 9
Question Answering PIQA LLaMA 33B (zero-shot) Accuracy 82.3 # 2
Question Answering PIQA LLaMA 7B (zero-shot) Accuracy 79.8 # 11
Reading Comprehension RACE LLaMA 33B (zero-shot) Accuracy (High) 48.3 # 9
Reading Comprehension RACE LLaMA 33B (zero-shot) Accuracy (Middle) 64.1 # 10
Reading Comprehension RACE LLaMA 65B (zero-shot) Accuracy (High) 51.6 # 7
Reading Comprehension RACE LLaMA 65B (zero-shot) Accuracy (Middle) 67.9 # 8
Reading Comprehension RACE LLaMA 7B (zero-shot) Accuracy (High) 46.9 # 12
Reading Comprehension RACE LLaMA 7B (zero-shot) Accuracy (Middle) 61.1 # 12
Reading Comprehension RACE LLaMA 13B (zero-shot) Accuracy (High) 47.2 # 11
Reading Comprehension RACE LLaMA 13B (zero-shot) Accuracy (Middle) 61.6 # 11
Question Answering SIQA LLaMA 33B (zero-shot) Accuracy 50.4 # 4
Question Answering SIQA LLaMA 13B (zero-shot) Accuracy 50.4 # 4
Question Answering SIQA LLaMA 7B (zero-shot) Accuracy 48.9 # 6
Question Answering SIQA LLaMA 65B (zero-shot) Accuracy 52.3 # 1
Question Answering TriviaQA LLaMA 65B (zero-shot) EM 68.2 # 16
Question Answering TriviaQA LLaMA 65B (few-shot, k=5) EM 72.6 # 9
Question Answering TriviaQA LLaMA 65B (few-shot, k=64) EM 73.0 # 8
Question Answering TriviaQA LLaMA 65B (one-shot) EM 71.6 # 12
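The EM (exact match) scores on Natural Questions and TriviaQA above conventionally normalize both prediction and gold answers before comparing, so that casing, articles, and punctuation do not cause spurious mismatches. A sketch of the standard SQuAD-style normalization (an assumption about the exact scoring procedure, which the leaderboard does not specify):

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference."""
    return float(any(normalize(prediction) == normalize(r) for r in references))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # -> 1.0
```

Under this metric, partially correct or paraphrased answers score 0, which is why EM is a strict lower bound on answer quality.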
Question Answering TruthfulQA LLaMA 13B % true 47 # 4
Question Answering TruthfulQA LLaMA 13B % info 41 # 7
Question Answering TruthfulQA LLaMA 65B % true 57 # 1
Question Answering TruthfulQA LLaMA 65B % info 53 # 5
Question Answering TruthfulQA LLaMA 33B % true 52 # 3
Question Answering TruthfulQA LLaMA 33B % info 48 # 6
Question Answering TruthfulQA LLaMA 7B % true 33 # 5
Question Answering TruthfulQA LLaMA 7B % info 29 # 8
Common Sense Reasoning WinoGrande LLaMA 7B (zero-shot) Accuracy 70.1 # 11
Common Sense Reasoning WinoGrande LLaMA 65B (zero-shot) Accuracy 77.0 # 4
Common Sense Reasoning WinoGrande LLaMA 33B (zero-shot) Accuracy 76.0 # 7
Common Sense Reasoning WinoGrande LLaMA 13B (zero-shot) Accuracy 73.0 # 9
