We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | LLaMA 33B (zero-shot) | Accuracy | 57.8 | # 15 |
| Common Sense Reasoning | ARC (Challenge) | LLaMA 65B (zero-shot) | Accuracy | 56.0 | # 16 |
| Common Sense Reasoning | ARC (Challenge) | LLaMA 7B (zero-shot) | Accuracy | 47.6 | # 26 |
| Common Sense Reasoning | ARC (Challenge) | LLaMA 13B (zero-shot) | Accuracy | 52.7 | # 19 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 33B (zero-shot) | Accuracy | 80.0 | # 8 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 65B (zero-shot) | Accuracy | 78.9 | # 11 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 7B (zero-shot) | Accuracy | 72.8 | # 17 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 13B (zero-shot) | Accuracy | 74.8 | # 14 |
| Question Answering | BoolQ | LLaMA 65B (zero-shot) | Accuracy | 85.3 | # 11 |
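The zero-shot accuracies above come from the standard multiple-choice protocol: score each candidate answer by the likelihood the model assigns to it given the question, then pick the highest-scoring choice. Below is a minimal sketch of that protocol for ARC (Challenge) using the Hugging Face `transformers` and `datasets` libraries; the checkpoint name is a placeholder, and the paper's exact prompt template and likelihood normalization may differ.

```python
# Sketch of likelihood-based zero-shot evaluation on ARC (Challenge).
# Assumptions: a Hugging Face causal-LM checkpoint (name below is a placeholder)
# and the "ai2_arc" dataset; the paper's exact prompting/normalization may differ.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def choice_logprob(question: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens given the question."""
    prompt_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so score only the answer span.
    start = prompt_ids.shape[1]
    logprobs = torch.log_softmax(logits[0, start - 1:-1], dim=-1)
    targets = input_ids[0, start:]
    return logprobs.gather(1, targets.unsqueeze(1)).sum().item()

dataset = load_dataset("ai2_arc", "ARC-Challenge", split="test")
correct = 0
for ex in dataset:
    scores = [choice_logprob(ex["question"], text) for text in ex["choices"]["text"]]
    predicted = ex["choices"]["label"][max(range(len(scores)), key=scores.__getitem__)]
    correct += int(predicted == ex["answerKey"])
print(f"Zero-shot accuracy: {correct / len(dataset):.3f}")
```

The same loop applies to ARC (Easy) and, with a yes/no answer pair, to BoolQ; only the dataset loading and prompt wording change.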