| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | LLaMA 13B (zero-shot) | Accuracy | 52.7 | #8 |
| Common Sense Reasoning | ARC (Challenge) | LLaMA 65B (zero-shot) | Accuracy | 56.0 | #6 |
| Common Sense Reasoning | ARC (Challenge) | LLaMA 33B (zero-shot) | Accuracy | 57.8 | #5 |
| Common Sense Reasoning | ARC (Challenge) | LLaMA 7B (zero-shot) | Accuracy | 47.6 | #13 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 13B (zero-shot) | Accuracy | 74.8 | #5 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 33B (zero-shot) | Accuracy | 80.0 | #3 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 65B (zero-shot) | Accuracy | 78.9 | #4 |
| Common Sense Reasoning | ARC (Easy) | LLaMA 7B (zero-shot) | Accuracy | 72.8 | #7 |
| Question Answering | BoolQ | LLaMA 65B (zero-shot) | Accuracy | 85.3 | #8 |
| Question Answering | BoolQ | LLaMA 7B (zero-shot) | Accuracy | 76.5 | #16 |
| Question Answering | BoolQ | LLaMA 13B (zero-shot) | Accuracy | 78.1 | #15 |
| Question Answering | BoolQ | LLaMA 33B (zero-shot) | Accuracy | 83.1 | #11 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Gender | 70.6 | #4 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Religion | 79.0 | #4 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Race/Color | 57.0 | #1 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Sexual Orientation | 81.0 | #4 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Age | 70.1 | #4 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Nationality | 64.2 | #4 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Disability | 66.7 | #1 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Physical Appearance | 77.8 | #4 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Socioeconomic Status | 71.5 | #2 |
| Stereotypical Bias Analysis | CrowS-Pairs | LLaMA 65B | Overall | 66.6 | #3 |
| Arithmetic Reasoning | GSM8K | LLaMA 33B | Accuracy | 35.6 | #29 |
| Arithmetic Reasoning | GSM8K | LLaMA 33B | Parameters (Billions) | 33 | #29 |
| Arithmetic Reasoning | GSM8K | LLaMA 13B-maj1@k | Accuracy | 29.3 | #32 |
| Arithmetic Reasoning | GSM8K | LLaMA 13B-maj1@k | Parameters (Billions) | 13 | #31 |
| Arithmetic Reasoning | GSM8K | LLaMA 13B | Accuracy | 17.8 | #37 |
| Arithmetic Reasoning | GSM8K | LLaMA 13B | Parameters (Billions) | 13 | #31 |
| Arithmetic Reasoning | GSM8K | LLaMA 33B-maj1@k | Accuracy | 53.1 | #23 |
| Arithmetic Reasoning | GSM8K | LLaMA 33B-maj1@k | Parameters (Billions) | 33 | #29 |
| Arithmetic Reasoning | GSM8K | LLaMA 7B-maj1@k | Accuracy | 18.1 | #34 |
| Arithmetic Reasoning | GSM8K | LLaMA 7B-maj1@k | Parameters (Billions) | 7 | #37 |
| Arithmetic Reasoning | GSM8K | LLaMA 7B | Accuracy | 11.0 | #39 |
| Arithmetic Reasoning | GSM8K | LLaMA 7B | Parameters (Billions) | 7 | #37 |
| Arithmetic Reasoning | GSM8K | LLaMA 65B-maj1@k | Accuracy | 69.7 | #12 |
| Arithmetic Reasoning | GSM8K | LLaMA 65B-maj1@k | Parameters (Billions) | 65 | #24 |
| Arithmetic Reasoning | GSM8K | LLaMA 65B | Accuracy | 50.9 | #26 |
| Arithmetic Reasoning | GSM8K | LLaMA 65B | Parameters (Billions) | 65 | #24 |
| Sentence Completion | HellaSwag | LLaMA 13B (zero-shot) | Accuracy | 79.2 | #13 |
| Sentence Completion | HellaSwag | LLaMA 33B (zero-shot) | Accuracy | 82.8 | #8 |
| Sentence Completion | HellaSwag | LLaMA 65B (zero-shot) | Accuracy | 84.2 | #4 |
| Sentence Completion | HellaSwag | LLaMA 7B (zero-shot) | Accuracy | 76.1 | #16 |
| Code Generation | HumanEval | LLaMA 33B (zero-shot) | Pass@1 | 21.7 | #12 |
| Code Generation | HumanEval | LLaMA 33B (zero-shot) | Pass@100 | 70.7 | #3 |
| Code Generation | HumanEval | LLaMA 7B (zero-shot) | Pass@1 | 10.5 | #18 |
| Code Generation | HumanEval | LLaMA 7B (zero-shot) | Pass@100 | 36.5 | #10 |
| Code Generation | HumanEval | LLaMA 65B (zero-shot) | Pass@1 | 23.7 | #9 |
| Code Generation | HumanEval | LLaMA 65B (zero-shot) | Pass@100 | 79.3 | #1 |
| Code Generation | HumanEval | LLaMA 13B (zero-shot) | Pass@1 | 15.8 | #15 |
| Code Generation | HumanEval | LLaMA 13B (zero-shot) | Pass@100 | 52.5 | #6 |
| Math Word Problem Solving | MATH | LLaMA 65B (maj1@k) | Accuracy | 20.5 | #7 |
| Math Word Problem Solving | MATH | LLaMA 65B (maj1@k) | Parameters (Billions) | 65 | #10 |
| Math Word Problem Solving | MATH | LLaMA 7B | Accuracy | 2.9 | #31 |
| Math Word Problem Solving | MATH | LLaMA 7B | Parameters (Billions) | 7 | #22 |
| Math Word Problem Solving | MATH | LLaMA 7B-maj1@k | Accuracy | 6.9 | #20 |
| Math Word Problem Solving | MATH | LLaMA 7B-maj1@k | Parameters (Billions) | 7 | #22 |
| Math Word Problem Solving | MATH | LLaMA 13B | Accuracy | 3.9 | #29 |
| Math Word Problem Solving | MATH | LLaMA 13B | Parameters (Billions) | 13 | #17 |
| Math Word Problem Solving | MATH | LLaMA 13B-maj1@k | Accuracy | 8.8 | #16 |
| Math Word Problem Solving | MATH | LLaMA 13B-maj1@k | Parameters (Billions) | 13 | #17 |
| Math Word Problem Solving | MATH | LLaMA 33B | Accuracy | 7.1 | #19 |
| Math Word Problem Solving | MATH | LLaMA 33B | Parameters (Billions) | 33 | #13 |
| Math Word Problem Solving | MATH | LLaMA 33B-maj1@k | Accuracy | 15.2 | #11 |
| Math Word Problem Solving | MATH | LLaMA 33B-maj1@k | Parameters (Billions) | 33 | #13 |
| Math Word Problem Solving | MATH | LLaMA 65B | Accuracy | 10.6 | #15 |
| Math Word Problem Solving | MATH | LLaMA 65B | Parameters (Billions) | 65 | #10 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | Humanities | 55.8 | #8 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | Average (%) | 57.8 | #21 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | Parameters (Billions) | 33 | #24 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | STEM | 46.0 | #13 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | Social Sciences | 66.7 | #8 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | Other | 63.4 | #8 |
| Multi-task Language Understanding | MMLU | LLaMA 33B (few-shot, k=5) | Tokens (Billions) | 1400 | #1 |
| Multi-task Language Understanding | MMLU | LLaMA 13B (few-shot, k=5) | Humanities | 45.0 | #12 |
| Multi-task Language Understanding | MMLU | LLaMA 13B (few-shot, k=5) | Average (%) | 46.9 | #31 |
| Multi-task Language Understanding | MMLU | LLaMA 13B (few-shot, k=5) | Parameters (Billions) | 13 | #20 |
| Multi-task Language Understanding | MMLU | LLaMA 13B (few-shot, k=5) | STEM | 35.8 | #20 |
| Multi-task Language Understanding | MMLU | LLaMA 13B (few-shot, k=5) | Social Sciences | 53.8 | #12 |
| Multi-task Language Understanding | MMLU | LLaMA 13B (few-shot, k=5) | Other | 53.3 | #11 |
| Multi-task Language Understanding | MMLU | LLaMA 7B (few-shot, k=5) | Humanities | 34.0 | #15 |
| Multi-task Language Understanding | MMLU | LLaMA 7B (few-shot, k=5) | Average (%) | 35.1 | #40 |
| Multi-task Language Understanding | MMLU | LLaMA 7B (few-shot, k=5) | Parameters (Billions) | 7 | #11 |
| Multi-task Language Understanding | MMLU | LLaMA 7B (few-shot, k=5) | STEM | 30.5 | #24 |
| Multi-task Language Understanding | MMLU | LLaMA 7B (few-shot, k=5) | Social Sciences | 38.3 | #15 |
| Multi-task Language Understanding | MMLU | LLaMA 7B (few-shot, k=5) | Other | 38.1 | #15 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | Humanities | 61.8 | #7 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | Average (%) | 63.4 | #17 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | Parameters (Billions) | 65 | #30 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | STEM | 51.7 | #10 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | Social Sciences | 72.9 | #6 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | Other | 67.4 | #6 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (few-shot, k=5) | Tokens (Billions) | 1400 | #1 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (fine-tuned) | Average (%) | 68.9 | #13 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (fine-tuned) | Parameters (Billions) | 65 | #30 |
| Multi-task Language Understanding | MMLU | LLaMA 65B (fine-tuned) | Tokens (Billions) | 1400 | #1 |
| Question Answering | Natural Questions | LLaMA 65B (one-shot) | EM | 31.0 | #22 |
| Question Answering | Natural Questions | LLaMA 65B (few-shot, k=5) | EM | 35.0 | #20 |
| Question Answering | Natural Questions | LLaMA 65B (few-shot, k=64) | EM | 39.9 | #17 |
| Question Answering | Natural Questions | LLaMA 33B (zero-shot) | EM | 24.9 | #27 |
| Question Answering | OBQA | LLaMA 33B (zero-shot) | Accuracy | 58.6 | #3 |
| Question Answering | OBQA | LLaMA 13B (zero-shot) | Accuracy | 56.4 | #6 |
| Question Answering | OBQA | LLaMA 7B (zero-shot) | Accuracy | 57.2 | #5 |
| Question Answering | OBQA | LLaMA 65B (zero-shot) | Accuracy | 60.2 | #2 |
| Question Answering | PIQA | LLaMA 65B (zero-shot) | Accuracy | 82.8 | #1 |
| Question Answering | PIQA | LLaMA 13B (zero-shot) | Accuracy | 80.1 | #9 |
| Question Answering | PIQA | LLaMA 33B (zero-shot) | Accuracy | 82.3 | #2 |
| Question Answering | PIQA | LLaMA 7B (zero-shot) | Accuracy | 79.8 | #11 |
| Reading Comprehension | RACE | LLaMA 33B (zero-shot) | Accuracy (High) | 48.3 | #9 |
| Reading Comprehension | RACE | LLaMA 33B (zero-shot) | Accuracy (Middle) | 64.1 | #10 |
| Reading Comprehension | RACE | LLaMA 65B (zero-shot) | Accuracy (High) | 51.6 | #7 |
| Reading Comprehension | RACE | LLaMA 65B (zero-shot) | Accuracy (Middle) | 67.9 | #8 |
| Reading Comprehension | RACE | LLaMA 7B (zero-shot) | Accuracy (High) | 46.9 | #12 |
| Reading Comprehension | RACE | LLaMA 7B (zero-shot) | Accuracy (Middle) | 61.1 | #12 |
| Reading Comprehension | RACE | LLaMA 13B (zero-shot) | Accuracy (High) | 47.2 | #11 |
| Reading Comprehension | RACE | LLaMA 13B (zero-shot) | Accuracy (Middle) | 61.6 | #11 |
| Question Answering | SIQA | LLaMA 33B (zero-shot) | Accuracy | 50.4 | #4 |
| Question Answering | SIQA | LLaMA 13B (zero-shot) | Accuracy | 50.4 | #4 |
| Question Answering | SIQA | LLaMA 7B (zero-shot) | Accuracy | 48.9 | #6 |
| Question Answering | SIQA | LLaMA 65B (zero-shot) | Accuracy | 52.3 | #1 |
| Question Answering | TriviaQA | LLaMA 65B (zero-shot) | EM | 68.2 | #16 |
| Question Answering | TriviaQA | LLaMA 65B (few-shot, k=5) | EM | 72.6 | #9 |
| Question Answering | TriviaQA | LLaMA 65B (few-shot, k=64) | EM | 73.0 | #8 |
| Question Answering | TriviaQA | LLaMA 65B (one-shot) | EM | 71.6 | #12 |
| Question Answering | TruthfulQA | LLaMA 13B | % true | 47 | #4 |
| Question Answering | TruthfulQA | LLaMA 13B | % info | 41 | #7 |
| Question Answering | TruthfulQA | LLaMA 65B | % true | 57 | #1 |
| Question Answering | TruthfulQA | LLaMA 65B | % info | 53 | #5 |
| Question Answering | TruthfulQA | LLaMA 33B | % true | 52 | #3 |
| Question Answering | TruthfulQA | LLaMA 33B | % info | 48 | #6 |
| Question Answering | TruthfulQA | LLaMA 7B | % true | 33 | #5 |
| Question Answering | TruthfulQA | LLaMA 7B | % info | 29 | #8 |
| Common Sense Reasoning | WinoGrande | LLaMA 7B (zero-shot) | Accuracy | 70.1 | #11 |
| Common Sense Reasoning | WinoGrande | LLaMA 65B (zero-shot) | Accuracy | 77.0 | #4 |
| Common Sense Reasoning | WinoGrande | LLaMA 33B (zero-shot) | Accuracy | 76.0 | #7 |
| Common Sense Reasoning | WinoGrande | LLaMA 13B (zero-shot) | Accuracy | 73.0 | #9 |