Large Language Models Can Self-Improve

20 Oct 2022 · Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han

Large Language Models (LLMs) have achieved excellent performance on a variety of tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by thinking on their own, without external input. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought (CoT) prompting and self-consistency, and fine-tune the LLM on those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth labels. Ablation studies show that fine-tuning on reasoning rationales is critical for self-improvement.
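The abstract describes a self-training loop: sample multiple chain-of-thought rationales per unlabeled question, majority-vote the parsed final answers (self-consistency), keep only questions where the vote is confident, and fine-tune on the rationales that agree with the majority answer. A minimal Python sketch of that loop follows. It is illustrative only, not the paper's released implementation: the model.generate API, the extract_answer parser, the sample count, the sampling temperature, and the confidence threshold are all assumptions.

    # Minimal sketch of the self-improvement data-generation step.
    # Assumptions (hypothetical, not from the paper's code): a generic
    # model.generate(prompt, temperature) API and a simple answer parser.
    from collections import Counter

    def extract_answer(rationale):
        # Hypothetical parser: take the text after the final
        # "The answer is" marker, a common CoT output convention.
        return rationale.rsplit("The answer is", 1)[-1].strip(" .\n")

    def build_self_training_set(model, cot_prompt, questions,
                                num_samples=32, threshold=0.7):
        """Build (question, rationale) fine-tuning pairs from
        unlabeled questions, with no ground-truth labels."""
        examples = []
        for question in questions:
            # 1. Sample multiple CoT rationales (temperature > 0
            #    so the reasoning paths differ).
            rationales = [model.generate(cot_prompt + question, temperature=0.7)
                          for _ in range(num_samples)]
            # 2. Majority-vote the parsed final answers (self-consistency).
            answers = [extract_answer(r) for r in rationales]
            majority_answer, votes = Counter(answers).most_common(1)[0]
            # 3. Keep only "high-confidence" questions.
            if votes / num_samples >= threshold:
                # 4. Use the rationales that reach the majority answer
                #    as fine-tuning targets for the same model.
                examples.extend((question, r)
                                for r, a in zip(rationales, answers)
                                if a == majority_answer)
        return examples

The returned pairs are then used as supervised fine-tuning data for the same model, which is what the results below refer to as "Self Improvement".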


Results from the Paper


Natural Language Inference, ANLI test
Model                                               A2 (Rank)    A3 (Rank)
PaLM 540B (Standard-Prompting)                      55.8 (#9)    55.8 (#9)
PaLM 540B (CoT Prompting)                           58.9 (#8)    60.6 (#7)
PaLM 540B (Self Consistency)                        64.5 (#5)    63.4 (#6)
PaLM 540B (Self Improvement, Standard-Prompting)    64.8 (#4)    66.9 (#5)
PaLM 540B (Self Improvement, CoT Prompting)         65.3 (#3)    67.3 (#3)
PaLM 540B (Self Improvement, Self Consistency)      66.5 (#2)    67.9 (#2)

Common Sense Reasoning, ARC (Challenge)
Model                                               Accuracy (Rank)
PaLM 540B (CoT Prompting)                           85.2 (#12)
PaLM 540B (Standard-Prompting)                      87.1 (#9)
PaLM 540B (Self Improvement, Standard-Prompting)    87.2 (#8)
PaLM 540B (Self Improvement, CoT Prompting)         88.3 (#7)
PaLM 540B (Self Consistency)                        88.7 (#6)
PaLM 540B (Self Improvement, Self Consistency)      89.8 (#5)

Question Answering, DROP
Model                                               Accuracy (Rank)
PaLM 540B (Standard-Prompting)                      60.0 (#6)
PaLM 540B (CoT Prompting)                           70.6 (#5)
PaLM 540B (Self Improvement, Standard-Prompting)    71.7 (#4)
PaLM 540B (Self Improvement, CoT Prompting)         76.2 (#3)
PaLM 540B (Self Consistency)                        78.2 (#2)
PaLM 540B (Self Improvement, Self Consistency)      83.0 (#1)

Arithmetic Reasoning, GSM8K
Model                                               Accuracy (Rank)
PaLM 540B (Standard-Prompting)                      17.9 (#141)
PaLM 540B (Self Improvement, Standard-Prompting)    32.2 (#134)
PaLM 540B (CoT Prompting)                           56.5 (#112)
PaLM 540B (Self Improvement, CoT Prompting)         73.5 (#86)
PaLM 540B (Self Consistency)                        74.4 (#78)
PaLM 540B (Self Improvement, Self Consistency)      82.1 (#55)
All GSM8K entries: Parameters (Billion) 540 (#111).

Question Answering, OpenBookQA
Model                                               Accuracy (Rank)
PaLM 540B (Standard-Prompting)                      84.4 (#15)
PaLM 540B (CoT Prompting)                           86.4 (#14)
PaLM 540B (Self Consistency)                        90.0 (#8)
PaLM 540B (Self Improvement, Standard-Prompting)    92.0 (#6)
PaLM 540B (Self Improvement, CoT Prompting)         93.0 (#5)
PaLM 540B (Self Improvement, Self Consistency)      94.4 (#3)
