Large Language Models Can Self-Improve

20 Oct 2022 · Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han

Large Language Models (LLMs) have achieved excellent performance on a variety of tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by thinking on their own, without external input. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought (CoT) prompting and self-consistency, and fine-tune the LLM on those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth labels. Ablation studies show that fine-tuning on reasoning rationales is critical for self-improvement.
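The abstract describes a self-training loop: sample multiple chain-of-thought rationales per unlabeled question, majority-vote the parsed final answers (self-consistency), keep only questions where the vote is confident, and fine-tune on the rationales that agree with the majority answer. A minimal Python sketch of that loop follows. It is illustrative only, not the paper's released implementation: the model.generate API, the extract_answer parser, the sample count, the sampling temperature, and the confidence threshold are all assumptions.

    # Minimal sketch of the self-improvement data-generation step.
    # Assumptions (hypothetical, not from the paper's code): a generic
    # model.generate(prompt, temperature) API and a simple answer parser.
    from collections import Counter

    def extract_answer(rationale):
        # Hypothetical parser: take the text after the final
        # "The answer is" marker, a common CoT output convention.
        return rationale.rsplit("The answer is", 1)[-1].strip(" .\n")

    def build_self_training_set(model, cot_prompt, questions,
                                num_samples=32, threshold=0.7):
        """Build (question, rationale) fine-tuning pairs from
        unlabeled questions, with no ground-truth labels."""
        examples = []
        for question in questions:
            # 1. Sample multiple CoT rationales (temperature > 0
            #    so the reasoning paths differ).
            rationales = [model.generate(cot_prompt + question, temperature=0.7)
                          for _ in range(num_samples)]
            # 2. Majority-vote the parsed final answers (self-consistency).
            answers = [extract_answer(r) for r in rationales]
            majority_answer, votes = Counter(answers).most_common(1)[0]
            # 3. Keep only "high-confidence" questions.
            if votes / num_samples >= threshold:
                # 4. Use the rationales that reach the majority answer
                #    as fine-tuning targets for the same model.
                examples.extend((question, r)
                                for r, a in zip(rationales, answers)
                                if a == majority_answer)
        return examples

The returned pairs are then used as supervised fine-tuning data for the same model, which is what the results below refer to as "Self Improvement".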


Results from the Paper


Natural Language Inference, ANLI test
Model                                               A2 (Rank)    A3 (Rank)
PaLM 540B (Standard-Prompting)                      55.8 (#9)    55.8 (#9)
PaLM 540B (CoT Prompting)                           58.9 (#8)    60.6 (#7)
PaLM 540B (Self Consistency)                        64.5 (#5)    63.4 (#6)
PaLM 540B (Self Improvement, Standard-Prompting)    64.8 (#4)    66.9 (#5)
PaLM 540B (Self Improvement, CoT Prompting)         65.3 (#3)    67.3 (#3)
PaLM 540B (Self Improvement, Self Consistency)      66.5 (#2)    67.9 (#2)

Common Sense Reasoning, ARC (Challenge)
Model                                               Accuracy (Rank)
PaLM 540B (CoT Prompting)                           85.2 (#12)
PaLM 540B (Standard-Prompting)                      87.1 (#9)
PaLM 540B (Self Improvement, Standard-Prompting)    87.2 (#8)
PaLM 540B (Self Improvement, CoT Prompting)         88.3 (#7)
PaLM 540B (Self Consistency)                        88.7 (#6)
PaLM 540B (Self Improvement, Self Consistency)      89.8 (#5)

Question Answering, DROP
Model                                               Accuracy (Rank)
PaLM 540B (Standard-Prompting)                      60.0 (#6)
PaLM 540B (CoT Prompting)                           70.6 (#5)
PaLM 540B (Self Improvement, Standard-Prompting)    71.7 (#4)
PaLM 540B (Self Improvement, CoT Prompting)         76.2 (#3)
PaLM 540B (Self Consistency)                        78.2 (#2)
PaLM 540B (Self Improvement, Self Consistency)      83.0 (#1)

Arithmetic Reasoning, GSM8K
Model                                               Accuracy (Rank)
PaLM 540B (Standard-Prompting)                      17.9 (#141)
PaLM 540B (Self Improvement, Standard-Prompting)    32.2 (#134)
PaLM 540B (CoT Prompting)                           56.5 (#112)
PaLM 540B (Self Improvement, CoT Prompting)         73.5 (#86)
PaLM 540B (Self Consistency)                        74.4 (#78)
PaLM 540B (Self Improvement, Self Consistency)      82.1 (#55)
All GSM8K entries: Parameters (Billion) 540 (#111).

Question Answering, OpenBookQA
Model                                               Accuracy (Rank)
PaLM 540B (Standard-Prompting)                      84.4 (#15)
PaLM 540B (CoT Prompting)                           86.4 (#14)
PaLM 540B (Self Consistency)                        90.0 (#8)
PaLM 540B (Self Improvement, Standard-Prompting)    92.0 (#6)
PaLM 540B (Self Improvement, CoT Prompting)         93.0 (#5)
PaLM 540B (Self Improvement, Self Consistency)      94.4 (#3)
