Large Language Models are Zero-Shot Reasoners

24 May 2022 · Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and are generally known as excellent few-shot learners with task-specific exemplars. Notably, chain-of-thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance on arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performance on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7% with the large InstructGPT model (text-davinci-002), and achieving similar magnitudes of improvement with another off-the-shelf large model, the 540B-parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting that high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal, strongest zero-shot baseline for these challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.
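
The method behind these numbers is a two-stage prompting pipeline: the trigger phrase first elicits a chain of thought, and a second prompt then extracts the final answer from that reasoning. Below is a minimal sketch of such a pipeline; `call_llm` is a hypothetical helper standing in for whatever completion API (e.g. text-davinci-002 or PaLM) is used, and the answer-extraction phrase is an assumption modeled on the arithmetic setting.

```python
# Minimal sketch of Zero-shot-CoT two-stage prompting (assumptions noted below).

def call_llm(prompt: str) -> str:
    """Hypothetical completion helper: send `prompt` to an LLM such as
    text-davinci-002 and return the generated continuation as a string."""
    raise NotImplementedError("plug in your own completion API call here")


def zero_shot_cot(question: str) -> str:
    # Stage 1: reasoning extraction -- the trigger phrase "Let's think
    # step by step" makes the model write out its chain of thought.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_llm(reasoning_prompt)

    # Stage 2: answer extraction -- feed the generated reasoning back and
    # ask for the final answer in a parseable form (this phrase is an
    # assumption for arithmetic tasks; other task types use other phrases).
    answer_prompt = (
        f"{reasoning_prompt}{reasoning}\n"
        "Therefore, the answer (arabic numerals) is"
    )
    return call_llm(answer_prompt).strip()
```

Note that no few-shot exemplars are supplied at either stage; the same two prompts are reused across tasks.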


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Arithmetic Reasoning | GSM8K | Text-davinci-002 175B (0-shot) | Accuracy | 10.4 | #148 |
| | | | Parameters (Billion) | 175 | #103 |
| Arithmetic Reasoning | GSM8K | Text-davinci-002 175B (0-shot, CoT) | Accuracy | 40.7 | #128 |
| | | | Parameters (Billion) | 175 | #103 |
| Arithmetic Reasoning | GSM8K | Text-davinci-002 175B (2-shot, CoT) | Accuracy | 41.3 | #126 |
| | | | Parameters (Billion) | 175 | #103 |
| Arithmetic Reasoning | GSM8K | PaLM 540B (few-shot) | Accuracy | 17.9 | #141 |
| | | | Parameters (Billion) | 540 | #111 |
| Arithmetic Reasoning | GSM8K | Text-davinci-002 175B (zero-plus-few-shot CoT, 8 samples) | Accuracy | 51.5 | #120 |
| | | | Parameters (Billion) | 175 | #103 |
| Arithmetic Reasoning | GSM8K | Finetuned GPT-3 175B + verifier | Accuracy | 55.0 | #115 |
| | | | Parameters (Billion) | 175 | #103 |
| Arithmetic Reasoning | GSM8K | PaLM 540B (few-shot, CoT) | Accuracy | 58.1 | #108 |
| | | | Parameters (Billion) | 540 | #111 |
| Arithmetic Reasoning | MultiArith | Text-davinci-002 175B (zero-shot) | Accuracy | 17.7 | #2 |
| Arithmetic Reasoning | MultiArith | Text-davinci-002 175B (zero-shot, CoT) | Accuracy | 78.7 | #1 |
| Common Sense Reasoning | ReCoRD | GPT-3 175B (one-shot) | F1 | 90.2 | #13 |
| Math Word Problem Solving | SVAMP | PaLM (zero-shot) | Execution Accuracy | 58.8 | #10 |
| Math Word Problem Solving | SVAMP | PaLM (zero-shot, CoT) | Execution Accuracy | 62.1 | #9 |