TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Arithmetic Reasoning	GSM8K	GPT-4 Code Interpreter (CSV, K=5)	Accuracy	97.0	# 1
Math Word Problem Solving	MATH	GPT-4-code model (CSV, w/ code, SC, k=16)	Accuracy	84.3	# 1
Math Word Problem Solving	MATH	GPT-4-code model (w/o code)	Accuracy	60.8	# 7
Math Word Problem Solving	MATH	GPT-4-code model (CSV, w/ code)	Accuracy	73.5	# 2
Math Word Problem Solving	MATH	GPT-4-code model (w/ code)	Accuracy	69.7	# 4

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/solving-challenging-math-word-problems-using/arithmetic-reasoning-on-gsm8k)](https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k?p=solving-challenging-math-word-problems-using)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/solving-challenging-math-word-problems-using/math-word-problem-solving-on-math)](https://paperswithcode.com/sota/math-word-problem-solving-on-math?p=solving-challenging-math-word-problems-using)`

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

15 Aug 2023 · Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li ·

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
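The abstract's point that verification states can inform majority voting can be illustrated with a minimal sketch. The weights, the function name `weighted_majority_vote`, and the sample data below are assumptions made for illustration and are not taken from the paper; the sketch only shows how down-weighting answers whose self-verification returned "False" can change the outcome of a vote.

```python
from collections import defaultdict

# Hypothetical weights (an assumption, not values from the paper):
# an answer whose self-verification state is "True" counts more than
# one whose state is "False".
STATE_WEIGHTS = {"True": 1.0, "False": 0.2}


def weighted_majority_vote(samples, weights=STATE_WEIGHTS):
    """Pick the answer with the largest total weight.

    samples: list of (answer, verification_state) pairs, one per
    sampled solution. Ties go to the answer seen first.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Illustrative data: three samples agree on "42" but failed
# verification; one sample answers "17" and passed verification.
samples = [("42", "False"), ("42", "False"), ("42", "False"), ("17", "True")]

verified_choice = weighted_majority_vote(samples)
plain_choice = weighted_majority_vote(samples, {"True": 1.0, "False": 1.0})
print(verified_choice, plain_choice)
```

With equal weights the function reduces to ordinary majority voting, so the same helper can be used to compare the two schemes on identical samples.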

Code

kipok/nemo-skills

100

Tasks

Arithmetic Reasoning

Math

Mathematical Reasoning

Math Word Problem Solving

Datasets

MMLU

GSM8K

MATH

Results from the Paper

Ranked #1 on Math Word Problem Solving on MATH

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • GPT-4 • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
