TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Natural Questions	TheoremQA	GPT-4 (PoT)	Accuracy	52.4	# 1
Natural Questions	TheoremQA	GPT-3.5-turbo (CoT)	Accuracy	30.2	# 5
Natural Questions	TheoremQA	Claude-instant (CoT)	Accuracy	23.6	# 9
Natural Questions	TheoremQA	PaLM-2-bison (CoT)	Accuracy	21.0	# 11
Natural Questions	TheoremQA	text-davinci-003	Accuracy	22.8	# 10
Natural Questions	TheoremQA	code-davinci-002	Accuracy	23.9	# 8
Natural Questions	TheoremQA	Claude-v1 (CoT)	Accuracy	24.9	# 7
Natural Questions	TheoremQA	Claude-v1 (PoT)	Accuracy	25.9	# 6
Natural Questions	TheoremQA	PaLM-2-unicorn (CoT)	Accuracy	31.8	# 4
Natural Questions	TheoremQA	GPT-3.5-turbo (PoT)	Accuracy	35.6	# 3
Natural Questions	TheoremQA	GPT-4 (CoT)	Accuracy	43.8	# 2
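
As a quick sanity check on the table above, the following sketch re-derives the global rank from the accuracy column and computes how much Program-of-Thoughts (PoT) adds over Chain-of-Thoughts (CoT) for models listed with both. The accuracy values are copied from the table; the helper names `rank_models` and `pot_gain` are illustrative assumptions and do not come from the paper or its repository.

```python
# Accuracy (%) per model, copied from the results table above.
results = {
    "GPT-4 (PoT)": 52.4,
    "GPT-3.5-turbo (CoT)": 30.2,
    "Claude-instant (CoT)": 23.6,
    "PaLM-2-bison (CoT)": 21.0,
    "text-davinci-003": 22.8,
    "code-davinci-002": 23.9,
    "Claude-v1 (CoT)": 24.9,
    "Claude-v1 (PoT)": 25.9,
    "PaLM-2-unicorn (CoT)": 31.8,
    "GPT-3.5-turbo (PoT)": 35.6,
    "GPT-4 (CoT)": 43.8,
}


def rank_models(scores):
    """Return (rank, model, accuracy) tuples, highest accuracy first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, name, acc) for rank, (name, acc) in enumerate(ordered, start=1)]


def pot_gain(scores):
    """Return PoT minus CoT accuracy for every model that has both entries."""
    gains = {}
    for name, acc in scores.items():
        if name.endswith(" (PoT)"):
            base = name[: -len(" (PoT)")]
            cot = scores.get(base + " (CoT)")
            if cot is not None:
                gains[base] = round(acc - cot, 1)
    return gains


ranking = rank_models(results)
gains = pot_gain(results)

for rank, name, acc in ranking:
    print(f"#{rank:<2} {name:<22} {acc:.1f}")
print(gains)
```

Under these numbers, PoT helps GPT-4 the most (8.6 points), GPT-3.5-turbo somewhat less (5.4), and Claude-v1 only marginally (1.0).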

TheoremQA: A Theorem-driven Question Answering dataset

21 May 2023 · Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math benchmarks such as GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4's ability to solve these problems is unparalleled, achieving an accuracy of 51% with Program-of-Thoughts prompting. All existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark for evaluating LLMs' ability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
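
To make the Program-of-Thoughts setting mentioned in the abstract more concrete, here is a minimal sketch of how such outputs could be scored: the model emits a short Python program, the program is executed, and its result is compared against the gold answer. This is not the evaluation code from the TheoremQA repository. The convention that the program stores its result in a variable named `ans`, the 1% relative tolerance, and the helper names `run_program`, `is_correct`, and `accuracy` are all assumptions made for illustration.

```python
import math


def run_program(program: str):
    """Execute a model-generated program and return its `ans` variable, or None.

    Assumption: the program stores its final answer in a variable called `ans`.
    """
    namespace = {}
    try:
        # exec is unsafe on untrusted code; a real harness would sandbox this.
        exec(program, namespace)
    except Exception:
        return None
    return namespace.get("ans")


def is_correct(prediction, gold, rel_tol=1e-2):
    """Compare a prediction with the gold answer (numeric answers use a tolerance)."""
    if prediction is None:
        return False
    # bool is a subclass of int in Python, so handle it before the numeric branch.
    if isinstance(gold, bool) or isinstance(prediction, bool):
        return prediction == gold
    if isinstance(gold, (int, float)) and isinstance(prediction, (int, float)):
        return math.isclose(prediction, gold, rel_tol=rel_tol, abs_tol=1e-9)
    return prediction == gold


def accuracy(programs, golds):
    """Percentage of programs whose executed answer matches the gold answer."""
    correct = sum(is_correct(run_program(p), g) for p, g in zip(programs, golds))
    return 100.0 * correct / len(golds)


# Toy inputs standing in for model outputs; they are not TheoremQA items.
toy_programs = [
    "import math\nans = math.comb(5, 2)",  # runs and returns 10
    "ans = 1 / 0",                          # raises, so it counts as wrong
]
toy_golds = [10, 3]
print(accuracy(toy_programs, toy_golds))
```

Treating a crashing program as a wrong answer is one possible design choice; a harness could also retry or fall back to a text answer, which this sketch does not attempt.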

Code

wenhuchen/theoremqa official

152

Tasks

Math

Question Answering

Datasets

Introduced in the Paper:

TheoremQA

Used in the Paper:

GSM8K ASDiv

Lila

Results from the Paper

Ranked #1 on Natural Questions on TheoremQA

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • GPT-4 • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
