GPT-4 Technical Report

Preprint 2023  ·  OpenAI

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
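The abstract's claim about predictable scaling refers to fitting performance trends on small models and extrapolating them to the full run. The report does not publish its data or exact functional form, so the numbers and the power-law form below are purely illustrative: a sketch of fitting loss as a power law in compute, L(C) = a·C^b, in log-log space and extrapolating roughly 1,000x beyond the largest fitted run.

```python
import numpy as np

# Hypothetical (compute, loss) pairs from small training runs.
# These values are made up for illustration; the report's actual
# measurements are not public.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs
loss = np.array([3.2, 2.8, 2.45, 2.15])       # final training loss

# Fit L(C) = a * C^b via linear regression in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

# Extrapolate three orders of magnitude beyond the largest fitted run.
predicted_loss = a * (1e24 ** b)
print(f"exponent b = {b:.4f}, predicted loss at 1e24 FLOPs = {predicted_loss:.3f}")
```

Because the fit is linear in log-log coordinates, even a handful of small runs pins down the trend line, which is what makes this kind of extrapolation cheap relative to the final training run.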

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Common Sense Reasoning | ARC (Challenge) | GPT-4 (few-shot, k=25) | Accuracy | 96.3 | # 1 |
| Common Sense Reasoning | ARC (Challenge) | GPT-3.5 (few-shot, k=25) | Accuracy | 85.2 | # 3 |
| Question Answering | DROP Test | GPT-4 (few-shot, k=3) | F1 | 80.9 | # 5 |
| Question Answering | DROP Test | GPT-3.5 (few-shot, k=3) | F1 | 64.1 | # 10 |
| Arithmetic Reasoning | GSM8K | GPT-4 (few-shot, k=5, CoT) | Accuracy | 92.0 | # 1 |
| Arithmetic Reasoning | GSM8K | GPT-3.5 (few-shot, k=5) | Accuracy | 57.1 | # 19 |
| Sentence Completion | HellaSwag | GPT-4 (few-shot, k=10) | Accuracy | 95.3 | # 1 |
| Sentence Completion | HellaSwag | GPT-3.5 (few-shot, k=10) | Accuracy | 85.5 | # 3 |
| Code Generation | HumanEval | GPT-4 (zero-shot) | Pass@1 | 67.0 | # 1 |
| Code Generation | HumanEval | GPT-3.5 (zero-shot) | Pass@1 | 48.1 | # 4 |
| Multi-task Language Understanding | MMLU | GPT-4 (few-shot, k=5) | Average (%) | 86.4 | # 1 |
| Multi-task Language Understanding | MMLU | GPT-3.5 (few-shot, k=5) | Average (%) | 70.0 | # 10 |
| Common Sense Reasoning | WinoGrande | GPT-4 (few-shot, k=5) | Accuracy | 87.5 | # 1 |
| Common Sense Reasoning | WinoGrande | GPT-3.5 (few-shot, k=5) | Accuracy | 81.6 | # 2 |
