GPT-3

Introduced by Brown et al. in Language Models are Few-Shot Learners

GPT-3 is an autoregressive transformer model with 175 billion parameters. It uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization. The one exception is that GPT-3 alternates dense and locally banded sparse attention patterns across the layers of the transformer, similar to the Sparse Transformer.
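The difference between a dense and a locally banded attention pattern can be illustrated with boolean masks. The following is a minimal sketch, not the GPT-3 implementation: the function names, the band width, and the choice of which layers are dense and which are banded are all assumptions made for illustration, since the description above does not specify them.

```python
def dense_causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def banded_causal_mask(seq_len, band_width):
    """Causal mask limited to the `band_width` most recent positions.

    `band_width` is a hypothetical parameter chosen for this sketch.
    """
    return [[0 <= i - j < band_width for j in range(seq_len)]
            for i in range(seq_len)]


def layer_masks(num_layers, seq_len, band_width):
    """Alternate dense and banded masks across layers.

    Assumption: even-indexed layers are dense, odd-indexed layers are banded.
    """
    return [
        dense_causal_mask(seq_len) if layer % 2 == 0
        else banded_causal_mask(seq_len, band_width)
        for layer in range(num_layers)
    ]


def show(mask):
    """Render a mask as text: 'x' where attention is allowed, '.' elsewhere."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    for layer, mask in enumerate(layer_masks(num_layers=2, seq_len=6, band_width=3)):
        print(f"layer {layer}")
        print(show(mask))
```

In the dense mask each position attends to every earlier position, so the cost grows quadratically with sequence length. In the banded mask each position attends to at most `band_width` positions, which is what makes the pattern sparse.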

Source: Language Models are Few-Shot Learners

Tasks

Task	Papers	Share
Language Modelling	82	10.79%
Large Language Model	49	6.45%
Question Answering	48	6.32%
Prompt Engineering	30	3.95%
Retrieval	30	3.95%
Code Generation	28	3.68%
In-Context Learning	28	3.68%
Sentence	23	3.03%
Benchmarking	18	2.37%

Components

Component	Type
Adam	Stochastic Optimization
Attention Dropout	Regularization
BPE	Subword Segmentation
Dense Connections	Feedforward Networks
Dropout	Regularization
Fixed Factorized Attention	Attention Patterns
GELU	Activation Functions
Layer Normalization	Normalization
Linear Warmup With Cosine Annealing	Learning Rate Schedules
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms
Softmax	Output Functions
Strided Attention	Attention Patterns
Weight Decay	Regularization
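
As an illustration of one listed component, linear warmup with cosine annealing can be sketched as a function of the training step. This is a generic form of the schedule under stated assumptions, not GPT-3's training configuration: the function name, its parameters, and all numeric values are hypothetical.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine annealing.

    All parameter names and defaults are assumptions made for this sketch.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine factor falls smoothly from 1 to 0 as progress goes from 0 to 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 5, 10, 60, 110):
        print(step, lr_at_step(step, peak_lr=1.0, warmup_steps=10, total_steps=110))
```

The rate rises linearly during warmup, reaches `peak_lr` at `warmup_steps`, and then decays along a half cosine toward `min_lr` by `total_steps`.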

Categories

Transformers

Language Models

Autoregressive Transformers