GPT-3 is an autoregressive transformer model with 175 billion parameters. It uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, with the exception that GPT-3 uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
Source: Language Models are Few-Shot Learners
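The alternating attention pattern can be pictured as a per-layer choice of attention mask: even layers use a full causal (dense) mask, odd layers restrict each token to a local band of recent positions. The sketch below is illustrative only; the window size, alternation order, and exact banding scheme are assumptions, not values specified in the quoted description.

```python
import numpy as np

def causal_dense_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: each position attends to all earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def causal_banded_mask(seq_len: int, window: int) -> np.ndarray:
    """Locally banded causal mask: each position attends only to the
    `window` most recent positions (including itself)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1): i + 1] = True
    return mask

def layer_masks(num_layers: int, seq_len: int, window: int):
    """Alternate dense and locally banded sparse attention across layers,
    in the spirit of the Sparse Transformer pattern GPT-3 refers to.
    (Which layers are dense vs. banded here is an illustrative assumption.)"""
    return [
        causal_dense_mask(seq_len) if layer % 2 == 0
        else causal_banded_mask(seq_len, window)
        for layer in range(num_layers)
    ]

# Example: 4 layers, sequence length 8, hypothetical local window of 3 tokens.
masks = layer_masks(num_layers=4, seq_len=8, window=3)
for i, m in enumerate(masks):
    kind = "dense" if i % 2 == 0 else "banded"
    print(f"layer {i} ({kind}):\n{m.astype(int)}\n")
```

The banded layers reduce attention cost from quadratic to roughly linear in sequence length per layer, while the interleaved dense layers preserve global context mixing.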
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 62 | 6.84% |
| Large Language Model | 51 | 5.63% |
| Language Modeling | 49 | 5.41% |
| Question Answering | 48 | 5.30% |
| RAG | 31 | 3.42% |
| Retrieval | 29 | 3.20% |
| In-Context Learning | 26 | 2.87% |
| Code Generation | 26 | 2.87% |
| Few-Shot Learning | 19 | 2.10% |