OPT is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. The models are trained with the AdamW optimizer and a weight decay of 0.1. They follow a linear learning-rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps for OPT-175B (or over 375M tokens for the smaller models), then decaying down to 10% of the maximum learning rate over 300B tokens. Batch sizes range from 0.5M to 4M tokens depending on model size and are kept constant throughout training.
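A minimal PyTorch sketch of this optimizer and schedule, assuming a generic stand-in model; the peak learning rate, total step count, and batch-size assumption are illustrative placeholders, not the paper's exact per-model configuration (which lives in the authors' metaseq code):

```python
import torch

# Illustrative sketch only, not the authors' training code.
model = torch.nn.Linear(512, 512)   # placeholder for the actual transformer
max_lr = 1.2e-4                     # assumed peak LR; per-model values are in the paper
warmup_steps = 2000                 # linear warmup over the first 2000 steps (OPT-175B)
total_steps = 150_000               # assumed: ~300B tokens at a ~2M-token batch
min_lr_ratio = 0.1                  # decay down to 10% of the peak LR

# AdamW with weight decay 0.1, as described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to 10% of it."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 1.0 - (1.0 - min_lr_ratio) * progress

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                # one LR update per optimizer step
```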
Source: OPT: Open Pre-trained Transformer Language Models
Task | Papers | Share |
---|---|---|
Language Modelling | 38 | 9.95% |
Quantization | 22 | 5.76% |
Large Language Model | 15 | 3.93% |
Question Answering | 13 | 3.40% |
In-Context Learning | 12 | 3.14% |
Text Generation | 9 | 2.36% |
Retrieval | 7 | 1.83% |
Translation | 7 | 1.83% |
Sentence | 6 | 1.57% |