OPT is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. The models are trained with the AdamW optimizer and a weight decay of 0.1. They follow a linear learning-rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps for OPT-175B, or over 375M tokens for the smaller models, and then decaying to 10% of the maximum learning rate over 300B tokens. Batch sizes range from 0.5M to 4M tokens depending on model size and are kept constant throughout training.
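As a rough illustration of that schedule, the sketch below pairs a PyTorch AdamW optimizer (weight decay 0.1) with a `LambdaLR` scheduler that warms up linearly and then decays to 10% of the peak learning rate. The peak LR, warmup length, and total step count are illustrative placeholders, not the exact OPT configuration, which varies by model size.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Illustrative values only; the actual peak LR, warmup length, and
# token budget depend on model size in the OPT paper.
MAX_LR = 1.2e-4          # peak learning rate (placeholder)
WARMUP_STEPS = 2000      # linear warmup (2000 steps for OPT-175B)
TOTAL_STEPS = 75_000     # ~300B tokens at a 4M-token batch
MIN_LR_FRAC = 0.1        # decay down to 10% of the peak LR

model = torch.nn.Linear(512, 512)  # stand-in for the transformer

optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup to 1.0, then linear decay to MIN_LR_FRAC."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 1.0 - (1.0 - MIN_LR_FRAC) * min(progress, 1.0)

scheduler = LambdaLR(optimizer, lr_lambda)

# Training-loop skeleton: step the scheduler once per optimizer update.
# for step in range(TOTAL_STEPS):
#     loss = model(batch).sum()   # placeholder forward/backward
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     scheduler.step()
```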
Source: OPT: Open Pre-trained Transformer Language Models
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 22 | 12.57% |
| Question Answering | 11 | 6.29% |
| Quantization | 7 | 4.00% |
| Large Language Model | 7 | 4.00% |
| Machine Translation | 5 | 2.86% |
| Text Generation | 4 | 2.29% |
| Object Detection | 4 | 2.29% |
| Instruction Following | 3 | 1.71% |
| Retrieval | 3 | 1.71% |