12 papers with code • 1 benchmark • 1 dataset
To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT.
Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times.
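As an illustration (not taken from any of the listed papers), here is a minimal NumPy sketch of a causal temporal convolution, the kind of building block such architectures use in place of recurrence; the function name and toy shapes are ours.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1D convolution: the output at step t depends only on
    inputs up to and including step t.

    x: (seq_len, d_in) input sequence
    w: (kernel_size, d_in, d_out) filter weights

    Every output position is independent of the others, so all steps
    can be computed in one parallel pass, with no recurrence.
    """
    k = w.shape[0]
    # Left-pad so the convolution never sees future positions.
    x_pad = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.stack([
        np.einsum('kd,kde->e', x_pad[t:t + k], w) for t in range(len(x))
    ])

x = np.random.randn(10, 4)    # toy sequence: 10 steps, 4 features
w = np.random.randn(3, 4, 8)  # kernel of width 3, 4 -> 8 channels
print(causal_conv1d(x, w).shape)  # (10, 8)
```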
We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task.
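The task format is simple: given a passage with its final word removed, predict that word. A minimal sketch of the resulting evaluation metric is below; `model_predict` is a hypothetical callable standing in for any model under test.

```python
def lambada_accuracy(model_predict, examples):
    """Fraction of passages whose final word the model predicts exactly.

    model_predict: hypothetical callable, context str -> predicted word str
    examples: list of (context, target_word) pairs
    """
    correct = sum(
        model_predict(context) == target for context, target in examples
    )
    return correct / len(examples)

# Toy usage with a made-up passage and a dummy predictor:
examples = [("He picked up the guitar and began to", "play")]
print(lambada_accuracy(lambda ctx: "play", examples))  # 1.0
```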
Attention is a commonly used mechanism in sequence processing, but its cost is quadratic, O(n^2), in the sequence length n, which prevents its application to long sequences.
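The quadratic cost is easy to see in standard scaled dot-product attention: the score matrix has one entry per pair of positions. A minimal NumPy sketch (ours, for illustration):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over a length-n sequence.

    q, k, v: (n, d) arrays. The scores matrix below is (n, n), which is
    the O(n^2) time and memory cost that limits very long sequences.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (n, d)

n, d = 512, 64
q = k = v = np.random.randn(n, d)
print(attention(q, k, v).shape)  # (512, 64); intermediate scores were 512 x 512
```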
To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate.
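One widely cited heuristic for doing this is the linear scaling rule (popularized by Goyal et al., 2017): grow the learning rate in proportion to the batch size. A sketch, with names of our choosing:

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling heuristic: learning rate grows with batch size.

    base_lr / base_batch come from a recipe tuned at a small batch;
    returns the learning rate to try at the larger batch.
    """
    return base_lr * batch / base_batch

# e.g. a recipe tuned at batch 256 with lr 0.1, scaled to batch 2048:
print(scaled_lr(0.1, 256, 2048))  # 0.8
```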