# LAMBADA

12 papers with code • 1 benchmark • 1 dataset

## Most implemented papers

# Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT.
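The model parallelism in question is intra-layer (tensor) parallelism: large weight matrices are partitioned across devices so each device computes a shard of the layer. A minimal single-process sketch of the idea, simulated with NumPy (the shapes and two-way split are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # activations: (batch, hidden)
w1 = rng.normal(size=(8, 16))      # first MLP weight
w2 = rng.normal(size=(16, 8))      # second MLP weight

# Serial reference: y = relu(x @ w1) @ w2
y_ref = np.maximum(x @ w1, 0) @ w2

# Two-way tensor parallelism: split w1 by columns, w2 by rows.
w1_a, w1_b = np.split(w1, 2, axis=1)   # each "device" holds half the columns
w2_a, w2_b = np.split(w2, 2, axis=0)   # and the matching half of the rows

# The elementwise nonlinearity commutes with the column split, so each
# shard is computed independently with no communication until the end.
partial_a = np.maximum(x @ w1_a, 0) @ w2_a
partial_b = np.maximum(x @ w1_b, 0) @ w2_b

# A single all-reduce (here just a sum) recovers the serial result.
y_parallel = partial_a + partial_b
assert np.allclose(y_ref, y_parallel)
```

The column-then-row split is what keeps communication to one reduction per MLP block rather than one per matrix multiply.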

# Universal Transformers

Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times.

# The LAMBADA dataset: Word prediction requiring a broad discourse context

We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task.
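Concretely, each LAMBADA example asks a model to predict the final word of a passage whose earlier sentences are needed to get it right, scored by exact match. A minimal sketch of that framing (the helper names are illustrative, not from the dataset's tooling):

```python
def split_lambada_example(passage: str):
    """Split a passage into (context, target word): the model must
    predict the final word from everything that precedes it."""
    words = passage.split()
    return " ".join(words[:-1]), words[-1]

def last_word_accuracy(predictions, targets):
    """The usual LAMBADA metric: exact match on the target word."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

context, target = split_lambada_example(
    "he looked around the empty room and whispered her name"
)
assert target == "name"
assert last_word_accuracy(["name", "door"], ["name", "name"]) == 0.5
```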

# Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences

Attention is a commonly used mechanism in sequence processing, but its O(n^2) time and memory complexity prevents its application to long sequences.
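The quadratic cost comes from the score matrix: every position attends to every other, so the intermediate is n × n. A minimal NumPy sketch of scaled dot-product attention that makes the O(n^2) term explicit:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention. The score matrix is (n, n),
    so time and memory grow quadratically with sequence length n."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n, n) — the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 512, 64
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(q, k, v)
assert out.shape == (n, d)
```

Doubling n quadruples the score matrix, which is exactly what sub-quadratic architectures like the Shuffle-Exchange networks below try to avoid.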

# Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
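The paper's finding is that model size and training tokens should scale roughly equally with compute, commonly summarized as on the order of 20 tokens per parameter under the standard C ≈ 6ND FLOPs approximation. A rough sketch of that rule of thumb (the constant and the closed form are an approximation, not the paper's fitted scaling laws):

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Approximate compute-optimal split using C ≈ 6 * N * D and the
    rule of thumb D ≈ 20 * N (tokens per parameter).
    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Chinchilla-scale budget of ~5.76e23 FLOPs lands near
# ~70B parameters and ~1.4T tokens under this approximation.
n_params, n_tokens = chinchilla_allocation(5.76e23)
assert 6e10 < n_params < 8e10
assert 1.3e12 < n_tokens < 1.5e12
```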

# Entity Tracking Improves Cloze-style Reading Comprehension

Reading comprehension tasks test the ability of models to process long-term context and remember salient information.

# Neural Shuffle-Exchange Networks -- Sequence Processing in O(n log n) Time

A key requirement in sequence-to-sequence processing is the modeling of long-range dependencies.
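The architecture's namesake is the perfect-shuffle permutation, which reroutes elements between local exchange operations; O(log n) shuffle/exchange layers, each doing O(n) work, give the O(n log n) total. A small sketch of the permutation itself (illustrative, not the network):

```python
def perfect_shuffle(seq):
    """Riffle the first half with the second half:
    [x0..x3, y0..y3] -> [x0, y0, x1, y1, x2, y2, x3, y3]."""
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out.extend([a, b])
    return out

# Repeated shuffles bring distant positions within reach of local
# (neighbor) exchange units in O(log n) rounds.
seq = list(range(8))
assert perfect_shuffle(seq) == [0, 4, 1, 5, 2, 6, 3, 7]
```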

# Not Enough Data? Deep Learning to the Rescue!

Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks.

# The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate.
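The proposed remedy is sequence length warmup: start training on short sequences and grow the length over the early steps, which stabilizes large-batch, high-learning-rate training. A minimal sketch of one such schedule (linear growth with a hardware-friendly rounding; the paper's exact schedule may differ):

```python
def seq_len_warmup(step, warmup_steps, min_len=64, max_len=2048, multiple=8):
    """Linearly grow the training sequence length from min_len to
    max_len over warmup_steps, rounded down to a multiple for
    efficient kernels. Illustrative schedule, not the paper's."""
    if step >= warmup_steps:
        return max_len
    length = min_len + (max_len - min_len) * step / warmup_steps
    return max(min_len, int(length) // multiple * multiple)

assert seq_len_warmup(0, 1000) == 64
assert seq_len_warmup(500, 1000) == 1056
assert seq_len_warmup(1000, 1000) == 2048
```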