# LAMBADA

12 papers with code • 1 benchmark • 1 dataset

## Most implemented papers

# Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT.
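The model parallelism in question is intra-layer (tensor) parallelism: large weight matrices are partitioned across devices so each device computes a shard of the layer. A minimal single-process sketch of the idea, simulated with NumPy (the shapes and two-way split are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # activations: (batch, hidden)
w1 = rng.normal(size=(8, 16))      # first MLP weight
w2 = rng.normal(size=(16, 8))      # second MLP weight

# Serial reference: y = relu(x @ w1) @ w2
y_ref = np.maximum(x @ w1, 0) @ w2

# Two-way tensor parallelism: split w1 by columns, w2 by rows.
w1_a, w1_b = np.split(w1, 2, axis=1)   # each "device" holds half the columns
w2_a, w2_b = np.split(w2, 2, axis=0)   # and the matching half of the rows

# The elementwise nonlinearity commutes with the column split, so each
# shard is computed independently with no communication until the end.
partial_a = np.maximum(x @ w1_a, 0) @ w2_a
partial_b = np.maximum(x @ w1_b, 0) @ w2_b

# A single all-reduce (here just a sum) recovers the serial result.
y_parallel = partial_a + partial_b
assert np.allclose(y_ref, y_parallel)
```

The column-then-row split is what keeps communication to one reduction per MLP block rather than one per matrix multiply.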

# Universal Transformers

Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times.

# The LAMBADA dataset: Word prediction requiring a broad discourse context

We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task.
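Concretely, each LAMBADA example asks a model to predict the final word of a passage whose earlier sentences are needed to get it right, scored by exact match. A minimal sketch of that framing (the helper names are illustrative, not from the dataset's tooling):

```python
def split_lambada_example(passage: str):
    """Split a passage into (context, target word): the model must
    predict the final word from everything that precedes it."""
    words = passage.split()
    return " ".join(words[:-1]), words[-1]

def last_word_accuracy(predictions, targets):
    """The usual LAMBADA metric: exact match on the target word."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

context, target = split_lambada_example(
    "he looked around the empty room and whispered her name"
)
assert target == "name"
assert last_word_accuracy(["name", "door"], ["name", "name"]) == 0.5
```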

# Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences

Attention is a commonly used mechanism in sequence processing, but its O(n^2) time and memory complexity prevents its application to long sequences.
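The quadratic cost comes from the score matrix: every position attends to every other, so the intermediate is n × n. A minimal NumPy sketch of scaled dot-product attention that makes the O(n^2) term explicit:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention. The score matrix is (n, n),
    so time and memory grow quadratically with sequence length n."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n, n) — the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 512, 64
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(q, k, v)
assert out.shape == (n, d)
```

Doubling n quadruples the score matrix, which is exactly what sub-quadratic architectures like the Shuffle-Exchange networks below try to avoid.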

# Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget.
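The paper's finding is that model size and training tokens should scale roughly equally with compute, commonly summarized as on the order of 20 tokens per parameter under the standard C ≈ 6ND FLOPs approximation. A rough sketch of that rule of thumb (the constant and the closed form are an approximation, not the paper's fitted scaling laws):

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Approximate compute-optimal split using C ≈ 6 * N * D and the
    rule of thumb D ≈ 20 * N (tokens per parameter).
    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Chinchilla-scale budget of ~5.76e23 FLOPs lands near
# ~70B parameters and ~1.4T tokens under this approximation.
n_params, n_tokens = chinchilla_allocation(5.76e23)
assert 6e10 < n_params < 8e10
assert 1.3e12 < n_tokens < 1.5e12
```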

# Entity Tracking Improves Cloze-style Reading Comprehension

Reading comprehension tasks test the ability of models to process long-term context and remember salient information.

# Neural Shuffle-Exchange Networks -- Sequence Processing in O(n log n) Time

A key requirement in sequence-to-sequence processing is the modeling of long-range dependencies.
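The architecture's namesake is the perfect-shuffle permutation, which reroutes elements between local exchange operations; O(log n) shuffle/exchange layers, each doing O(n) work, give the O(n log n) total. A small sketch of the permutation itself (illustrative, not the network):

```python
def perfect_shuffle(seq):
    """Riffle the first half with the second half:
    [x0..x3, y0..y3] -> [x0, y0, x1, y1, x2, y2, x3, y3]."""
    half = len(seq) // 2
    out = []
    for a, b in zip(seq[:half], seq[half:]):
        out.extend([a, b])
    return out

# Repeated shuffles bring distant positions within reach of local
# (neighbor) exchange units in O(log n) rounds.
seq = list(range(8))
assert perfect_shuffle(seq) == [0, 4, 1, 5, 2, 6, 3, 7]
```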

# Not Enough Data? Deep Learning to the Rescue!

Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks.

# The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate.
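The proposed remedy is sequence length warmup: start training on short sequences and grow the length over the early steps, which stabilizes large-batch, high-learning-rate training. A minimal sketch of one such schedule (linear growth with a hardware-friendly rounding; the paper's exact schedule may differ):

```python
def seq_len_warmup(step, warmup_steps, min_len=64, max_len=2048, multiple=8):
    """Linearly grow the training sequence length from min_len to
    max_len over warmup_steps, rounded down to a multiple for
    efficient kernels. Illustrative schedule, not the paper's."""
    if step >= warmup_steps:
        return max_len
    length = min_len + (max_len - min_len) * step / warmup_steps
    return max(min_len, int(length) // multiple * multiple)

assert seq_len_warmup(0, 1000) == 64
assert seq_len_warmup(500, 1000) == 1056
assert seq_len_warmup(1000, 1000) == 2048
```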