Language Modelling

4455 papers with code • 51 benchmarks • 157 datasets

Language Modeling is the task of predicting the next word or character in a document. This technique can be used to train language models that can further be applied to a wide range of natural language tasks like text generation, text classification, and question answering.

Historically, language modelling was done with N-gram language models (which still have niche uses), but since the 2010s neural language models took over, and starting from the 2020s SOTA was achieved exclusively with large language models (LLMs).

A model's language modeling capability is measured using cross-entropy and perplexity. Some datasets to evaluate language modeling are WikiText-103, One Billion Word, Text8, C4, The Pile, among others.

Some notable state-of-the-art language models include:

GPT-3
BERT

Check below for all state-of-the-art models.

Here are some additional readings to go deeper on the task:

Language Modeling - Lena Voita

( Image credit: Exploring the Limits of Language Modeling )

Benchmarks

Add a Result

These leaderboards are used to track progress in Language Modelling

Dataset	Best Model	Compare
WikiText-103	RETRO (7.5B)	See all
Penn Treebank (Word Level)	GPT-3 (Zero-Shot)	See all
enwik8	GPT-2 (48 layers, h=1600)	See all
WikiText-2	SparseGPT (175B, 50% Sparsity)	See all
LAMBADA	PaLM-540B (Few-Shot)	See all
Text8	GPT-2	See all
One Billion Word	OmniNetT (Large)	See all
The Pile	GLM-130B	See all
Penn Treebank (Character Level)	Mogrifier LSTM + dynamic eval	See all
Hutter Prize	Transformer-XL + RMS dynamic eval	See all
C4	Primer	See all
Wiki-40B	FLASH-Quad-8k	See all
BIG-bench-lite	GLM-130B (3-shot)	See all
FewCLUE (EPRSTMT)	GLM-130B	See all
FewCLUE (OCNLI-FC)	GLM-130B	See all
FewCLUE (BUSTM)	GLM-130B	See all
FewCLUE (CHID-FC)	GLM-130B	See all
FewCLUE (CLUEWSC-FC)	GLM-130B	See all
CLUE (C3)	GLM-130B	See all
CLUE (WSC1.1)	GLM-130B	See all
CLUE (CMNLI)	GLM-130B	See all
CLUE (DRCD)	GLM-130B	See all
CLUE (OCNLI_50K)	GLM-130B	See all
CLUE (AFQMC)	GLM-130B	See all
CLUE (CMRC2018)	GLM-130B	See all
VietMed	Hybrid 4-gram VietMed-Train + ExtraText	See all
enwiki8	PAR Transformer 24B	See all
PTB Diagnostic ECG Database	I-DARTS	See all
Text8 dev	Transformer-LS (small)	See all
enwik8 dev	Transformer-LS (small)	See all
PubMed Cognitive Control Abstracts	Gopher	See all
DM Mathematics	Gopher	See all
Ubuntu IRC	Gopher	See all
OpenSubtitles	Gopher	See all
OpenWebtext2	Gopher	See all
HackerNews	Gopher	See all
Books3	Gopher	See all
Bookcorpus2	Gopher	See all
Pile CC	Gopher	See all
PhilPapers	Gopher	See all
Gutenberg PG-19	Gopher	See all
Arxiv HEP-TH citation graph	Gopher	See all
StackExchange	Gopher	See all
NIH ExPorter	Gopher	See all
USPTO Backgrounds	Gopher	See all
PubMed Central	Gopher	See all
FreeLaw	Gopher	See all
Curation Corpus	Gopher	See all
GitHub	Gopher	See all
100 sleep nights of 8 caregivers	Gpt3	See all
language-modeling-recommendation	GPT2	See all

Show all 51 benchmarks

Collapse benchmarks

Libraries

Use these libraries to find Language Modelling models and implementations

huggingface/transformers

31 papers

124,889

faceonlive/ai-research

29 papers

144

microsoft/unilm

12 papers

18,315

pytorch/fairseq

10 papers

29,233

See all 15 libraries.

Datasets

Subtasks

Sentence Pair Modeling

Cross-Document Language Modeling

Controllable Language Modelling

Most implemented papers

Most implemented Social Latest No code

XLNet: Generalized Autoregressive Pretraining for Language Understanding

zihangdai/xlnet • • NeurIPS 2019

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling.

Paper
Code

Longformer: The Long-Document Transformer

allenai/longformer • • 10 Apr 2020

To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

Paper
Code

Recurrent Neural Network Regularization

wojzaremba/lstm • 8 Sep 2014

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.

Paper
Code

On the Variance of the Adaptive Learning Rate and Beyond

LiyuanLucasLiu/RAdam • • ICLR 2020

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam.

Paper
Code

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

EleutherAI/The-Pile • 31 Dec 2020

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.

Paper
Code

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

dmis-lab/biobert • • 25 Jan 2019

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows.

Paper
Code

Variational Autoencoders for Collaborative Filtering

dawenl/vae_cf • 16 Feb 2018

This non-linear probabilistic model enables us to go beyond the limited modeling capacity of linear factor models which still largely dominate collaborative filtering research. We introduce a generative model with multinomial likelihood and use Bayesian inference for parameter estimation.

Paper
Code

Generating Sentences from a Continuous Space

PaddlePaddle/PaddleNLP • • CONLL 2016

The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation.

Paper
Code

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

google-research/electra • • ICLR 2020

Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.

Paper
Code

Cross-lingual Language Model Pretraining

huggingface/transformers • • NeurIPS 2019

On unsupervised machine translation, we obtain 34. 3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU.

Paper
Code

Language Modelling

Benchmarks Add a Result

Libraries

Datasets

Subtasks

Most implemented papers

Content

Benchmarks

Add a Result