Language Modelling

4482 papers with code • 51 benchmarks • 157 datasets

Language Modeling is the task of predicting the next word or character in a document. This technique can be used to train language models that can further be applied to a wide range of natural language tasks like text generation, text classification, and question answering.

Historically, language modelling was done with N-gram language models (which still have niche uses), but since the 2010s neural language models took over, and starting from the 2020s SOTA was achieved exclusively with large language models (LLMs).

A model's language modeling capability is measured using cross-entropy and perplexity. Some datasets to evaluate language modeling are WikiText-103, One Billion Word, Text8, C4, The Pile, among others.

Some notable state-of-the-art language models include:

GPT-3
BERT

Check below for all state-of-the-art models.

Here are some additional readings to go deeper on the task:

Language Modeling - Lena Voita

( Image credit: Exploring the Limits of Language Modeling )

Benchmarks

Add a Result

These leaderboards are used to track progress in Language Modelling

Dataset	Best Model	Compare
WikiText-103	RETRO (7.5B)	See all
Penn Treebank (Word Level)	GPT-3 (Zero-Shot)	See all
enwik8	GPT-2 (48 layers, h=1600)	See all
WikiText-2	SparseGPT (175B, 50% Sparsity)	See all
LAMBADA	PaLM-540B (Few-Shot)	See all
Text8	GPT-2	See all
One Billion Word	OmniNetT (Large)	See all
The Pile	GLM-130B	See all
Penn Treebank (Character Level)	Mogrifier LSTM + dynamic eval	See all
Hutter Prize	Transformer-XL + RMS dynamic eval	See all
C4	Primer	See all
Wiki-40B	FLASH-Quad-8k	See all
BIG-bench-lite	GLM-130B (3-shot)	See all
FewCLUE (EPRSTMT)	GLM-130B	See all
FewCLUE (OCNLI-FC)	GLM-130B	See all
FewCLUE (BUSTM)	GLM-130B	See all
FewCLUE (CHID-FC)	GLM-130B	See all
FewCLUE (CLUEWSC-FC)	GLM-130B	See all
CLUE (C3)	GLM-130B	See all
CLUE (WSC1.1)	GLM-130B	See all
CLUE (CMNLI)	GLM-130B	See all
CLUE (DRCD)	GLM-130B	See all
CLUE (OCNLI_50K)	GLM-130B	See all
CLUE (AFQMC)	GLM-130B	See all
CLUE (CMRC2018)	GLM-130B	See all
VietMed	Hybrid 4-gram VietMed-Train + ExtraText	See all
enwiki8	PAR Transformer 24B	See all
PTB Diagnostic ECG Database	I-DARTS	See all
Text8 dev	Transformer-LS (small)	See all
enwik8 dev	Transformer-LS (small)	See all
PubMed Cognitive Control Abstracts	Gopher	See all
DM Mathematics	Gopher	See all
Ubuntu IRC	Gopher	See all
OpenSubtitles	Gopher	See all
OpenWebtext2	Gopher	See all
HackerNews	Gopher	See all
Books3	Gopher	See all
Bookcorpus2	Gopher	See all
Pile CC	Gopher	See all
PhilPapers	Gopher	See all
Gutenberg PG-19	Gopher	See all
Arxiv HEP-TH citation graph	Gopher	See all
StackExchange	Gopher	See all
NIH ExPorter	Gopher	See all
USPTO Backgrounds	Gopher	See all
PubMed Central	Gopher	See all
FreeLaw	Gopher	See all
Curation Corpus	Gopher	See all
GitHub	Gopher	See all
100 sleep nights of 8 caregivers	Gpt3	See all
language-modeling-recommendation	GPT2	See all

Show all 51 benchmarks

Collapse benchmarks

Libraries

Use these libraries to find Language Modelling models and implementations

huggingface/transformers

30 papers

125,059

faceonlive/ai-research

29 papers

152

microsoft/unilm

12 papers

18,335

pytorch/fairseq

10 papers

29,255

See all 15 libraries.

Datasets

Subtasks

Sentence Pair Modeling

Cross-Document Language Modeling

Controllable Language Modelling

Most implemented papers

Most implemented Social Latest No code

Semi-supervised Sequence Learning

tensorflow/models • • NeurIPS 2015

In our experiments, we find that long short term memory recurrent networks after being pretrained with the two approaches are more stable and generalize better.

162

Paper
Code

Universal Language Model Fine-tuning for Text Classification

fastai/fastai • • ACL 2018

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.

Paper
Code

RoBERTa: A Robustly Optimized BERT Pretraining Approach

pytorch/fairseq • • 26 Jul 2019

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging.

Paper
Code

Generating Sequences With Recurrent Neural Networks

karpathy/char-rnn • • 4 Aug 2013

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time.

Paper
Code

DARTS: Differentiable Architecture Search

quark0/darts • • ICLR 2019

This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner.

Paper
Code

Language Models are Few-Shot Learners

openai/gpt-3 • NeurIPS 2020

By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do.

Paper
Code

LoRA: Low-Rank Adaptation of Large Language Models

tatsu-lab/stanford_alpaca • • ICLR 2022

We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

Paper
Code

Regularizing and Optimizing LSTM Language Models

salesforce/awd-lstm-lm • • ICLR 2018

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering.

Paper
Code

Deep contextualized word representations

flairNLP/flair • • NAACL 2018

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e. g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i. e., to model polysemy).

Paper
Code

End-To-End Memory Networks

facebook/MemNN • • NeurIPS 2015

For the former our approach is competitive with Memory Networks, but with less supervision.

Paper
Code

Language Modelling

Benchmarks Add a Result

Libraries

Datasets

Subtasks

Most implemented papers

Content

Benchmarks

Add a Result