Language Modelling

4455 papers with code • 51 benchmarks • 157 datasets

Language Modeling is the task of predicting the next word or character in a document. Models trained on this objective can then be applied to a wide range of natural language tasks such as text generation, text classification, and question answering.

Historically, language modelling was done with N-gram language models (which still have niche uses). Neural language models took over in the 2010s, and since the 2020s state-of-the-art results have come almost exclusively from large language models (LLMs).
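
As a concrete illustration of the classical approach, the sketch below builds a tiny bigram (2-gram) model from a toy corpus and uses it to predict the next word. The corpus and smoothing constant are invented for the example.

```python
from collections import Counter, defaultdict

# Toy corpus; any tokenized text works here (this one is made up for illustration).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev, alpha=1.0):
    """P(next | prev) with add-alpha smoothing over the observed vocabulary."""
    vocab = set(corpus)
    counts = bigram_counts[prev]
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

probs = next_word_probs("the")
print(max(probs, key=probs.get))  # most likely word after "the"
```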

A model's language modeling capability is typically measured with cross-entropy and perplexity. Common evaluation datasets include WikiText-103, One Billion Word, Text8, C4, and The Pile, among others.
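
The two metrics are directly related: perplexity is the exponential of the average per-token cross-entropy. A minimal sketch, with the token probabilities invented for illustration:

```python
import math

# Probabilities the model assigned to each ground-truth token in a held-out text
# (values are invented for illustration).
token_probs = [0.20, 0.05, 0.50, 0.10, 0.30]

# Average cross-entropy per token, in nats.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponentiated cross-entropy.
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.2f}")
```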

Check below for all state-of-the-art models.

(Image credit: Exploring the Limits of Language Modeling)

Libraries

Use these libraries to find Language Modelling models and implementations. See all 15 libraries.

Most implemented papers

XLNet: Generalized Autoregressive Pretraining for Language Understanding

zihangdai/xlnet NeurIPS 2019

With the capability of modeling bidirectional contexts, denoising-autoencoding-based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling.
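
XLNet's alternative is permutation language modeling: the sequence likelihood is factorized in a random order, so each token is predicted from a mix of left and right context without corrupting the input. A rough data-level sketch of the idea (not the authors' implementation, which uses attention masks rather than explicit reordering):

```python
import random

def permutation_lm_targets(tokens, seed=0):
    """Yield (visible_context, target) pairs under a sampled factorization order.

    Each target is predicted from the tokens that precede it in the sampled
    order, which may lie on either side of it in the original sequence.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    for step, pos in enumerate(order):
        context_positions = sorted(order[:step])
        context = [tokens[i] for i in context_positions]
        yield context, tokens[pos]

for ctx, tgt in permutation_lm_targets(["the", "cat", "sat", "down"]):
    print(ctx, "->", tgt)
```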

Longformer: The Long-Document Transformer

allenai/longformer 10 Apr 2020

To address the quadratic cost of full self-attention, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
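
The linear scaling comes from restricting most attention to a fixed-size local window (plus a small number of global tokens in the full model). A simplified sketch of a sliding-window attention mask, with the window size chosen arbitrarily:

```python
def sliding_window_mask(seq_len, window=2):
    """True where token i may attend to token j: only within +/- `window` positions.

    Each row has at most 2 * window + 1 allowed positions, so the cost of
    attention grows linearly with seq_len instead of quadratically.
    """
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=8, window=2)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```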

Recurrent Neural Network Regularization

wojzaremba/lstm 8 Sep 2014

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
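
The technique applies dropout only to the non-recurrent connections (between stacked LSTM layers and on inputs and outputs), leaving the recurrent state transitions untouched. A hedged PyTorch sketch of that idea, with hyperparameters chosen arbitrarily:

```python
import torch
import torch.nn as nn

class RegularizedLSTMLM(nn.Module):
    """Word-level LSTM language model with dropout on non-recurrent connections only."""

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=256, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.input_drop = nn.Dropout(dropout)        # dropout on the embeddings
        # nn.LSTM's `dropout` acts between stacked layers, not on the recurrent state.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            dropout=dropout, batch_first=True)
        self.output_drop = nn.Dropout(dropout)       # dropout on the LSTM outputs
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.input_drop(self.embed(tokens))
        out, _ = self.lstm(x)
        return self.decoder(self.output_drop(out))   # next-token logits

logits = RegularizedLSTMLM()(torch.randint(0, 10000, (4, 35)))
print(logits.shape)  # (batch, seq_len, vocab_size)
```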

On the Variance of the Adaptive Learning Rate and Beyond

LiyuanLucasLiu/RAdam ICLR 2020

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam.
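
For context, the warmup heuristic the paper analyzes simply ramps the learning rate up from near zero over the first few thousand steps before any decay kicks in; RAdam removes the need for it by rectifying the variance of the adaptive term. A minimal sketch of linear warmup (step counts and rates are arbitrary):

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=4000):
    """Linearly ramp the learning rate from 0 to base_lr over warmup_steps."""
    scale = min(1.0, (step + 1) / warmup_steps)
    return base_lr * scale

for step in [0, 1000, 4000, 10000]:
    print(step, f"{warmup_lr(step):.6f}")
```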

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

EleutherAI/The-Pile 31 Dec 2020

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

dmis-lab/biobert 25 Jan 2019

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows.

Variational Autoencoders for Collaborative Filtering

dawenl/vae_cf 16 Feb 2018

We introduce a generative model with multinomial likelihood and use Bayesian inference for parameter estimation. This non-linear probabilistic model enables us to go beyond the limited modeling capacity of linear factor models, which still largely dominate collaborative filtering research.
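
The multinomial likelihood treats a user's interaction history as counts drawn from a single probability vector produced by the decoder. A small sketch of that log-likelihood term (the click vector and logits are invented for the example):

```python
import math

def multinomial_log_likelihood(clicks, logits):
    """log p(x | z) = sum_i x_i * log softmax(logits)_i for a click-count vector x."""
    m = max(logits)                                   # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    log_probs = [math.log(e / total) for e in exps]
    return sum(x * lp for x, lp in zip(clicks, log_probs))

clicks = [1, 0, 1, 1, 0]             # which items the user interacted with (invented)
logits = [2.0, -1.0, 0.5, 1.5, 0.0]  # decoder outputs for those items (invented)
print(multinomial_log_likelihood(clicks, logits))
```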

Generating Sentences from a Continuous Space

PaddlePaddle/PaddleNLP CONLL 2016

The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation.
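
Generating one word at a time means each token is sampled conditioned on everything produced so far, with no global plan for the sentence. A runnable sketch of that loop, using a fixed toy distribution as a stand-in for a trained RNNLM:

```python
import random

def next_token_probs(history):
    """Stand-in for an RNNLM step: return P(next token | history).

    A real model would condition on `history`; here a fixed toy distribution
    keeps the loop runnable.
    """
    vocab = ["the", "cat", "sat", ".", "<eos>"]
    return dict(zip(vocab, [0.3, 0.2, 0.2, 0.2, 0.1]))

def generate(max_len=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<bos>"]
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        words, weights = zip(*probs.items())
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens[1:]

print(" ".join(generate()))
```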

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

google-research/electra ICLR 2020

Instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.
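
In other words, the pretraining signal is a per-token binary label: original vs. replaced. A data-level sketch of how those labels arise (random replacement stands in for the small masked-LM generator ELECTRA actually uses):

```python
import random

def corrupt_and_label(tokens, vocab, replace_prob=0.15, seed=0):
    """Replace some tokens and return (corrupted_tokens, labels).

    labels[i] is 1 if position i was replaced, else 0 -- the discriminator's target.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice([w for w in vocab if w != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
print(corrupt_and_label(["the", "chef", "cooked", "the", "meal"], vocab))
```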

Cross-lingual Language Model Pretraining

huggingface/transformers NeurIPS 2019

On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU.