Masked Language Modeling
164 papers with code • 12 benchmarks • 3 datasets
Most implemented papers
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.
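As a rough illustration of the discriminative objective described above, the sketch below builds a corrupted input and the per-token labels a discriminator would be trained on. The function name `corrupt_for_rtd`, the toy vocabulary, and the uniform random sampler standing in for the generator network are all assumptions made for illustration and are not taken from the paper's code. Labeling a position as replaced only when the sampled token differs from the original is likewise a choice made in this sketch.

```python
import random


def corrupt_for_rtd(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a replaced-token-detection task.

    labels[i] is 1 if position i holds a token different from the original,
    and 0 otherwise. The "generator" here is only a uniform sampler over
    `vocab` (an assumption; a real setup would use a small masked LM).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [0] * len(tokens)
    for i, original in enumerate(tokens):
        if rng.random() < mask_prob:
            sampled = rng.choice(vocab)  # stand-in for a generator sample
            corrupted[i] = sampled
            # If the sampler happens to reproduce the original token,
            # the position is indistinguishable from an untouched one.
            labels[i] = int(sampled != original)
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["the", "chef", "cooked", "the", "meal"]
    toy_vocab = ["the", "chef", "cooked", "ate", "meal", "dog"]
    noisy, targets = corrupt_for_rtd(sentence, toy_vocab, mask_prob=0.5)
    print(noisy)
    print(targets)
```

Every position yields a training signal in this setup, not only the corrupted ones, since the label list covers the whole sequence.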
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
Symbolic Discovery of Optimization Algorithms
On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x.
UNITER: UNiversal Image-TExt Representation Learning
Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
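The difference between joint and conditional masking can be sketched in a few lines. In the sketch below, only one modality is masked per call while the other is passed through untouched. The function name `conditional_mask`, the `[MASK]` placeholder string, the use of `None` to stand for a masked image region, and the 15% default masking rate are assumptions made for illustration, not details taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked word


def conditional_mask(text_tokens, regions, mask_text, mask_prob=0.15, seed=0):
    """Mask exactly one modality and leave the other fully observed.

    mask_text=True  -> masked language modeling, conditioned on all regions
    mask_text=False -> masked region modeling, conditioned on all words
    A masked region is represented by None (an assumption; a real model
    would typically zero out the region's feature vector instead).
    """
    rng = random.Random(seed)

    def mask_sequence(seq, placeholder):
        return [placeholder if rng.random() < mask_prob else item for item in seq]

    if mask_text:
        return mask_sequence(text_tokens, MASK), list(regions)
    return list(text_tokens), mask_sequence(regions, None)


if __name__ == "__main__":
    words = ["a", "dog", "on", "a", "couch"]
    boxes = ["region_0", "region_1", "region_2"]
    print(conditional_mask(words, boxes, mask_text=True, mask_prob=0.5))
    print(conditional_mask(words, boxes, mask_text=False, mask_prob=0.5))
```

Joint random masking would instead apply `mask_sequence` to both lists in the same call, so a masked word could lose the very region it describes.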
On the Cross-lingual Transferability of Monolingual Representations
This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages giving rise to deep multilingual abstractions.
REALM: Retrieval-Augmented Language Model Pre-Training
Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering.
MPNet: Masked and Permuted Pre-training for Language Understanding
Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem.
Language-agnostic BERT Sentence Embedding
While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have yet to be explored.
RealFormer: Transformer Likes Residual Attention
Transformer is the backbone of modern NLP models.
Talking-Heads Attention
We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
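To make the two extra projections concrete, here is a minimal pure-Python sketch of attention with a linear mix across the heads dimension before and after the softmax. It assumes the same number of heads at every stage, omits the usual 1/sqrt(d) scaling and the input/output projections for brevity, and uses hypothetical names (`mix_heads`, `proj_logits`, `proj_weights`) that are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix_heads(per_head, proj):
    """Linear projection across the heads dimension.

    per_head[h][i][j] and proj[h][g] give
    out[g][i][j] = sum_h per_head[h][i][j] * proj[h][g].
    """
    n_in, n_out = len(per_head), len(proj[0])
    n_q, n_k = len(per_head[0]), len(per_head[0][0])
    return [[[sum(per_head[h][i][j] * proj[h][g] for h in range(n_in))
              for j in range(n_k)]
             for i in range(n_q)]
            for g in range(n_out)]


def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """q[h][i][d], k[h][j][d], v[h][j][d_v] -> out[h][i][d_v]."""
    # Per-head dot-product logits, as in ordinary multi-head attention.
    logits = [[[sum(a * b for a, b in zip(q_i, k_j)) for k_j in k_h]
               for q_i in q_h]
              for q_h, k_h in zip(q, k)]
    logits = mix_heads(logits, proj_logits)      # mix heads before softmax
    weights = [[softmax(row) for row in head] for head in logits]
    weights = mix_heads(weights, proj_weights)   # mix heads after softmax
    # Weighted sum of values, head by head.
    return [[[sum(w * v_j[d] for w, v_j in zip(row, v_h))
              for d in range(len(v_h[0]))]
             for row in head]
            for head, v_h in zip(weights, v)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    q = [[[1.0, 0.0]], [[0.0, 1.0]]]            # 2 heads, 1 query, d = 2
    k = [[[1.0, 0.0], [0.0, 1.0]]] * 2          # 2 heads, 2 keys
    v = [[[1.0], [3.0]], [[2.0], [6.0]]]        # 2 heads, 2 values, d_v = 1
    print(talking_heads_attention(q, k, v, identity, identity))
```

With identity matrices for both projections, the function reduces to ordinary (unscaled) multi-head attention, which is a convenient sanity check; the extra parameters are just the two small heads-by-heads matrices.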