BERT-based Masked Language Model

Last updated on Mar 15, 2021

Parameters: 131 Million
File Size: 464.07 MB
Training Data: Wikipedia, BookCorpus
Training Techniques: SGD
Architecture: BERT, Dropout, Layer Normalization, Linear Layer, Tanh
Learning Rate: 0.01
Epochs: 1


The MaskedLanguageModel embeds the input tokens (some of which are masked), contextualizes them, and then predicts targets for the masked tokens, computing a loss against the known targets.
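The loss step described above can be illustrated with a minimal sketch. It assumes the loss is a mean cross-entropy taken only over the masked positions, which is the usual choice for masked language modeling; the function name `masked_lm_loss` and the toy numbers are hypothetical and are not taken from the AllenNLP implementation.

```python
import math

def masked_lm_loss(logits, targets, mask_positions):
    """Mean cross-entropy over masked positions only (illustrative sketch).

    logits: one list of vocabulary scores per token position.
    targets: the known target vocabulary index for each position.
    mask_positions: indices of the positions that were masked.
    """
    if not mask_positions:
        raise ValueError("at least one masked position is required")
    total = 0.0
    for pos in mask_positions:
        scores = logits[pos]
        # Numerically stable log-sum-exp: subtract the max score first.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the known target at this position.
        total += log_norm - scores[targets[pos]]
    return total / len(mask_positions)

# Toy example: 3 positions, vocabulary of 4 entries; only position 1 is masked.
logits = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.5, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]
targets = [3, 0, 2]
loss = masked_lm_loss(logits, targets, mask_positions=[1])
print(round(loss, 4))
```

Unmasked positions contribute nothing to the loss here, so the model is only scored on the tokens it had to reconstruct.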

Explore the live Masked Language Modeling demo at AllenNLP.

How do I load this model?

from allennlp_models.pretrained import load_predictor
predictor = load_predictor("lm-masked-language-model")

Getting predictions

sentence = "I really like %s, especially %s."
preds = predictor.predict(sentence % ("[MASK]", "[MASK]"))

for pair in zip(*preds["words"]):
    print(sentence % pair)
# prints:
# I really like you, especially you.
# I really like him, especially now.
# I really like her, especially her.
# I really like them, especially him.
# I really like people, especially me.
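To see what `zip(*preds["words"])` does in the loop above, here is a small sketch that runs without downloading the model. The `preds` dictionary is a hand-written stand-in built from the sample output shown above; it assumes `"words"` holds one list of candidate words per `[MASK]`, and the helper `fill_masks` is a hypothetical name used only for illustration.

```python
# Hand-written stand-in for the predictor output (an assumption, not a real
# model response): one list of candidate words per [MASK] in the input.
preds = {
    "words": [
        ["you", "him", "her", "them", "people"],  # candidates for the first [MASK]
        ["you", "now", "her", "him", "me"],       # candidates for the second [MASK]
    ]
}

def fill_masks(template, words_per_mask):
    """Pair the k-th candidate of every mask and fill the template with it."""
    # zip(*lists) transposes "one list per mask" into "one tuple per rank".
    return [template % pair for pair in zip(*words_per_mask)]

sentences = fill_masks("I really like %s, especially %s.", preds["words"])
for line in sentences:
    print(line)
```

Each printed sentence combines the candidates at the same rank for both masks, so the pairs are not jointly scored by the model.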

You can also get predictions using the allennlp command-line interface:

echo '{"sentence": "I really like [MASK], especially [MASK]."}' | \
    allennlp predict -
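The command above pipes a single JSON object to the predictor. To prepare several inputs at once, one option is to write one JSON object per line, each with the same `"sentence"` key. The sketch below only builds that text; it assumes the CLI reads its input line by line, and the helper name `to_json_lines` and the second example sentence are made up for illustration.

```python
import json

def to_json_lines(sentences):
    """Serialize each sentence as one {"sentence": ...} JSON object per line."""
    return "\n".join(json.dumps({"sentence": s}) for s in sentences) + "\n"

payload = to_json_lines([
    "I really like [MASK], especially [MASK].",
    "The capital of France is [MASK].",  # hypothetical extra input
])
print(payload, end="")
```

The resulting text can be piped to the command in place of the `echo` output, or saved to a file first.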

How do I train this model?

To train this model, you can use the allennlp CLI tool with the configuration file bidirectional_language_model.jsonnet:

allennlp train bidirectional_language_model.jsonnet -s output_dir

See the AllenNLP Training and prediction guide for more details.


J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 2019.