Pre-Training with Whole Word Masking for Chinese BERT

19 Jun 2019 · Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu

Bidirectional Encoder Representations from Transformers (BERT) has shown remarkable improvements across various NLP tasks. Recently, an upgraded version of BERT was released with Whole Word Masking (WWM), which mitigates the drawback of masking only part of a WordPiece-tokenized word during BERT pre-training...
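The idea behind Whole Word Masking is that when any piece of a word is selected for masking, every piece of that word is masked together, so the model must predict the full word rather than recovering a fragment from its sibling pieces. The sketch below illustrates this for Chinese, where words are first obtained from a word segmenter (the paper uses a Chinese segmentation tool for this step) and each word typically spans several single-character tokens. This is a minimal illustration, not the authors' implementation; `whole_word_mask` and the hand-written segmentation are hypothetical.

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Illustrative Whole Word Masking: a word is either masked as a
    whole (all of its tokens become [MASK]) or left fully intact.

    words: list of words, each word a list of its tokens.
    Returns the flat token sequence after masking.
    """
    rng = random.Random(seed)
    masked = []
    for word_tokens in words:
        if rng.random() < mask_prob:
            # Mask every token of the chosen word, not just one piece.
            masked.extend([mask_token] * len(word_tokens))
        else:
            masked.extend(word_tokens)
    return masked

# Chinese text has no whitespace, so the word boundaries must come from
# a segmenter; here "使用语言模型" is segmented by hand as 使用 / 语言 / 模型,
# with one character per token.
words = [["使", "用"], ["语", "言"], ["模", "型"]]
print(whole_word_mask(words, mask_prob=0.5, seed=1))
```

Under character-level masking, 语 could be masked while 言 stays visible, making the prediction nearly trivial; whole-word masking removes that shortcut.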

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Chinese Sentence Pair Classification | BQ | RoBERTa-wwm-ext-large | F1 | 85.8 | 1 |
| Chinese Sentence Pair Classification | BQ Dev | RoBERTa-wwm-ext-large | F1 | 86.3 | 1 |
| Sentiment Analysis | ChnSentiCorp | RoBERTa-wwm-ext-large | F1 | 95.8 | 1 |
| Sentiment Analysis | ChnSentiCorp Dev | RoBERTa-wwm-ext-large | F1 | 95.8 | 1 |
| Chinese Reading Comprehension | CJRC | RoBERTa-wwm-ext-large | EM | 62.4 | 1 |
| Chinese Reading Comprehension | CJRC | RoBERTa-wwm-ext-large | F1 | 82.20 | 1 |
| Chinese Reading Comprehension | CJRC Dev | RoBERTa-wwm-ext-large | EM | 62.1 | 1 |
| Chinese Reading Comprehension | CJRC Dev | RoBERTa-wwm-ext-large | F1 | 82.4 | 1 |
| Chinese Reading Comprehension | CMRC 2018 (Simplified Chinese) | RoBERTa-wwm-ext-large | EM | 74.2 | 1 |
| Chinese Reading Comprehension | CMRC 2018 (Simplified Chinese) | RoBERTa-wwm-ext-large | F1 | 90.6 | 1 |
| Chinese Reading Comprehension | CMRC 2018 (Simplified Chinese) Challenge | RoBERTa-wwm-ext-large | EM | 31.5 | 1 |
| Chinese Reading Comprehension | CMRC 2018 (Simplified Chinese) Challenge | RoBERTa-wwm-ext-large | F1 | 60.1 | 1 |
| Chinese Reading Comprehension | CMRC 2018 (Simplified Chinese) Dev | RoBERTa-wwm-ext-large | EM | 68.5 | 2 |
| Chinese Reading Comprehension | CMRC 2018 (Simplified Chinese) Dev | RoBERTa-wwm-ext-large | F1 | 88.4 | 1 |
| Chinese Reading Comprehension | DRCD (Traditional Chinese) | RoBERTa-wwm-ext-large | EM | 89.6 | 1 |
| Chinese Reading Comprehension | DRCD (Traditional Chinese) | RoBERTa-wwm-ext-large | F1 | 94.5 | 1 |
| Chinese Reading Comprehension | DRCD (Traditional Chinese) Dev | RoBERTa-wwm-ext-large | EM | 89.6 | 2 |
| Chinese Reading Comprehension | DRCD (Traditional Chinese) Dev | RoBERTa-wwm-ext-large | F1 | 94.8 | 1 |
| Chinese Sentence Pair Classification | LCQMC | RoBERTa-wwm-ext-large | F1 | 87 | 3 |
| Chinese Sentence Pair Classification | LCQMC Dev | RoBERTa-wwm-ext-large | F1 | 90.4 | 1 |
| Chinese Document Classification | THUCNews | RoBERTa-wwm-ext-large | F1 | 97.8 | 1 |
| Chinese Document Classification | THUCNews Dev | RoBERTa-wwm-ext-large | F1 | 98.3 | 1 |
| Chinese Sentence Pair Classification | XNLI | RoBERTa-wwm-ext-large | F1 | 81.2 | 1 |
| Chinese Sentence Pair Classification | XNLI Dev | RoBERTa-wwm-ext-large | F1 | 82.1 | 1 |

Methods used in the Paper

| Method | Type |
|--------|------|
| Residual Connection | Skip Connections |
| Attention Dropout | Regularization |
| Linear Warmup With Linear Decay | Learning Rate Schedules |
| Weight Decay | Regularization |
| GELU | Activation Functions |
| Dense Connections | Feedforward Networks |
| Adam | Stochastic Optimization |
| Softmax | Output Functions |
| Dropout | Regularization |
| WordPiece | Subword Segmentation |
| Multi-Head Attention | Attention Modules |
| Layer Normalization | Normalization |
| Scaled Dot-Product Attention | Attention Mechanisms |
| BERT | Language Models |