SpanBERT: Improving Pre-training by Representing and Predicting Spans

We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.

Published in TACL 2020.
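The two pre-training changes are concrete enough to sketch in code. Below is a minimal, illustrative PyTorch sketch based on the paper's description: span lengths are drawn from a geometric distribution Geo(p = 0.2) clipped at 10 tokens (mean length about 3.8), and spans are masked until roughly 15% of the sequence is covered; a span boundary objective (SBO) head then predicts each masked token from the encoder states of the two tokens just outside the span plus a relative position embedding, via a 2-layer feed-forward network with GELU activations and LayerNorm. The names here (`sample_span_mask`, `SpanBoundaryHead`) are ours for illustration, not from the authors' released code.

```python
import torch
import torch.nn as nn


def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10):
    """Choose token positions to mask as contiguous spans.

    Span lengths follow Geo(p) clipped at max_span (mean ~3.8 in the
    paper); spans are added until ~15% of the sequence is masked.
    """
    target = max(1, int(seq_len * mask_budget))
    masked = set()
    geo = torch.distributions.Geometric(torch.tensor(p))
    while len(masked) < target:
        length = min(int(geo.sample().item()) + 1, max_span, seq_len)
        start = int(torch.randint(0, seq_len - length + 1, (1,)).item())
        masked.update(range(start, start + length))
    return sorted(masked)


class SpanBoundaryHead(nn.Module):
    """Span boundary objective (SBO): predict each token inside a masked
    span from the boundary states x_{s-1} and x_{e+1} and a relative
    position embedding, via a 2-layer GELU feed-forward net with LayerNorm."""

    def __init__(self, hidden_size, vocab_size, max_span=10):
        super().__init__()
        self.position = nn.Embedding(max_span, hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, left_state, right_state, rel_pos):
        # left_state, right_state: (n_masked, hidden) encoder outputs for the
        # tokens just outside each span; rel_pos: (n_masked,) offsets i - s.
        h = torch.cat([left_state, right_state, self.position(rel_pos)], dim=-1)
        return self.decoder(self.mlp(h))  # (n_masked, vocab_size) logits
```

During pre-training, the paper adds the SBO cross-entropy loss to the usual masked language modeling loss at each masked token.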

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Linguistic Acceptability | CoLA | SpanBERT | Accuracy | 64.3% | #11 |
| Question Answering | HotpotQA | SpanBERT | Joint F1 | 83 | #1 |
| Semantic Textual Similarity | MRPC | SpanBERT | Accuracy | 90.9% | #6 |
| Natural Language Inference | MultiNLI | SpanBERT | Matched | 88.1 | #11 |
| Question Answering | Natural Questions | SpanBERT | F1 | 82.5 | #1 |
| Question Answering | NewsQA | SpanBERT | F1 | 73.6 | #1 |
| Coreference Resolution | OntoNotes | SpanBERT | F1 | 79.6 | #1 |
| Natural Language Inference | QNLI | SpanBERT | Accuracy | 94.3% | #13 |
| Paraphrase Identification | Quora Question Pairs | SpanBERT | Accuracy | 89.5% | #6 |
| Paraphrase Identification | Quora Question Pairs | SpanBERT | F1 | 71.9 | #7 |
| Relation Extraction | Re-TACRED | SpanBERT | F1 | 85.3 | #2 |
| Natural Language Inference | RTE | SpanBERT | Accuracy | 79.0% | #14 |
| Open-Domain Question Answering | SearchQA | SpanBERT | F1 | 84.8 | #1 |
| Question Answering | SQuAD1.1 | SpanBERT (single model) | EM | 88.8 | #11 |
| Question Answering | SQuAD1.1 | SpanBERT (single model) | F1 | 94.6 | #9 |
| Question Answering | SQuAD2.0 | SpanBERT | EM | 85.7 | #112 |
| Question Answering | SQuAD2.0 | SpanBERT | F1 | 88.7 | #108 |
| Question Answering | SQuAD2.0 dev | SpanBERT | F1 | 86.8 | #6 |
| Sentiment Analysis | SST-2 Binary classification | SpanBERT | Accuracy | 94.8% | #20 |
| Semantic Textual Similarity | STS Benchmark | SpanBERT | Pearson Correlation | 0.899 | #15 |
| Relation Extraction | TACRED | SpanBERT-large | F1 | 70.8 | #10 |
| Question Answering | TriviaQA | SpanBERT | F1 | 83.6 | #1 |

Methods used in the Paper


| Method | Type |
|---|---|
| Residual Connection | Skip Connections |
| Attention Dropout | Regularization |
| Linear Warmup With Linear Decay | Learning Rate Schedules |
| Weight Decay | Regularization |
| GELU | Activation Functions |
| Dense Connections | Feedforward Networks |
| Adam | Stochastic Optimization |
| WordPiece | Subword Segmentation |
| Softmax | Output Functions |
| Dropout | Regularization |
| Multi-Head Attention | Attention Modules |
| Layer Normalization | Normalization |
| Scaled Dot-Product Attention | Attention Mechanisms |
| BERT | Language Models |
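Most of these methods are the standard Transformer building blocks inherited from BERT rather than contributions of this paper. As a reference point, here is a minimal sketch of scaled dot-product attention, which ties together several entries in the table (softmax as the output function, attention dropout as regularization, and the core of multi-head attention); this is the textbook formulation, not code from the paper.

```python
import math
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, dropout_p=0.1, training=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Attention dropout is applied to the softmax weights; multi-head
    attention runs this in parallel over several projected subspaces.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    weights = F.softmax(scores, dim=-1)                # attention weights
    weights = F.dropout(weights, p=dropout_p, training=training)
    return weights @ v                                 # (..., L_q, d_v)
```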