Optimal Subarchitecture Extraction For BERT

20 Oct 2020 · Adrian de Wynter, Daniel J. Perry

We extract an optimal subset of architectural parameters for the BERT architecture of Devlin et al. (2018) by applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as "Bort", is demonstrably smaller, with an effective size (that is, not counting the embedding layer) of $5.5\%$ of the original BERT-large architecture, and $16\%$ of its net size...
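The abstract's headline numbers rest on the notion of "effective size": the parameters outside the embedding layer, measured over a search space of architectural hyperparameters such as depth, attention heads, hidden size, and intermediate (FFN) size. The Python sketch below shows one plausible way to compute such a count; the counting conventions and the smaller candidate configuration are assumptions for illustration, not the paper's exact accounting, so the printed ratio will not match the reported $5.5\%$ exactly.

```python
def effective_size(D: int, A: int, H: int, I: int) -> int:
    """Approximate non-embedding parameter count of D transformer
    encoder layers with hidden size H and FFN size I.

    Note: under this convention the head count A does not change the
    count (per-head dimension is taken as H / A); the paper may use a
    different parameterization of attention width.
    """
    attention = 4 * (H * H + H)       # Q, K, V, and output projections, with biases
    ffn = (H * I + I) + (I * H + H)   # two dense layers with biases
    layer_norms = 2 * 2 * H           # two LayerNorms, each with gain and bias
    return D * (attention + ffn + layer_norms)

bert_large = effective_size(D=24, A=16, H=1024, I=4096)
# Hypothetical smaller point in the same search space, for illustration only:
candidate = effective_size(D=4, A=8, H=1024, I=768)
print(f"candidate / BERT-large effective size: {candidate / bert_large:.1%}")
```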


Methods used in the Paper


METHOD | TYPE
Dense Connections | Feedforward Networks
Multi-Head Attention | Attention Modules
Dropout | Regularization
GELU | Activation Functions
Linear Warmup With Linear Decay | Learning Rate Schedules
Attention Dropout | Regularization
Weight Decay | Regularization
Residual Connection | Skip Connections
Scaled Dot-Product Attention | Attention Mechanisms
WordPiece | Subword Segmentation
Adam | Stochastic Optimization
Softmax | Output Functions
Layer Normalization | Normalization
BERT | Language Models

Minimal sketches of two of these methods, scaled dot-product attention and the linear warmup with linear decay schedule, follow below.
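Several of the table's rows (Scaled Dot-Product Attention, Softmax, Attention Dropout) compose into a single operation. Below is a minimal NumPy sketch of that operation as commonly defined, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$; the function names and the toy shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax, the output function listed above.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, dropout_p=0.0, rng=None):
    # softmax(Q K^T / sqrt(d_k)) V, with optional attention dropout
    # applied to the attention weights (inverted-dropout scaling).
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k))
    if dropout_p > 0.0 and rng is not None:
        keep = rng.random(weights.shape) >= dropout_p
        weights = weights * keep / (1.0 - dropout_p)
    return weights @ V

# Toy usage: one head, sequence length 3, head dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V, dropout_p=0.1, rng=rng)
```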
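The schedule row (Linear Warmup With Linear Decay) pairs with the Adam optimizer also listed above: the learning rate rises linearly from zero to a peak, then decays linearly back to zero over training. A minimal sketch, with hypothetical step counts and peak learning rate:

```python
def linear_warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    # Linear rise from 0 to peak_lr over warmup_steps, then linear
    # decay back to 0 at total_steps.
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Example: 10,000 steps, 1,000 warmup, peak LR of 1e-4 (illustrative values).
lrs = [linear_warmup_linear_decay(s, 10_000, 1_000, 1e-4) for s in range(10_000)]
```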