Bort is a parametric architectural variant of the BERT architecture. It is obtained by extracting an optimal subset of architectural parameters for the BERT architecture through a neural architecture search approach; in particular, a fully polynomial-time approximation scheme (FPTAS). This optimal subset, “Bort”, is demonstrably smaller, having an effective size (that is, not counting the embedding layer) of $5.5\%$ of the original BERT-large architecture, and $16\%$ of the net size. Bort can also be pretrained in $288$ GPU hours, which is $1.2\%$ of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (RoBERTa), and about $33\%
Source: Optimal Subarchitecture Extraction For BERT