ALBERT

Introduced by Lan et al. in ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

ALBERT is a Transformer architecture based on BERT but with far fewer parameters. It achieves this through two parameter-reduction techniques. The first is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two smaller matrices, the size of the hidden layers is decoupled from the size of the vocabulary embeddings. This makes it easier to grow the hidden size without significantly increasing the number of parameters in the vocabulary embeddings. The second technique is cross-layer parameter sharing, which prevents the number of parameters from growing with the depth of the network.
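The effect of both techniques on parameter count can be sketched with some simple arithmetic. The snippet below is a rough illustration, not the authors' implementation: the function names are hypothetical, the sizes (vocabulary 30,000, hidden size 768, embedding size 128, 12 layers) are illustrative assumptions, and the per-layer estimate of 12·H² ignores biases and layer normalization.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count parameters in the token embedding table.

    With embedding_size=None this models a single V x H table.
    Otherwise the table is factorized into a V x E lookup followed
    by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(num_layers, params_per_layer, share_across_layers):
    """Count encoder parameters with or without cross-layer sharing."""
    if share_across_layers:
        # One set of layer weights is reused at every depth.
        return params_per_layer
    return num_layers * params_per_layer


# Illustrative sizes (assumptions, not taken from the text above).
V, H, E, L = 30000, 768, 128, 12

# Rough per-layer estimate: 4*H*H for attention projections plus
# 2*H*(4*H) for the feed-forward block, ignoring biases and layer norm.
per_layer = 12 * H * H

print("embeddings, unfactorized:", embedding_params(V, H))
print("embeddings, factorized:  ", embedding_params(V, H, E))
print("encoder, unshared:       ", encoder_params(L, per_layer, False))
print("encoder, shared:         ", encoder_params(L, per_layer, True))
```

Under these assumptions, increasing H only grows the small E x H projection rather than the full vocabulary table, and adding layers leaves the shared encoder count unchanged.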

Additionally, ALBERT uses a self-supervised loss for sentence-order prediction (SOP). SOP focuses primarily on inter-sentence coherence and is designed to address the ineffectiveness of the next sentence prediction (NSP) loss proposed in the original BERT.
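One way to picture how SOP training pairs might be built is the sketch below. It assumes the construction described in the ALBERT paper: a positive example is two consecutive segments in their original order, and a negative example is the same two segments with the order swapped. The function name `make_sop_examples` and the label convention (1 for original order, 0 for swapped) are hypothetical choices for illustration.

```python
import random


def make_sop_examples(segments, rng=None):
    """Build (first, second, label) triples from consecutive segments.

    label 1: the pair is in its original order.
    label 0: the same pair with the order swapped.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))
        else:
            examples.append((second, first, 0))
    return examples


segments = [
    "The model was pretrained on a large corpus.",
    "It was then fine-tuned on each downstream task.",
    "Results improved on most benchmarks.",
]
for first, second, label in make_sop_examples(segments):
    print(label, "|", first, "|", second)
```

Because both segments in every pair come from the same document, the label cannot be inferred from topic alone, which is why the task targets coherence between sentences.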

Source: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Tasks

Task	Papers	Share
Language Modelling	28	10.22%
Sentence	23	8.39%
Text Classification	14	5.11%
Question Answering	13	4.74%
Sentiment Analysis	13	4.74%
Named Entity Recognition (NER)	8	2.92%
Reading Comprehension	7	2.55%
NER	6	2.19%
Natural Language Understanding	6	2.19%

Components

Component	Type
Dense Connections	Feedforward Networks
GELU	Activation Functions
LAMB	Large Batch Optimization
Layer Normalization	Normalization
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms
Softmax	Output Functions
WordPiece	Subword Segmentation

Categories

Transformers

Autoencoding Transformers