Transformer-XL

Introduced by Dai et al. in Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer-XL (the "XL" stands for "extra long") is a Transformer architecture that introduces recurrence into the deep self-attention network. Instead of computing hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained for previous segments. The reused hidden states serve as memory for the current segment, creating a recurrent connection between segments. Because information can propagate through these recurrent connections, the model can capture very long-term dependencies. In addition, Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than those observed during training.
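
The segment-level recurrence described above can be illustrated with a minimal sketch. It is a simplified, single-layer illustration and not the authors' implementation: it assumes one attention head, omits the learned query/key/value projections and the relative positional encoding, and uses hypothetical names (attend_with_memory, run_segments, mem_len). In a trained model the cached states would also be treated as constants, so no gradient flows into previous segments.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(segment, memory):
    """Causal dot-product attention over [memory; segment].

    Queries come only from the current segment. Keys and values come from
    the cached memory followed by the current segment. Projections are
    omitted (assumption made to keep the sketch short).
    """
    context = memory + segment  # concatenate along the time axis
    d = len(segment[0])
    out = []
    for i, q in enumerate(segment):
        # Position i sees all of memory plus segment positions 0..i.
        visible = context[: len(memory) + i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)])
    return out


def run_segments(segments, mem_len):
    """Process segments in order, carrying a bounded memory between them."""
    memory = []
    outputs = []
    for segment in segments:
        outputs.append(attend_with_memory(segment, memory))
        # Cache the most recent mem_len input states for the next segment.
        # A real model would detach these from the gradient computation.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return outputs


segments = [
    [[1.0, 0.0], [0.0, 1.0]],   # first segment: two positions, width 2
    [[1.0, 1.0], [0.5, -0.5]],  # second segment
]
with_mem = run_segments(segments, mem_len=2)
no_mem = run_segments(segments, mem_len=0)
```

With mem_len=0 each segment is processed in isolation, as in a fixed-length-context model. With mem_len=2 the second segment also attends to the cached states of the first, so its outputs differ while the first segment's outputs stay the same.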

Source: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Tasks

Task	Papers	Share
Language Modelling	37	33.64%
Machine Translation	6	5.45%
Speech Recognition	5	4.55%
Translation	5	4.55%
Paraphrase Identification	3	2.73%
Sentence	3	2.73%
Text Generation	3	2.73%
Automatic Speech Recognition (ASR)	3	2.73%
Reinforcement Learning (RL)	2	1.82%

Components

Component	Type
Adam	Stochastic Optimization
Adaptive Input Representations	Input Embedding Factorization
Adaptive Softmax	Output Functions
Dense Connections	Feedforward Networks
Dropout	Regularization
Layer Normalization	Normalization
Linear Warmup With Cosine Annealing	Learning Rate Schedules
Multi-Head Attention	Attention Modules
ReLU	Activation Functions
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms
Variational Dropout	Regularization

Categories

Transformers

Autoregressive Transformers