Transformer-XL (meaning extra long) is a Transformer architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained for previous segments. The reused hidden states serve as a memory for the current segment, which builds a recurrent connection between segments. As a result, modeling very long-term dependencies becomes possible because information can propagate through the recurrent connections. As an additional contribution, Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than those observed during training. A minimal sketch of the recurrence mechanism follows.
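The sketch below is a minimal, single-layer PyTorch illustration of segment-level recurrence, not the authors' implementation: the module name `SegmentRecurrentAttention` and the `mem_len` parameter are hypothetical, and the relative positional encoding and causal masking of the full model are omitted for brevity. Hidden states from earlier segments are cached, detached from the gradient graph, and prepended to the keys and values of the current segment.

```python
# A minimal sketch of Transformer-XL-style segment-level recurrence.
# Assumptions: single attention layer, no relative positional encoding,
# no causal mask; `SegmentRecurrentAttention` and `mem_len` are illustrative.
from typing import Optional, Tuple

import torch
import torch.nn as nn


class SegmentRecurrentAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, mem_len: int):
        super().__init__()
        self.mem_len = mem_len
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(
        self, x: torch.Tensor, memory: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Prepend cached hidden states from previous segments so the current
        # segment can attend beyond its own boundary.
        context = x if memory is None else torch.cat([memory, x], dim=1)
        out, _ = self.attn(query=x, key=context, value=context)
        # Cache the most recent `mem_len` hidden states for the next segment;
        # the memory is detached so gradients do not flow into past segments.
        new_memory = context[:, -self.mem_len:, :].detach()
        return out, new_memory


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = SegmentRecurrentAttention(d_model=32, n_heads=4, mem_len=16)
    memory = None
    # Process a long sequence as a stream of fixed-length segments,
    # carrying the cached memory from one segment to the next.
    for segment in torch.randn(1, 64, 32).split(16, dim=1):
        out, memory = layer(segment, memory)
        print(out.shape, memory.shape)
```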
Source: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 38 | 29.23% |
| Decoder | 7 | 5.38% |
| Machine Translation | 6 | 4.62% |
| Speech Recognition | 5 | 3.85% |
| Translation | 5 | 3.85% |
| Text Generation | 4 | 3.08% |
| Paraphrase Identification | 3 | 2.31% |
| Sentence | 3 | 2.31% |
| Automatic Speech Recognition (ASR) | 3 | 2.31% |