Transformer-XL (meaning extra long) is a Transformer architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, establishing a recurrent connection between segments. As a result, modeling very long-term dependencies becomes possible because information can propagate through these recurrent connections. As an additional contribution, Transformer-XL uses a novel relative positional encoding formulation that generalizes to attention lengths longer than those observed during training.
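As a rough illustration of the segment-level recurrence described above, the sketch below shows a minimal single-head, causal self-attention layer in PyTorch that concatenates detached hidden states from the previous segment with the current segment to form its keys and values. The class name `RecurrentSelfAttention`, the tensor shapes, and the omission of multi-head attention and the relative positional encoding are simplifications for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn
from typing import Optional


class RecurrentSelfAttention(nn.Module):
    """Single-head causal self-attention with segment-level recurrence (sketch).

    Hidden states cached from the previous segment are concatenated with the
    current segment to form the keys and values, so queries can attend beyond
    the current segment boundary. The cached states are detached, mirroring
    the stop-gradient treatment of memory in Transformer-XL.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor] = None):
        # x:      (batch, seg_len, d_model) -- current segment
        # memory: (batch, mem_len, d_model) -- hidden states of the previous segment
        context = torch.cat([memory.detach(), x], dim=1) if memory is not None else x
        seg_len = x.size(1)
        mem_len = context.size(1) - seg_len

        q = self.q_proj(x)         # queries come from the current segment only
        k = self.k_proj(context)   # keys/values cover memory + current segment
        v = self.v_proj(context)

        scores = q @ k.transpose(-2, -1) * self.scale  # (batch, seg_len, mem_len + seg_len)

        # Causal mask: position i attends to every memory position and to
        # current-segment positions <= i.
        mask = torch.ones(seg_len, mem_len + seg_len, dtype=torch.bool, device=x.device)
        mask[:, mem_len:] = torch.tril(mask[:, mem_len:])
        scores = scores.masked_fill(~mask, float("-inf"))

        out = torch.softmax(scores, dim=-1) @ v
        # The current segment's hidden states become the next segment's memory.
        return out, x


# Usage: process a long sequence one segment at a time, carrying memory forward.
layer = RecurrentSelfAttention(d_model=64)
segments = torch.randn(8, 4, 32, 64)  # batch of 8, split into 4 segments of length 32
memory = None
for t in range(segments.size(1)):
    out, memory = layer(segments[:, t], memory)
```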
Source: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 37 | 38.14% |
| Machine Translation | 6 | 6.19% |
| Speech Recognition | 5 | 5.15% |
| Paraphrase Identification | 3 | 3.09% |
| Text Generation | 3 | 3.09% |
| Automatic Speech Recognition (ASR) | 3 | 3.09% |
| Reinforcement Learning (RL) | 2 | 2.06% |
| Music Generation | 2 | 2.06% |
| Abstractive Text Summarization | 2 | 2.06% |