A DeLighT Block is the building block of the DeLighT transformer architecture. It uses a DExTra transformation to reduce the dimensionality of the vectors fed into the attention layer, where a single-headed attention module is used. Because the DExTra transformation already learns wider representations of the input across its layers, the authors can replace multi-head attention with single-head attention. This is followed by a light-weight FFN that, rather than expanding the dimension (standard Transformer FFNs widen it to 4x the model dimension), imposes a bottleneck and squeezes the dimension. Again, this is possible because the DExTra transformation has already incorporated wider representations, so the FFN can squeeze instead at this layer.
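As a rough illustration of the structure described above, here is a minimal PyTorch sketch, not the authors' implementation: the class and parameter names (`SimplifiedDeLighTBlock`, `d_out`, `ffn_reduction`) are illustrative, and the DExTra stand-in omits the grouped linear transformations and input mixing used in the actual paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedDeLighTBlock(nn.Module):
    """Illustrative sketch: DExTra-style expand-reduce, single-head
    attention on the reduced dimension, and a bottleneck FFN."""

    def __init__(self, d_model=512, expansion=2, d_out=256, ffn_reduction=4):
        super().__init__()
        d_max = d_model * expansion
        # Simplified DExTra stand-in: expand the representation, then reduce
        # it to a narrower dimension (d_out < d_model) fed to attention.
        self.dextra = nn.Sequential(
            nn.Linear(d_model, d_max), nn.GELU(),
            nn.Linear(d_max, d_out), nn.GELU(),
        )
        # Single-head attention operating on the reduced d_out dimension.
        self.q_proj = nn.Linear(d_out, d_out)
        self.k_proj = nn.Linear(d_out, d_out)
        self.v_proj = nn.Linear(d_out, d_out)
        self.attn_out = nn.Linear(d_out, d_model)  # project back to model width
        self.attn_norm = nn.LayerNorm(d_model)
        # Light-weight FFN: bottleneck (d_model -> d_model / r -> d_model)
        # instead of the usual 4x expansion.
        d_ffn = d_model // ffn_reduction
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(),
            nn.Linear(d_ffn, d_model),
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def forward(self, x):              # x: (batch, seq_len, d_model)
        h = self.dextra(x)             # (batch, seq_len, d_out)
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1) @ v
        x = self.attn_norm(x + self.attn_out(attn))  # residual + norm
        x = self.ffn_norm(x + self.ffn(x))           # residual + norm
        return x


if __name__ == "__main__":
    block = SimplifiedDeLighTBlock()
    y = block(torch.randn(2, 10, 512))
    print(y.shape)  # torch.Size([2, 10, 512])
```

The sketch keeps the shapes of the description: attention runs at the reduced width produced by the expand-reduce stage, and the FFN narrows rather than widens the model dimension.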
Source: DeLighT: Deep and Light-weight Transformer
| Task | Papers | Share |
|---|---|---|
| Language Modelling | 1 | 33.33% |
| Machine Translation | 1 | 33.33% |
| Translation | 1 | 33.33% |