Attention Modules

DeLighT Block

Introduced by Mehta et al. in DeLighT: Deep and Light-weight Transformer

A DeLighT Block is the building block of the DeLighT transformer architecture. It applies a DExTra transformation to reduce the dimensionality of the vectors fed into the attention layer, where a single-headed attention module is used. Because the DExTra transformation already learns wider representations of the input across its layers, the authors can replace multi-head attention with single-head attention. The attention layer is followed by a light-weight FFN which, rather than expanding the dimension (standard Transformers widen to 4x the model dimension), imposes a bottleneck and squeezes the dimension. Again, this works because the DExTra transformation has already incorporated wider representations, so the FFN can squeeze instead of expand.
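The three steps above (dimensionality reduction, single-head attention, bottleneck FFN) can be sketched in NumPy. This is an illustrative sketch only: the `delight_block` function, the single random projection standing in for the full DExTra transformation (which actually stacks group linear transformations), and all dimensions are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def delight_block(x, d_model=32, reduction=4):
    """Sketch of a DeLighT-style block: reduce dimensionality,
    apply single-head attention, then a bottleneck FFN."""
    seq_len, d_in = x.shape

    # Stand-in for DExTra: one random linear projection down to d_model
    # (the real DExTra uses stacked group linear transformations).
    W_dex = rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)
    h = x @ W_dex                               # (seq_len, d_model)

    # Single-head scaled dot-product attention -- no head split needed,
    # since the representations are already wide enough.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v   # (seq_len, d_model)

    # Light-weight FFN: squeeze to d_model // reduction and back,
    # instead of the usual 4x expansion.
    d_b = d_model // reduction
    W1 = rng.standard_normal((d_model, d_b)) / np.sqrt(d_model)
    W2 = rng.standard_normal((d_b, d_model)) / np.sqrt(d_b)
    return np.maximum(attn @ W1, 0.0) @ W2      # (seq_len, d_model)

out = delight_block(rng.standard_normal((10, 64)))
print(out.shape)  # (10, 32)
```

Note how the FFN's hidden width is `d_model // reduction` rather than `4 * d_model`, which is where the parameter savings over a standard Transformer block come from.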

Source: DeLighT: Deep and Light-weight Transformer

Task Papers Share
Language Modelling 1 50.00%
Machine Translation 1 50.00%