ALiBi does this without using actual position embeddings. Instead, when computing the attention between a given query and key, ALiBi penalizes the attention value that the query can assign to the key depending on how far apart the two positions are. When a key and query are close together, the penalty is very low; when they are far apart, the penalty is very high.
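The distance-based penalty can be sketched as an additive bias applied to the pre-softmax attention scores. A minimal NumPy sketch, assuming the geometric head-slope schedule from the paper (slopes of 2^(-8h/n) for head h of n heads) and a causal mask; names like `alibi_bias` are illustrative, not from any library:

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) additive attention bias."""
    # Head-specific slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    # distance[i, j] = j - i: 0 on the diagonal, -1, -2, ... for earlier keys,
    # so the penalty grows linearly with query-key distance
    distance = pos[None, :] - pos[:, None]
    bias = slopes[:, None, None] * distance[None, :, :]
    # Causal mask: a query cannot attend to future keys
    return np.where(distance[None, :, :] <= 0, bias, -np.inf)

# Usage: add the bias to raw attention scores before the softmax
scores = np.random.randn(8, 5, 5)          # (heads, queries, keys)
biased_scores = scores + alibi_bias(8, 5)  # nearby keys penalized least
```

Because the bias depends only on relative distance, no learned position embeddings are needed, and the same function extends to any sequence length at inference time.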
This method was motivated by the simple reasoning that words that are close-by matter much more than ones that are far away.
This method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods and Rotary embeddings when evaluating sequences that are longer than the ones the model was trained on (this is known as extrapolation).

Source: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation