Neural networks require large amounts of memory and compute to process high-resolution images, even when only a small part of the image is actually informative for the task at hand.
Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited.
We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer.
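A minimal sketch of what such a re-parametrization could look like, under the assumption that the collaborative layer factors the per-head query/key projections into shared projections plus per-head mixing vectors; the variable names, dimensions, and the use of a CP decomposition here are illustrative, not the paper's exact procedure.

```python
# Hypothetical sketch: re-parametrizing per-head query/key projections into a
# "collaborative" form with shared projections and per-head mixing vectors.
# All names, shapes, and the decomposition choice are assumptions.
import numpy as np
from tensorly.decomposition import parafac  # CP decomposition via ALS

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 4, 32, 8

# Pretrained per-head projections (random stand-ins for checkpoint weights).
W_Q = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)
W_K = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)

# Attention logits of head h depend only on the bilinear form W_Q[h] @ W_K[h].T,
# so stack those d_model x d_model matrices into a (n_heads, d, d) tensor.
B = np.einsum('hik,hjk->hij', W_Q, W_K)

# CP-decompose the stack: B[h] ~= Wq_shared @ diag(m[h]) @ Wk_shared.T
rank = n_heads * d_head  # rank that can represent the stacked bilinear forms
weights, (head_mix, Wq_shared, Wk_shared) = parafac(B, rank=rank, n_iter_max=500)
mixing = weights * head_mix  # per-head mixing vectors m[h], shape (n_heads, rank)

# Check that the re-parametrized layer gives (approximately) the same logits;
# ALS is not exact, so expect a small but nonzero error.
X = rng.standard_normal((10, d_model))  # 10 token embeddings
logits_orig = np.einsum('qi,hij,kj->hqk', X, B, X)
B_collab = np.einsum('hr,ir,jr->hij', mixing, Wq_shared, Wk_shared)
logits_collab = np.einsum('qi,hij,kj->hqk', X, B_collab, X)
print('max abs error:', np.abs(logits_orig - logits_collab).max())
```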
Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation.
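A small illustration of the mapping-based approach described above: given a seed dictionary of translation pairs, learn a linear map that projects source-language vectors into the target space. The closed-form orthogonal Procrustes solution below is one standard choice; the toy data and variable names are assumptions, not taken from any particular paper.

```python
# Illustrative sketch: align pretrained embeddings of two languages with a
# linear map learned from a bilingual seed dictionary (orthogonal Procrustes).
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 1000

# Stand-ins for monolingual embeddings of a seed dictionary:
# X[i] is a source-language word vector, Y[i] the vector of its translation.
X = rng.standard_normal((n_pairs, d))
true_rotation = np.linalg.qr(rng.standard_normal((d, d)))[0]
Y = X @ true_rotation + 0.01 * rng.standard_normal((n_pairs, d))

# Orthogonal Procrustes: W* = argmin over orthogonal W of ||X W - Y||_F,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Map source vectors into the shared (target) space; translations would then
# be retrieved by nearest-neighbour search in that space.
mapped = X @ W
cos = (mapped * Y).sum(1) / (np.linalg.norm(mapped, axis=1) * np.linalg.norm(Y, axis=1))
print('mean cosine similarity with gold translations:', cos.mean())
```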
This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
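To make the claim concrete, here is a hedged numerical sketch of the constructive direction: if each attention head places all of its attention mass on one fixed relative pixel shift, the layer collapses to a convolution whose kernel slice at that shift is the head's value/output map. The shapes, padding convention, and names are illustrative, not the paper's notation.

```python
# Sketch: multi-head attention with one-hot, shift-selecting attention patterns
# reproduces a 3x3 convolution exactly. Everything below is an illustration.
import numpy as np

rng = np.random.default_rng(0)
H, W, C_in, C_out, k = 6, 6, 3, 4, 3            # image size, channels, 3x3 kernel
shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # one head per shift
kernel = rng.standard_normal((k, k, C_in, C_out)) * 0.1
image = rng.standard_normal((H, W, C_in))

# Reference: plain 2D convolution (stride 1, zero padding at the borders).
def conv2d(img):
    out = np.zeros((H, W, C_out))
    for y in range(H):
        for x in range(W):
            for (dy, dx), Khd in zip(shifts, kernel.reshape(k * k, C_in, C_out)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    out[y, x] += img[yy, xx] @ Khd
    return out

# "Attention" version: head h uses a one-hot attention pattern that selects the
# pixel at relative shift shifts[h]; its value/output map is the kernel slice.
def attention_as_conv(img):
    flat = img.reshape(H * W, C_in)
    out = np.zeros((H * W, C_out))
    for h, (dy, dx) in enumerate(shifts):
        Khd = kernel.reshape(k * k, C_in, C_out)[h]   # value/output map of head h
        probs = np.zeros((H * W, H * W))              # one-hot attention matrix
        for y in range(H):
            for x in range(W):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    probs[y * W + x, yy * W + xx] = 1.0
        out += probs @ flat @ Khd                     # head outputs summed
    return out.reshape(H, W, C_out)

print('max abs difference:', np.abs(conv2d(image) - attention_as_conv(image)).max())
```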
We consider the problem of path inference: given a path prefix, i.e., a partially observed sequence of nodes in a graph, we want to predict which nodes are in the missing suffix.
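For intuition about the task, a minimal baseline (not the paper's method) could estimate first-order transition counts from observed paths and extend a prefix greedily; the toy graph and helper names below are made up for illustration.

```python
# Illustrative baseline for path inference: fit first-order transition counts
# from training paths, then greedily extend a partially observed prefix.
from collections import defaultdict

def fit_transitions(paths):
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for u, v in zip(path, path[1:]):
            counts[u][v] += 1
    return counts

def predict_suffix(prefix, counts, length):
    node, suffix = prefix[-1], []
    for _ in range(length):
        if not counts[node]:
            break                                           # no observed outgoing edge
        node = max(counts[node], key=counts[node].get)      # most frequent successor
        suffix.append(node)
    return suffix

train_paths = [['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'e'], ['b', 'c', 'd', 'f']]
counts = fit_transitions(train_paths)
print(predict_suffix(['a', 'b'], counts, length=2))         # -> ['c', 'd']
```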
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training.
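As a toy illustration of one such family of algorithms, the sketch below simulates synchronous data-parallel SGD: each simulated worker computes a gradient on its own data shard, the gradients are averaged, and a single shared parameter vector is updated. The problem, names, and hyperparameters are assumptions made purely for illustration.

```python
# Toy simulation of synchronous data-parallel SGD on a linear-regression task.
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_samples, d = 4, 4000, 10

# Generate data and split it evenly across workers (their local shards).
X = rng.standard_normal((n_samples, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

w = np.zeros(d)
lr = 0.1
for step in range(200):
    # Each worker computes the mean-squared-error gradient on its shard.
    grads = [2 * Xi.T @ (Xi @ w - yi) / len(yi) for Xi, yi in shards]
    # The gradients are averaged and the shared parameters are updated.
    w -= lr * np.mean(grads, axis=0)

print('distance to ground-truth weights:', np.linalg.norm(w - w_true))
```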