We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together.
Ranked #2 on Zero-shot Audio Classification on AudioSet (using extra training data)
6 code implementations • 14 Apr 2023 • Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision.
Ranked #1 on Image Retrieval on AmsterTime (using extra training data)
For example, segmentation mIoU drops from 44.5 to 30.5 when compressing to 0.1 bpp with the best compression model we evaluated.
Lossy image compression aims to represent images in as few bits as possible while maintaining fidelity to the original.
In this work, we attempt to bring these lines of research closer by revisiting vector quantization for image compression.
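The core operation being revisited here, vector quantization, replaces each feature vector with its nearest entry in a learned codebook. A minimal sketch (not the paper's actual model; names and the toy codebook are illustrative):

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (n, d) array; codebook: (k, d) array.
    Returns (indices, quantized) where quantized[i] = codebook[indices[i]].
    """
    # Squared L2 distance between every feature and every code: shape (n, k)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

# Toy usage: 4 feature vectors, codebook with 2 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2], [0.0, 0.2], [1.1, 0.8]])
idx, quant = vector_quantize(feats, codebook)
# idx → [0, 1, 0, 1]
```

Storing only the indices (log2(k) bits per vector) rather than the vectors themselves is what makes this a compression primitive.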
Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures.
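The speed-up comes from the encoder only ever seeing the small subset of patches that survive masking. A hedged sketch of that masking step (function and shapes are illustrative, not the paper's code):

```python
import numpy as np

def random_mask_patches(patches, drop_ratio, rng):
    """Keep a random subset of patch tokens; the encoder only processes those.

    patches: (num_patches, dim). Returns (kept_patches, kept_idx).
    """
    n = patches.shape[0]
    n_keep = max(1, int(round(n * (1.0 - drop_ratio))))
    kept_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[kept_idx], kept_idx

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))   # e.g. a 14x14 grid of image patches
kept, idx = random_mask_patches(patches, drop_ratio=0.9, rng=rng)
# With 90% dropped, only ~20 of 196 patch tokens reach the encoder.
```

Since self-attention cost is quadratic in sequence length, processing ~10% of tokens cuts the encoder's attention cost by roughly 100x.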
(2) Fine-tuning the weights of the attention layers is sufficient to adapt vision transformers to a higher resolution and to other classification tasks.
Ranked #8 on Image Classification on CIFAR-10
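Fine-tuning only the attention layers amounts to marking those parameters trainable and freezing everything else. A minimal sketch of the selection logic, using hypothetical ViT-style parameter names (not the authors' code):

```python
# Illustrative parameter names for a tiny 2-block vision transformer
# (hypothetical, loosely following common ViT implementations).
param_names = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight", "blocks.0.attn.proj.weight",
    "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc2.weight",
    "blocks.1.attn.qkv.weight", "blocks.1.attn.proj.weight",
    "blocks.1.mlp.fc1.weight", "blocks.1.mlp.fc2.weight",
    "head.weight",
]

def attention_only_trainable(names):
    """Mark only attention-layer parameters as trainable; freeze the rest."""
    return {name: (".attn." in name) for name in names}

trainable = attention_only_trainable(param_names)
# Only the four *.attn.* tensors remain trainable.
```

In a real framework the same predicate would set each parameter's gradient flag instead of building a dict.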
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
Ranked #37 on Semantic Segmentation on ADE20K val
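The attention-based global map described above can be sketched as a single learned query attending over the flattened convolutional feature map, collapsing it into one vector (a simplified illustration, not the paper's exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feature_map, query):
    """Collapse an (H*W, d) conv feature map into one global d-vector
    via attention with a single query vector of shape (d,)."""
    scores = feature_map @ query / np.sqrt(feature_map.shape[1])  # (H*W,)
    weights = softmax(scores)            # spatial attention map, sums to 1
    return weights @ feature_map         # weighted average over positions, (d,)

rng = np.random.default_rng(1)
fmap = rng.standard_normal((49, 64))   # 7x7 feature map, 64 channels
q = rng.standard_normal(64)            # learned class query (random here)
global_vec = attention_pool(fmap, q)
```

Unlike global average pooling, the weights depend on the content of each position, which gives the network the non-local reasoning mentioned above.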
Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets from different domains.
11 code implementations • Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.
Ranked #54 on Instance Segmentation on COCO minival
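The "transposed" attention above swaps the roles of tokens and channels: the attention map is a (d x d) cross-covariance over feature channels rather than an (n x n) map over tokens, so cost scales with feature dimension instead of sequence length. A hedged sketch (simplified, single-head, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_covariance_attention(q, k, v, temperature=1.0):
    """Transposed self-attention: a (d x d) map mixes feature channels
    instead of an (n x n) map mixing tokens.

    q, k, v: (n_tokens, d). Cost grows with d^2, not n^2.
    """
    # Normalize queries/keys along the token axis before the cross-covariance.
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    attn = softmax(temperature * (kn.T @ qn), axis=0)  # (d, d) channel map
    return v @ attn                                     # (n_tokens, d)

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((196, 32)) for _ in range(3))
out = cross_covariance_attention(q, k, v)
```

Because the (d x d) map is independent of the number of tokens, this kind of attention stays cheap even for high-resolution inputs with many patches.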
16 code implementations • Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification.
Ranked #5 on Image Classification on ImageNet ReaL (Top 1 Accuracy metric)
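The MLP-only design alternates two linear mixing steps per residual block: one across patches, one across channels. A minimal sketch of such a block, with the affine normalizations omitted for brevity (weights here are random placeholders, not trained parameters):

```python
import numpy as np

def resmlp_block(x, w_patch, w1, w2):
    """One ResMLP-style residual block (affine norms omitted for brevity).

    x: (n_patches, d). A linear layer first mixes information across
    patches; a two-layer MLP with GELU then mixes across channels.
    """
    gelu = lambda t: 0.5 * t * (1 + np.tanh(np.sqrt(2 / np.pi) * (t + 0.044715 * t**3)))
    x = x + w_patch @ x           # cross-patch linear: (n, n) @ (n, d)
    x = x + gelu(x @ w1) @ w2     # cross-channel MLP: d -> 4d -> d
    return x

rng = np.random.default_rng(3)
n, d = 16, 8
x = rng.standard_normal((n, d))
out = resmlp_block(x, 0.1 * rng.standard_normal((n, n)),
                   rng.standard_normal((d, 4 * d)) / np.sqrt(d),
                   rng.standard_normal((4 * d, d)) / np.sqrt(4 * d))
```

Note there is no self-attention anywhere: all token interaction happens through the fixed-size cross-patch linear layer.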
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
Ranked #11 on Image Classification on iNaturalist 2019
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification.
Deep neural networks require collecting and annotating large amounts of data to train successfully.
Ranked #43 on Self-Supervised Action Recognition on UCF101
Conditional text-to-image generation is an active area of research, with many possible applications.
Ranked #2 on Text-to-Image Generation on GeNeVA (i-CLEVR)
Finally, for better network initialization, we transfer from the task of action recognition to action detection by pre-training our framework using the recently released large-scale Kinetics dataset.