Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning, and semantic segmentation show that our procedure outperforms previous fully supervised training recipes for ViT by a large margin.
Ranked #1 on Image Classification on ImageNet ReaL (Number of params metric)
(2) Fine-tuning the weights of the attention layers is sufficient to adapt vision transformers to a higher resolution and to other classification tasks.
Ranked #6 on Image Classification on CIFAR-10 (using extra training data)
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
Ranked #26 on Semantic Segmentation on ADE20K val
We revisit watermarking techniques based on pre-trained deep networks in light of self-supervised approaches.
Modern approaches for fast retrieval of similar vectors on billion-scale datasets rely on compressed-domain approaches such as binary sketches or product quantization.
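As a rough illustration of the compressed-domain idea, here is a minimal product quantization sketch in numpy. The function names, parameters, and the tiny k-means inside are illustrative assumptions, not the implementation behind the work summarized above.

```python
import numpy as np

def train_pq(X, m=4, k=16, iters=10, seed=0):
    """Train a product quantizer: split the D dims into m subspaces
    and run a small k-means (k centroids) independently in each one."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // m
    codebooks = []
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        cent = sub[rng.choice(n, k, replace=False)]
        for _ in range(iters):
            # assign each subvector to its nearest centroid
            dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def encode(X, codebooks):
    """Compress each vector to m one-byte codes (valid while k <= 256)."""
    m = len(codebooks)
    ds = X.shape[1] // m
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for j, cent in enumerate(codebooks):
        sub = X[:, j * ds:(j + 1) * ds]
        dist = ((sub[:, None, :] - cent[None]) ** 2).sum(-1)
        codes[:, j] = dist.argmin(1)
    return codes

def decode(codes, codebooks):
    """Reconstruct approximate vectors from the compact codes."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])
```

With m=4 and k=16, an 8-dimensional float vector shrinks to 4 bytes, and distances can be approximated directly in the compressed domain.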
We share competitive training settings and pre-trained models in the timm open-source library, with the hope that they will serve as better baselines for future work.
Ranked #8 on Image Classification on iNaturalist 2019
16 code implementations • Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou
We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification.
Ranked #5 on Image Classification on ImageNet ReaL (Top 1 Accuracy metric)
In this paper, we ask whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
Ranked #1 on Video Object Segmentation on DAVIS 2017 (J&F metric)
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
Ranked #9 on Image Classification on iNaturalist 2019
In particular, we investigate the interplay of architecture and optimization of such dedicated transformers.
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification.
In this work, we produce a competitive convolution-free transformer by training on ImageNet only.
Ranked #3 on Document Layout Analysis on PubLayNet val
By jointly leveraging the coarse labels and the underlying fine-grained latent space, it significantly improves the accuracy of category-level retrieval methods.
Ranked #2 on Image Classification on iNaturalist 2019
We propose a simple architecture to address unpaired image-to-image translation tasks: style or class transfer, denoising, deblurring, deblocking, etc.
Ranked #1 on Image-to-Image Translation on horse2zebra (Frechet Inception Distance metric)
An EfficientNet-L2 pre-trained with weak supervision on 300M unlabeled images and further optimized with FixRes achieves 88.5% top-1 accuracy (top-5: 98.7%), which establishes the new state of the art for ImageNet with a single crop.
Ranked #8 on Image Classification on ImageNet ReaL (using extra training data)
Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.
In this paper, we address the problem of reducing the memory footprint of convolutional network architectures.
In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer in a state-of-the-art transformer-based architecture.
Conversely, when training a ResNeXt-101 32x48d pre-trained in a weakly-supervised fashion on 940 million public images at resolution 224x224 and further optimizing it for test resolution 320x320, we obtain a test top-1 accuracy of 86.4% (top-5: 98.0%) (single-crop).
Ranked #2 on Fine-Grained Image Classification on Birdsnap (using extra training data)
This paper presents a study of semi-supervised learning with large convolutional networks.
Ranked #7 on Image Classification on OmniBenchmark (using extra training data)
When fed to a linear classifier, the learned embeddings provide state-of-the-art classification accuracy.
Ranked #1 on Image Retrieval on INRIA Holidays
Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.
Similarity search approaches based on graph walks have recently attained outstanding speed-accuracy trade-offs, setting aside memory requirements.
We finally describe experiments on the English-Esperanto low-resource language pair, for which only a limited amount of parallel data exists, to show the potential impact of our method in fully unsupervised machine translation.
Ranked #2 on Word Alignment on en-es
While k-means is usually considered the gold standard for this task, we evaluate diffusion methods that have been overlooked by the state of the art, such as the Markov Clustering algorithm, and show their merit.
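To make the alternative to k-means concrete, below is a minimal numpy sketch of the Markov Clustering (MCL) algorithm named above. The parameter choices and cluster-extraction rule are simplifying assumptions for illustration, not the evaluated implementation.

```python
import numpy as np

def markov_cluster(A, expansion=2, inflation=2.0, iters=30, tol=1e-6):
    """Minimal Markov Clustering (MCL) sketch on an adjacency matrix A.
    Alternates expansion (matrix power, which spreads flow along walks)
    and inflation (elementwise power + column normalisation, which
    strengthens strong flows and prunes weak ones)."""
    M = A.astype(float) + np.eye(len(A))      # add self-loops
    M /= M.sum(axis=0, keepdims=True)         # make column-stochastic
    for _ in range(iters):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)   # expansion step
        M = M ** inflation                         # inflation step
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:           # reached steady state
            break
    # read clusters: the nonzero columns of each surviving row form a cluster
    clusters = []
    for row in M:
        members = frozenset(np.nonzero(row > 1e-6)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters
```

Unlike k-means, MCL needs no preset number of clusters: the granularity is controlled indirectly by the inflation parameter.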
This paper considers the problem of inferring image labels from images when only a few annotated examples are available at training time.
Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures.
Hashing produces compact representations for documents, to perform tasks like classification or retrieval based on these short codes.
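As a toy illustration of hashing vectors into short binary codes, here is a sign-random-projection (SimHash-style) sketch in numpy; the function names and bit width are illustrative assumptions, not the method summarized above.

```python
import numpy as np

def random_hyperplane_hash(X, n_bits=16, seed=0):
    """Each bit records which side of a random hyperplane a vector
    falls on; nearby vectors tend to agree on most bits."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)   # (n, n_bits) binary codes

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int((a != b).sum())
```

Retrieval then reduces to comparing short codes with cheap Hamming distances instead of full floating-point vectors.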
This paper tackles the task of storing a large collection of vectors, such as visual descriptors, and of searching in it.
First, inspired by selective search for object proposals, we introduce an approach that generates action proposals from spatiotemporal super-voxels in an unsupervised manner; we call them Tubelets.
Recently, image representations built upon Convolutional Neural Networks (CNNs) have been shown to provide effective descriptors for image search, outperforming pre-CNN features as short-vector representations.
Ranked #4 on Image Retrieval on Par6k
We study an indexing architecture to store and search in a database of high-dimensional vectors from the perspective of statistical signal processing and decision theory.
Our results show that the regular dense detector is outperformed by other methods in most situations, leading us to improve the state of the art in comparable setups on standard retrieval and fine-grained benchmarks.
Our geometric-aware aggregation strategy is effective for image search, as shown by experiments performed on standard benchmarks for image and particular object retrieval, namely Holidays and Oxford buildings.