In this paper, we explore open-domain sketch-to-photo translation, which aims to synthesize a realistic photo from a freehand sketch given its class label, even when sketches of that class are missing from the training data.
These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought.
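To make the patch-embedding idea concrete, here is a minimal NumPy sketch of how an image is split into non-overlapping patches and linearly projected into token embeddings. This is an illustrative toy, not the paper's implementation; the function and variable names (`patch_embed`, `W_proj`) are hypothetical.

```python
import numpy as np

def patch_embed(image, patch_size, proj):
    """Split an (H, W, C) image into non-overlapping patches and
    linearly project each flattened patch to an embedding vector."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size, :].reshape(-1))
    patches = np.stack(patches)  # (num_patches, patch_size * patch_size * C)
    return patches @ proj        # (num_patches, embed_dim)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
W_proj = rng.standard_normal((4 * 4 * 3, 64))  # hypothetical projection matrix
tokens = patch_embed(img, 4, W_proj)
print(tokens.shape)  # (64, 64): 8x8 = 64 patches, each embedded in 64 dims
```

A 32x32 image with 4x4 patches yields an 8x8 grid of 64 tokens, which then feed the rest of the network regardless of whether the mixing layers use attention.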
Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime.
In this paper, we question whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared with convolutional networks (convnets).
The Transformer is the backbone of modern NLP models.
Attention mechanisms, especially self-attention, play an increasingly important role in deep feature representation in visual tasks.
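For reference, the self-attention operation at the core of these models can be written in a few lines of NumPy. This is a generic scaled dot-product self-attention sketch, not any particular paper's variant; the weight matrices `Wq`, `Wk`, `Wv` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token affinities
    return softmax(scores) @ V               # attention-weighted mixture of values

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))                          # 5 tokens, 8 dims
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Each output token is a convex combination of all value vectors, which is what lets every position attend to every other position in a single layer.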
We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers.
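As a rough illustration of an FC-only building block operating on feature maps, the sketch below flattens each channel's spatial map, passes it through two fully-connected layers, and adds a residual connection. This is a minimal toy under my own assumptions, not RepMLP's actual block or its structural re-parameterization.

```python
import numpy as np

def fc_block(x, W1, b1, W2, b2):
    """FC-style block on a (C, H, W) feature map: flatten spatial dims,
    apply two fully-connected layers with ReLU, reshape, add residual."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)            # each channel's map as one vector
    h = np.maximum(flat @ W1 + b1, 0.0)   # FC + ReLU
    out = h @ W2 + b2                     # FC back to H * W features
    return x + out.reshape(C, H, W)       # residual connection

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 7, 7))                  # 8 channels, 7x7 maps
W1, b1 = rng.standard_normal((49, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 49)), np.zeros(49)
y = fc_block(x, W1, b1, W2, b2)
print(y.shape)  # (8, 7, 7)
```

Unlike a convolution, such an FC layer has a global receptive field over the spatial positions but no built-in translation invariance, which is the trade-off the abstract alludes to.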
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.