In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy.
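As a rough illustration of the contrast being studied, the sketch below (PyTorch; dimensions, depths, and the hierarchical_stage helper are illustrative assumptions, not the paper's configuration) shows how a vanilla ViT keeps a single token resolution and channel width throughout, whereas a hierarchical design applies spatial reduction and doubles channels between stages.

    import torch
    import torch.nn as nn

    class PlainViTStage(nn.Module):
        """Vanilla ViT: every block keeps the same token count and width."""
        def __init__(self, dim=384, depth=4, heads=6):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, tokens):           # tokens: (B, N, dim)
            return self.blocks(tokens)       # still (B, N, dim): single scale

    def hierarchical_stage(dim):
        # Hierarchical alternative: spatial reduction with doubled channels.
        return nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    tokens = torch.randn(1, 196, 384)        # 14x14 patches, width 384
    assert PlainViTStage()(tokens).shape == tokens.shape
    fmap = torch.randn(1, 384, 14, 14)
    print(hierarchical_stage(384)(fmap).shape)   # (1, 768, 7, 7)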
In this paper, we bring them together and introduce the task of unified scene text detection and layout analysis.
Modern self-supervised learning algorithms typically enforce persistence of instance representations across views.
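A minimal sketch of that pattern, assuming a generic encoder; the negative-cosine-similarity loss below is one common instantiation of the cross-view consistency objective, not any specific paper's loss.

    import torch
    import torch.nn.functional as F

    def view_consistency_loss(encoder, view1, view2):
        # Pull the representations of two augmented views of the same
        # instance together (negative cosine similarity; illustrative).
        z1 = F.normalize(encoder(view1), dim=-1)
        z2 = F.normalize(encoder(view2), dim=-1)
        return -(z1 * z2).sum(dim=-1).mean()

    # Toy usage: a linear encoder and random stand-ins for the two views.
    encoder = torch.nn.Linear(32, 16)
    v1, v2 = torch.randn(8, 32), torch.randn(8, 32)
    view_consistency_loss(encoder, v1, v2).backward()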
A common practice in transfer learning is to initialize the downstream model weights by pre-training on a data-abundant upstream task.
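A minimal sketch of that practice, assuming PyTorch with torchvision (>= 0.13 for the weights enum); ResNet-18 stands in for the upstream model and the 10-class downstream task is chosen for illustration.

    import torch.nn as nn
    from torchvision.models import resnet18, ResNet18_Weights

    # Upstream: weights pre-trained on a data-abundant task (ImageNet).
    model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)

    # Downstream: keep the pre-trained backbone, swap in a new task head.
    num_downstream_classes = 10                      # illustrative
    model.fc = nn.Linear(model.fc.in_features, num_downstream_classes)

    # Optionally freeze the backbone and train only the new head.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True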
We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself.
Ranked #16 on Sentiment Analysis on IMDb
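A minimal sketch of the embedding-space perturbation described above, assuming model maps embeddings to logits (in the paper this is an LSTM-based classifier); the L2-normalized gradient step scaled by epsilon follows the standard formulation, but the helper itself is illustrative.

    import torch
    import torch.nn.functional as F

    def adversarial_embedding_loss(model, embeddings, labels, epsilon=1.0):
        # Perturb the word embeddings, not the raw tokens: take the loss
        # gradient w.r.t. the embeddings and step along it.
        embeddings = embeddings.detach().requires_grad_(True)
        loss = F.cross_entropy(model(embeddings), labels)
        grad, = torch.autograd.grad(loss, embeddings)
        r_adv = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
        # Train against the perturbed embeddings (no gradient through r_adv).
        return F.cross_entropy(model(embeddings + r_adv.detach()), labels)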
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes.
Ranked #7 on Retinal OCT Disease Classification on OCT2017
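The core of MobileNetV2 is the inverted residual block with a linear bottleneck; the sketch below is a simplified PyTorch rendering, with the expansion factor and layer hyperparameters set to commonly cited defaults and trimmed for brevity.

    import torch.nn as nn

    class InvertedResidual(nn.Module):
        """Expand with a 1x1 conv, filter with a 3x3 depthwise conv, then
        project back with a linear (activation-free) 1x1 conv."""
        def __init__(self, in_ch, out_ch, stride=1, expand=6):
            super().__init__()
            hidden = in_ch * expand
            self.use_res = stride == 1 and in_ch == out_ch
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden,
                          bias=False),
                nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, out_ch, 1, bias=False),  # linear bottleneck
                nn.BatchNorm2d(out_ch),
            )

        def forward(self, x):
            out = self.block(x)
            return x + out if self.use_res else out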
In our implementation, we have designed a search space where a policy consists of many sub-policies, one of which is randomly chosen for each image in each mini-batch.
Ranked #4 on Data Augmentation on ImageNet
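A sketch of that sampling scheme; the sub-policies, operation names, and the ops table below are placeholders, not a learned policy.

    import random

    # Each sub-policy: a short list of (operation, probability, magnitude).
    policy = [
        [("rotate", 0.7, 15), ("equalize", 0.3, None)],
        [("shear_x", 0.5, 0.2), ("color", 0.9, 0.6)],
    ]

    # Placeholder ops; real code would map names to image transforms.
    ops = {name: (lambda im, mag: im)
           for name in ("rotate", "equalize", "shear_x", "color")}

    def apply_policy(image, policy, ops):
        sub_policy = random.choice(policy)   # one sub-policy per image
        for name, prob, magnitude in sub_policy:
            if random.random() < prob:       # each op fires with its prob
                image = ops[name](image, magnitude)
        return image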
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and favorable convergence behavior.
Ranked #16 on Image Classification on Kuzushiji-MNIST
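For reference, a minimal PyTorch sketch of the basic residual block such architectures stack; layer sizes are illustrative.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """The stacked layers learn a residual F(x) that is added back to
        the identity shortcut: y = relu(F(x) + x)."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + x)   # identity shortcut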
The agent's accumulated belief about the world enables it to track visited regions of the environment.
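As a toy illustration of tracking visited regions, the grid-based VisitationBelief below is a hypothetical stand-in for the paper's belief representation, which is not specified in this excerpt.

    import numpy as np

    class VisitationBelief:
        """Toy accumulated belief over a grid world: remember which cells
        the agent has visited (purely illustrative)."""
        def __init__(self, height, width):
            self.visited = np.zeros((height, width), dtype=bool)

        def update(self, row, col):          # called after each step
            self.visited[row, col] = True

        def is_unexplored(self, row, col):   # query during exploration
            return not self.visited[row, col]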
We show that this form of adversarial training converges to a degenerate global minimum, wherein small curvature artifacts near the data points obfuscate a linear approximation of the loss.
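For context, the single-step attack at issue relies on exactly that linear approximation; below is a minimal sketch of standard FGSM, as an illustration rather than the paper's code.

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, epsilon):
        # Linearize the loss around x and step along the gradient's sign.
        # If training warps the loss surface near the data (gradient
        # masking), this linear approximation becomes uninformative.
        x = x.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
        return (x + epsilon * grad.sign()).detach()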