Unlike current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a single MOAT block.
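To make the merging idea concrete, here is a minimal PyTorch-style sketch (our own illustration under stated assumptions, not the authors' implementation): an MBConv-style inverted bottleneck supplies local features in place of the Transformer MLP, followed by self-attention for global interaction; all layer choices are illustrative.

```python
import torch
import torch.nn as nn

class MOATBlockSketch(nn.Module):
    """Sketch of a merged mobile-conv + self-attention block.

    Assumption: the Transformer MLP is replaced by an MBConv-style
    inverted bottleneck placed before self-attention; norms and the
    expansion ratio here are illustrative, not the paper's exact ones.
    """
    def __init__(self, dim, expansion=4, num_heads=8):
        super().__init__()
        hidden = dim * expansion
        self.mbconv = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # depthwise
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.mbconv(x)                 # local features from the mobile conv
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        q = self.norm(seq)
        attn_out, _ = self.attn(q, q, q)
        seq = seq + attn_out                   # global interaction via attention
        return seq.transpose(1, 2).reshape(b, c, h, w)
```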
We propose the Lite Vision Transformer (LVT), a novel lightweight transformer network with two enhanced self-attention mechanisms that improve model performance for mobile deployment.
Self-Attention has become prevalent in computer vision models.
To evaluate segmentation quality near object boundaries, we propose the Meticulosity Quality (MQ) score, which considers both mask coverage and boundary precision.
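The paper's exact MQ formulation is not reproduced here; the sketch below only illustrates the two ingredients the sentence names, mask coverage (as IoU) and boundary precision (as an F-measure over thin bands around each contour). The `boundary_band` helper, the band width, and the combination by a simple product are all our assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_band(mask, width=2):
    """Pixels within `width` of the mask's boundary (assumed helper)."""
    return binary_dilation(mask, iterations=width) & ~binary_erosion(mask, iterations=width)

def boundary_aware_quality(pred, gt, width=2):
    """Illustrative boundary-aware quality score (NOT the paper's exact MQ).

    Combines region coverage (IoU) with boundary agreement measured on
    thin bands around each mask's contour; the product combination is
    an assumption made for illustration.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    iou = (pred & gt).sum() / max((pred | gt).sum(), 1)   # mask coverage
    pb, gb = boundary_band(pred, width), boundary_band(gt, width)
    precision = (pb & gb).sum() / max(pb.sum(), 1)
    recall = (pb & gb).sum() / max(gb.sum(), 1)
    boundary_f = 2 * precision * recall / max(precision + recall, 1e-8)
    return iou * boundary_f
```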
We find that compositional deep networks, whose part-based representations give them innate robustness to natural occlusion, are also robust to patch attacks on PASCAL3D+ and the German Traffic Sign Recognition Benchmark without adversarial training.
PatchAttack induces misclassifications by superimposing small textured patches on the input image.
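The superimposition step itself is simple, as the hedged sketch below shows; the paper's actual contribution is the search over patch textures and placements (via reinforcement learning), which is omitted here. All sizes and positions in the usage example are assumed values.

```python
import torch

def superimpose_patch(image, patch, top, left):
    """Paste a textured patch onto an image (illustrative sketch only).

    `image`: (C, H, W) tensor; `patch`: (C, h, w) tensor with h <= H, w <= W.
    PatchAttack additionally searches for the texture and location that
    induce a misclassification; that search is not implemented here.
    """
    attacked = image.clone()
    _, h, w = patch.shape
    attacked[:, top:top + h, left:left + w] = patch
    return attacked

# Usage: a random texture pasted at a fixed, assumed location.
img = torch.rand(3, 224, 224)
patch = torch.rand(3, 32, 32)
adv = superimpose_patch(img, patch, top=96, left=96)
```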
Optimizing a deep neural network is a fundamental task in computer vision, yet direct training methods often suffer from overfitting.
We focus on the problem of training a deep neural network in generations.
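As a hedged sketch of the generational setup (our illustration, not the paper's code; `make_model` and `train_one_model` are hypothetical helpers, and the distillation loss below is a common choice rather than necessarily the paper's): generation 0 trains on labels alone, and each later generation is additionally supervised by its frozen predecessor's predictions.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft targets from the teacher.

    T (temperature) and alpha (mixing weight) are assumed hyper-parameters;
    generational-training papers differ in how soft targets are shaped.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft

def train_in_generations(make_model, train_one_model, num_generations=3):
    """Generation 0 trains on labels alone; each later generation is
    supervised by the previous generation's frozen predictions.
    `make_model` and `train_one_model` are hypothetical helpers."""
    teacher = None
    for _ in range(num_generations):
        student = make_model()
        train_one_model(student, teacher, distill_loss)  # teacher may be None
        teacher = student.eval()
        for p in teacher.parameters():
            p.requires_grad_(False)
    return teacher
```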