In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a lightweight, low-latency network for mobile vision tasks?
While the Transformer architecture has become the de facto standard for natural language processing tasks, its applications to computer vision remain limited.
Ranked #1 on Image Classification on CIFAR-10 (using extra training data)
We propose a method that, given a single image with its estimated background, outputs the object's appearance and position in a series of sub-frames as if captured by a high-speed camera (i.e., temporal super-resolution).
Ranked #1 on Video Super-Resolution on Falling Objects
We present a novel method for local image feature matching.
We formulate it as a multi-agent reinforcement learning (MARL) problem, where each agent learns an augmentation policy for each patch based on its content together with the semantics of the whole image.
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers.
Ranked #3 on Image Captioning on COCO
We also find that mixup reduces the memorization of corrupt labels, increases the robustness to adversarial examples, and stabilizes the training of generative adversarial networks.
Ranked #7 on Domain Generalization on ImageNet-A
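For readers unfamiliar with mixup, the core operation is commonly described as a convex combination of two training examples and their labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `mixup_pair`, the flat-list input format, and the default `alpha=0.2` are illustrative choices.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples; inputs are flat feature lists and one-hot label lists.

    alpha is an assumed hyperparameter: the mixing weight is drawn
    from Beta(alpha, alpha), so small alpha keeps most blends close
    to one of the two original examples.
    """
    lam = rng.betavariate(alpha, alpha)
    # Apply the same convex combination to the inputs and to the labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


if __name__ == "__main__":
    x, y, lam = mixup_pair([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
    print(lam, x, y)
```

Because the labels are blended with the same weight as the inputs, the mixed label stays a valid probability vector whenever both original labels are.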
This work presents Kornia, an open source computer vision library built upon a set of differentiable routines and modules that aims to solve generic computer vision problems.
We introduce a novel loss for learning local feature descriptors which is inspired by Lowe's matching criterion for SIFT.
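Lowe's matching criterion is usually stated as a ratio test: a descriptor match is accepted only if its nearest neighbor is clearly closer than its second-nearest neighbor. The sketch below illustrates that test only, not the loss proposed in the paper; the function name `ratio_test_matches`, the Euclidean distance, and the `max_ratio=0.8` threshold are assumptions made for illustration.

```python
import math


def ratio_test_matches(query, reference, max_ratio=0.8):
    """Return (query_index, reference_index) pairs that pass the ratio test.

    query and reference are sequences of equal-length descriptor vectors.
    max_ratio is an assumed threshold; lower values are stricter.
    """
    matches = []
    for qi, q in enumerate(query):
        # Distance from this query descriptor to every reference descriptor.
        dists = sorted((math.dist(q, r), ri) for ri, r in enumerate(reference))
        if len(dists) < 2:
            continue  # The test needs a second-nearest neighbor.
        (d1, best), (d2, _) = dists[0], dists[1]
        # Accept only when the best match is clearly better than the runner-up.
        if d1 < max_ratio * d2:
            matches.append((qi, best))
    return matches


if __name__ == "__main__":
    query = [(0.0, 0.0), (5.0, 5.0)]
    reference = [(0.1, 0.0), (3.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
    print(ratio_test_matches(query, reference))
```

In this toy input the first query has one unambiguous neighbor and is kept, while the second has two equally close candidates and is rejected as ambiguous.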
This work presents Kornia -- an open source computer vision library which consists of a set of differentiable routines and modules to solve generic computer vision problems.