Recent years have witnessed a growing number of systems for distributed data-parallel training.
Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction.
In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a lightweight, low-latency network for mobile vision tasks?
Although many blind super-resolution methods attempt to restore low-resolution images with unknown and complex degradations, they still fall short on general real-world degraded images.
This paper presents a simple yet effective approach to modeling space-time correspondences in the context of video object segmentation.
Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks.