Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
Ranked #1 on Semantic Segmentation on NYU Depth v2
The ability of Generative Adversarial Networks (GANs) to encode rich semantics within their latent space has been widely exploited for facial image editing.
Model pre-training is a cornerstone of modern visual recognition systems.
Ranked #1 on Image Classification on Places365-Standard (using extra training data)
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.
Ranked #1 on Domain Generalization on ImageNet-Sketch (using extra training data)
Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate.
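To make the cost concrete, here is a minimal, hypothetical sketch of a coordinate-based MLP of the kind used as a neural graphics primitive: every query point pays the full cost of all the fully connected layers, and rendering or training touches millions of such points. The layer sizes and output semantics (e.g. RGB + density) are illustrative assumptions, not the actual architecture from any specific paper.

```python
# Hypothetical sketch: evaluating a small coordinate-based MLP (a stand-in
# for a "neural graphics primitive") on a batch of 3D query points.
import numpy as np

rng = np.random.default_rng(0)

# A small fully connected network: 3 -> 64 -> 64 -> 4 (e.g. RGB + density).
# Sizes are illustrative, not taken from any particular method.
sizes = [3, 64, 64, 4]
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(sizes, sizes[1:])]

def mlp(x):
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)  # ReLU hidden layers
    return x @ weights[-1]

points = rng.random((1024, 3))  # 1024 query coordinates
out = mlp(points)

# Every query pays the full matrix-multiply cost; with millions of samples
# per frame this is why training and evaluation are expensive.
flops_per_point = sum(2 * i * o for i, o in zip(sizes, sizes[1:]))
print(out.shape, flops_per_point)
```

Even this toy network costs about 9k FLOPs per query point, which is the motivation for methods that replace or accelerate the fully connected evaluation.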
The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing.
The International Symposium on Biomedical Imaging (ISBI) held a grand challenge to evaluate computational systems for the automated detection of metastatic breast cancer in whole slide images of sentinel lymph node biopsies.
A natural source for such attributes is the StyleSpace of StyleGAN, which is known to contain semantically meaningful dimensions that control attributes of the generated image.
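The kind of edit this enables can be sketched as perturbing a single latent channel while leaving the rest of the code untouched. The sketch below uses a plain NumPy vector as a stand-in for a StyleSpace code; the channel index and its supposed meaning are purely illustrative assumptions, not real StyleGAN channels.

```python
# Hypothetical sketch: editing one "style" channel of a latent code, mimicking
# how a semantically meaningful StyleSpace dimension might be manipulated.
# The 512-dim vector and channel index are illustrative, not StyleGAN's own.
import numpy as np

rng = np.random.default_rng(0)
style = rng.standard_normal(512)  # a stand-in latent style vector

def edit(style, channel, strength):
    """Return a copy of the code with one channel pushed along its direction."""
    edited = style.copy()
    edited[channel] += strength
    return edited

# Assumed-for-illustration: pretend channel 42 controls some attribute.
edited = edit(style, channel=42, strength=3.0)

# Only the chosen dimension changes; all other channels are untouched.
print(np.count_nonzero(edited != style))  # 1
```

The design point is locality: because each StyleSpace dimension is claimed to be semantically meaningful, a one-channel edit corresponds to changing one attribute rather than the whole image.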
Transformers have attracted increasing interest in computer vision, but they still fall behind state-of-the-art convolutional networks.
Ranked #1 on Image Classification on ImageNet (using extra training data)