The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.
The second edition of Deep Learning Interviews contains hundreds of fully solved problems drawn from a wide range of key topics in AI.
For the first time, we train a detector on all twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning.
We introduce Plenoxels (plenoptic voxels), a system for photorealistic view synthesis.
We present an efficient method for joint optimization of topology, materials and lighting from multi-view image observations.
We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases, each providing a unified representation of the appearance of an object (or background) over the video.
On Something-Something V1 and V2, our UniFormer achieves new state-of-the-art top-1 accuracy of 60.9% and 71.2%, respectively.
In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a lightweight, low-latency network for mobile vision tasks?
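As a toy illustration of that design question (this is not the paper's actual architecture; every function below is a hypothetical sketch), a hybrid block can sum a local depthwise-convolution branch, which captures nearby spatial structure as in a CNN, with a global self-attention branch over all spatial tokens, as in a ViT:

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Naive 3x3 depthwise convolution with zero padding.
    x: (H, W, C) feature map; w: (3, 3, C) per-channel kernels."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i+3, j:j+3] * w, axis=(0, 1))
    return out

def self_attention(x):
    """Single-head self-attention over the H*W spatial tokens
    (identity Q/K/V projections for brevity). x: (H, W, C) -> (H, W, C)."""
    H, W, C = x.shape
    t = x.reshape(H * W, C)                  # flatten to tokens
    scores = t @ t.T / np.sqrt(C)            # (HW, HW) pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
    return (attn @ t).reshape(H, W, C)

def hybrid_block(x, w):
    """Residual sum of a local (conv) branch and a global (attention)
    branch -- the generic CNN+ViT fusion pattern, not a specific model."""
    return x + depthwise_conv3x3(x, w) + self_attention(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w = rng.standard_normal((3, 3, 16)) * 0.1
y = hybrid_block(x, w)
print(y.shape)  # (8, 8, 16)
```

The conv branch costs O(HWC) while the attention branch costs O((HW)^2 C), which is why mobile-oriented hybrids restrict where and how often global attention is applied.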