We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training.
Ranked #2 on Object Detection on LVIS v1.0 val
The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
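The masking step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `random_mask` and the flat `(num_patches, patch_dim)` layout are assumptions for the example.

```python
import numpy as np

def random_mask(patches, mask_ratio=0.75, rng=None):
    """Randomly mask out a fraction of patches, MAE-style.

    patches: array of shape (num_patches, patch_dim).
    Returns the visible patches plus the kept and masked indices;
    an encoder would see only the visible subset, and a decoder
    would reconstruct pixels at the masked positions.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx
```

With the default 75% ratio, only a quarter of the patches are processed by the encoder, which is what makes this style of pre-training cheap.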
We present a large-scale study on unsupervised spatiotemporal representation learning from videos.
Ranked #2 on Self-Supervised Action Recognition on HMDB51
In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT.
Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing.
Ranked #69 on Self-Supervised Image Classification on ImageNet
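The role of the stop-gradient can be shown with a small NumPy sketch of a symmetrized negative-cosine loss. This is an illustrative stand-in, not the paper's code: in a real framework the target branch would be wrapped in `stop_gradient`/`detach`, which plain NumPy cannot express.

```python
import numpy as np

def neg_cosine(p, z):
    # In an autodiff framework, z would be passed through stop_gradient
    # (detach) so that no gradient flows through the target branch --
    # the operation the study finds essential to prevent collapse.
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(p @ z)

def symmetrized_loss(p1, z1, p2, z2):
    # Each predictor output is compared against the (stop-gradient)
    # projection of the other augmented view.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```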
In this work, we present a new network design paradigm.
Existing neural network architectures in computer vision -- whether designed by humans or by machines -- were typically found using both images and their associated labels.
Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR.

Ranked #3 on Contrastive Learning on imagenet-1k
We present a new method for efficient high-quality image segmentation of objects and scenes.
Ranked #3 on Instance Segmentation on COCO 2017 val
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Ranked #1 on Video Classification on Charades
This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
Ranked #11 on Contrastive Learning on imagenet-1k
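The on-the-fly dictionary can be pictured as a fixed-size FIFO queue of encoded keys: each new mini-batch is enqueued and the oldest batch evicted, so dictionary size is decoupled from batch size. A minimal sketch, assuming features are NumPy vectors; the class name `FeatureQueue` is illustrative.

```python
from collections import deque
import numpy as np

class FeatureQueue:
    """Fixed-size FIFO dictionary of encoded keys, in the spirit of
    momentum-contrast training: enqueue the current batch, and the
    deque's maxlen silently evicts the oldest entries."""

    def __init__(self, max_size):
        self.queue = deque(maxlen=max_size)

    def enqueue(self, keys):
        self.queue.extend(keys)

    def as_array(self):
        # Stack the current dictionary into one (size, dim) array,
        # e.g. to serve as the set of negatives in a contrastive loss.
        return np.stack(list(self.queue))
```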
Current 3D object detection methods are heavily influenced by 2D detectors.
Ranked #11 on 3D Object Detection on SUN-RGBD val
In this paper, we explore a more diverse set of connectivity patterns through the lens of randomly wired neural networks.
Ranked #114 on Neural Architecture Search on ImageNet
To formalize this, we treat dense instance segmentation as a prediction task over 4D tensors and present a general framework called TensorMask that explicitly captures this geometry and enables novel operators on 4D tensors.
Ranked #76 on Instance Segmentation on COCO test-dev
In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks.
Ranked #4 on Panoptic Segmentation on KITTI Panoptic Segmentation
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
Ranked #4 on Action Recognition on AVA v2.1
This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks.
We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels.
We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization.
Ranked #64 on Object Detection on COCO minival
ImageNet classification is the de facto pretraining task for these models.
Ranked #185 on Image Classification on ImageNet
We propose and study a task we name panoptic segmentation (PS).
Ranked #21 on Panoptic Segmentation on Cityscapes val (using extra training data)
We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data.
Most methods for object instance segmentation require all training examples to be labeled with segmentation masks.
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.
Ranked #8 on Action Classification on Toyota Smarthome dataset (using extra training data)
The objects are connected by two types of edges which correspond to two types of invariance: "different instances but a similar viewpoint and category" and "different viewpoints of the same instance".
Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
Ranked #3 on Long-tail Learning on EGTEA
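The down-weighting mechanism can be written out directly: the standard cross-entropy term is scaled by a modulating factor (1 - p_t)^gamma, so well-classified examples contribute almost nothing. A minimal binary-classification sketch; `gamma=2.0` and `alpha=0.25` follow the commonly reported defaults, and the scalar interface is an assumption for clarity.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive class, y: label in {0, 1}.
    The (1 - p_t)**gamma factor down-weights easy examples so the vast
    number of easy negatives cannot overwhelm training.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0` recovers (alpha-weighted) cross-entropy; increasing gamma sharpens the focus on hard examples.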
To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training.
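Both rules are simple enough to state as one-liners. A sketch under the usual conventions (a reference learning rate tuned at a base batch size, and a linear ramp over the warmup period); the function names and the example values in the comments are illustrative.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: when the minibatch size is multiplied by k,
    multiply the learning rate by k (no extra hyper-parameters)."""
    return base_lr * batch / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    """Gradual warmup: ramp linearly from ~0 to the target rate over the
    first warmup_steps iterations, to overcome optimization challenges
    early in training; afterwards use the target rate."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps
```

For example, a rate of 0.1 tuned for batch size 256 scales to 3.2 at batch size 8192, reached only after the warmup ramp completes.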
Our hypothesis is that the appearance of a person -- their pose, clothing, action -- is a powerful cue for localizing the objects they are interacting with.
Ranked #38 on Human-Object Interaction Detection on HICO-DET
Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
Ranked #1 on Keypoint Estimation on GRIT
Feature pyramids are a basic component in recognition systems for detecting objects at different scales.
Ranked #3 on Pedestrian Detection on TJU-Ped-campus
Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set.
Ranked #3 on Image Classification on GasHisSDB
Detecting pedestrians has arguably been addressed as a special topic beyond general object detection.
Ranked #17 on Pedestrian Detection on Caltech
In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image.
Ranked #4 on Real-Time Object Detection on PASCAL VOC 2007
Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure.
In contrast to the previous FCN that generates one score map, our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances.
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors.
Ranked #16 on Image Classification on Kuzushiji-MNIST
We develop an algorithm for the nontrivial end-to-end training of this causal, cascaded structure.
Ranked #3 on Multi-Human Parsing on PASCAL-Part
Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
Ranked #5 on Real-Time Object Detection on PASCAL VOC 2007
The image projections will turn the straight lines into curved "geodesic lines", and it is fundamentally impossible to keep all these lines straight.
This paper aims to accelerate the test-time computation of convolutional neural networks (CNNs), especially very deep CNNs that have substantially impacted the computer vision community.
We discover that aside from deep feature maps, a deep and convolutional per-region classifier is of particular importance for object detection, whereas the latest superior image classification models (such as ResNets and GoogLeNets) do not directly lead to good detection accuracy without using such a per-region classifier.
Recent leading approaches to semantic segmentation rely on deep convolutional networks trained with human-annotated, pixel-level segmentation masks.
Ranked #46 on Semantic Segmentation on PASCAL VOC 2012 test
In this work, we study rectifier neural networks for image classification from two aspects.
We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network.
Ranked #2 on Video Super-Resolution on Xiph HD - 4x upscaling
The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions.
Ranked #59 on Semantic Segmentation on PASCAL Context
This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale.
Ranked #24 on Object Detection on PASCAL VOC 2007
We propose a novel Affinity-Preserving K-means algorithm which simultaneously performs k-means clustering and learns the binary indices of the quantized cells.
Product quantization is an effective vector quantization approach to compactly encode high-dimensional vectors for fast approximate nearest neighbor (ANN) search.
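The encoding step of product quantization can be sketched directly: split a D-dimensional vector into M subvectors and quantize each against its own small sub-codebook, so the full codebook size K^M is reached while storing only M*K sub-codewords. This is a minimal illustration of plain PQ (not the Affinity-Preserving K-means variant above); the function names and array layout are assumptions.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product quantization encoding.

    x: (D,) vector; codebooks: (M, K, D // M) array of sub-codebooks.
    Returns M integer codes, one nearest sub-codeword index per subvector.
    """
    M, K, d = codebooks.shape
    subs = x.reshape(M, d)
    codes = []
    for m in range(M):
        dists = np.linalg.norm(codebooks[m] - subs[m], axis=1)
        codes.append(int(np.argmin(dists)))
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen sub-codewords."""
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])
```

For ANN search, distances to a query are then computed per sub-codebook once and summed over the M codes, rather than against every full-dimensional database vector.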