This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training.
Ranked #2 on Object Detection on LVIS v1.0 val
Model pre-training is a cornerstone of modern visual recognition systems.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W (using extra training data)
The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive.
1 code implementation • 18 Nov 2021 • Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W
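The random-patch masking at the heart of MAE can be sketched in a few lines. The 75% mask ratio and the 14×14 patch grid for a 224×224 input are the paper's defaults, but the function name and seeded RNG here are illustrative only:

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Pick a random subset of patch indices to mask (MAE-style).

    Returns (visible, masked) index lists; only the visible patches
    would be fed to the encoder, and the decoder reconstructs the rest.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# A 224x224 image with 16x16 patches gives 14*14 = 196 patches;
# at a 75% ratio, 147 are masked and only 49 reach the encoder.
visible, masked = random_masking(196)
```

Encoding only the small visible subset is what makes the high mask ratio cheap as well as effective.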
To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart in which we replace the ViT stem with a small number of stacked stride-two 3×3 convolutions.
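As a sanity check on why a short stack of stride-two convolutions can stand in for a 16×16 patchify stem, the standard convolution output-size formula shows that four such layers give the same overall 16× downsampling (the layer count of four and padding of 1 are assumptions for this sketch):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    # Standard convolution output-size formula.
    return (size + 2 * padding - kernel) // stride + 1

size = 224
for _ in range(4):  # four stacked stride-two 3x3 convolutions
    size = conv_out(size)
# 224 -> 112 -> 56 -> 28 -> 14: the same 14x14 token grid that a
# 16x16 patchify stem produces from a 224x224 input.
```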
We present a large-scale study on unsupervised spatiotemporal representation learning from videos.
Ranked #2 on Self-Supervised Action Recognition on HMDB51
We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects.
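A toy illustration of that sensitivity, using pixel-set masks and a simplified boundary region (mask pixels within Chebyshev distance d of the complement); the published metric dilates contours by a distance proportional to the image diagonal, so this is only a sketch:

```python
def mask_iou(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def boundary_region(mask, d=1):
    """Mask pixels within Chebyshev distance d of the complement."""
    region = set()
    for (x, y) in mask:
        if any((x + dx, y + dy) not in mask
               for dx in range(-d, d + 1)
               for dy in range(-d, d + 1)):
            region.add((x, y))
    return region

def boundary_iou(a, b, d=1):
    return mask_iou(boundary_region(a, d), boundary_region(b, d))

# A 20x20 square versus the same square shifted by one pixel:
# the masks overlap heavily, but their boundaries barely do.
a = {(x, y) for x in range(20) for y in range(20)}
b = {(x + 1, y) for (x, y) in a}
```

For this pair, Mask IoU stays above 0.9 while Boundary IoU drops to about a third, which is exactly the kind of boundary error the measure is designed to expose on large objects.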
On one hand, this is desirable as it treats all classes equally.
no code implementations • 16 May 2020 • Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdel-rahman Mohamed
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems.
In this work, we present a new network design paradigm.
Ranked #1 on Out-of-Distribution Generalization on ImageNet-W
Existing neural network architectures in computer vision -- whether designed by humans or by machines -- were typically found using both images and their associated labels.
Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR.
Ranked #3 on Contrastive Learning on imagenet-1k
We present a new method for efficient high-quality image segmentation of objects and scenes.
Ranked #3 on Instance Segmentation on COCO 2017 val
We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
Ranked #1 on Video Classification on Charades
This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
Ranked #11 on Contrastive Learning on imagenet-1k
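The dictionary-as-queue idea can be sketched with a FIFO buffer. Real MoCo enqueues momentum-encoder feature vectors and uses them as negatives in a contrastive loss; this toy uses strings, and the `KeyQueue` class name is made up for illustration:

```python
from collections import deque

class KeyQueue:
    """FIFO dictionary of encoded keys, as in MoCo's memory queue.

    Each training step enqueues the current batch of keys and evicts
    the oldest batch, keeping the dictionary size fixed and decoupled
    from the minibatch size.
    """
    def __init__(self, max_size):
        self.max_size = max_size
        self.keys = deque()

    def enqueue_dequeue(self, batch_keys):
        self.keys.extend(batch_keys)
        while len(self.keys) > self.max_size:
            self.keys.popleft()

q = KeyQueue(max_size=8)
for step in range(5):
    q.enqueue_dequeue([f"k{step}-{i}" for i in range(4)])  # batch of 4 keys
```

After five steps the queue holds only the two most recent batches, so the dictionary stays large relative to the batch yet its contents are always recently encoded.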
no code implementations • 27 Oct 2019 • Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data.
The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles.
Ranked #3 on Visual Reasoning on PHYRE-1B-Within
We plan to collect ~2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images.
In this paper, we explore a more diverse set of connectivity patterns through the lens of randomly wired neural networks.
Ranked #114 on Neural Architecture Search on ImageNet
To formalize this, we treat dense instance segmentation as a prediction task over 4D tensors and present a general framework called TensorMask that explicitly captures this geometry and enables novel operators on 4D tensors.
Ranked #76 on Instance Segmentation on COCO test-dev
In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks.
Ranked #4 on Panoptic Segmentation on KITTI Panoptic Segmentation
To understand the world, we humans constantly need to relate the present to the past, and put events in context.
Ranked #4 on Action Recognition on AVA v2.1
We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization.
Ranked #64 on Object Detection on COCO minival
ImageNet classification is the de facto pretraining task for these models.
Ranked #185 on Image Classification on ImageNet
Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views.
We propose and study a task we name panoptic segmentation (PS).
Ranked #21 on Panoptic Segmentation on Cityscapes val (using extra training data)
We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data.
We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.
Most methods for object instance segmentation require all training examples to be labeled with segmentation masks.
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.
Ranked #8 on Action Classification on Toyota Smarthome dataset (using extra training data)
Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
Ranked #3 on Long-tail Learning on EGTEA
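The loss itself is compact: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), with the paper's defaults alpha = 0.25 and gamma = 2. A minimal binary version, compared against plain cross-entropy:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights well-classified examples."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    return -math.log(p if y == 1 else 1 - p)

# An easy negative (p = 0.1, true label 0) contributes over 100x less
# to the focal loss than to cross-entropy, so the flood of easy
# backgrounds no longer dominates training.
ratio = cross_entropy(0.1, 0) / focal_loss(0.1, 0)
```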
To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training.
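A sketch of those two pieces together; the base LR of 0.1 per 256 images matches the paper's ImageNet/ResNet setting, while the step-based warmup length here is an arbitrary stand-in for the paper's gradual 5-epoch warmup:

```python
def learning_rate(step, batch_size, base_lr=0.1, base_batch=256,
                  warmup_steps=1000):
    """Linear scaling rule with linear warmup.

    The target LR scales linearly with minibatch size; during warmup
    the LR ramps linearly from near zero up to the target, avoiding
    the optimization instability seen early in large-batch training.
    """
    target = base_lr * batch_size / base_batch
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target

# With a minibatch of 8192 images the post-warmup LR is 0.1 * 32 = 3.2.
lr_large = learning_rate(5000, batch_size=8192)
```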
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes.
Ranked #5 on Visual Question Answering on CLEVR-Humans
Our hypothesis is that the appearance of a person -- their pose, clothing, action -- is a powerful cue for localizing the objects they are interacting with.
Ranked #38 on Human-Object Interaction Detection on HICO-DET
Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
Ranked #1 on Keypoint Estimation on GRIT
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings.
Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature.
Feature pyramids are a basic component in recognition systems for detecting objects at different scales.
Ranked #3 on Pedestrian Detection on TJU-Ped-campus
Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set.
Ranked #3 on Image Classification on GasHisSDB
Low-shot visual learning -- the ability to recognize novel object categories from very few examples -- is a hallmark of human visual intelligence.
1 code implementation • Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling.
As 3D movie viewing becomes mainstream and the Virtual Reality (VR) market emerges, the demand for 3D content is growing rapidly.
Our motivation is the same as it has always been -- detection datasets contain an overwhelming number of easy examples and a small number of hard examples.
Ranked #6 on Face Verification on Trillion Pairs Dataset
When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention.
In this paper we present the Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest.
Ranked #225 on Object Detection on COCO test-dev
Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms.
Ranked #4 on Image Clustering on CMU-PIE
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
Ranked #1 on Real-Time Object Detection on PASCAL VOC 2007
In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
Ranked #5 on Real-Time Object Detection on PASCAL VOC 2007
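The RPN slides over the shared feature map and, at every location, scores and regresses a small set of reference boxes ("anchors"). The 3 scales × 3 aspect ratios = 9 anchors per location below follow the paper; the generation code itself is an illustrative sketch:

```python
def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchors for every feature-map cell.

    Each anchor has area scale**2 and height/width ratio r, centered
    on the image-space position of its feature-map cell.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = j * stride + stride / 2
            cy = i * stride + stride / 2
            for s in scales:
                for r in ratios:
                    h = s * r ** 0.5
                    w = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

# A 14x14 feature map (224x224 input at stride 16) yields 14*14*9 anchors,
# all produced from features computed once for the whole image.
anchors = make_anchors(14, 14)
```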
In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system.
Ranked #4 on Weakly Supervised Object Detection on HICO-DET
We discover that, aside from deep feature maps, a deep and convolutional per-region classifier is of particular importance for object detection, whereas the latest superior image classification models (such as ResNets and GoogLeNets) do not directly lead to good detection accuracy without such a per-region classifier.
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as feature representation.
Deformable part models (DPMs) and convolutional neural networks (CNNs) are two widely used tools for visual recognition.
Ranked #26 on Object Detection on PASCAL VOC 2007
In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features.
Ranked #6 on Object Detection In Indoor Scenes on SUN RGB-D
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories.
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts.
Ranked #59 on Fine-Grained Image Classification on CUB-200-2011
In the last two years, convolutional neural networks (CNNs) have achieved an impressive suite of results on standard recognition datasets and tasks.
Unlike classical semantic segmentation, we require individual object instances.
Ranked #3 on Object Detection on PASCAL VOC 2012
The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
We present convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images.
no code implementations • Andrea Vedaldi, Siddharth Mahendran, Stavros Tsogkas, Subhransu Maji, Ross Girshick, Juho Kannala, Esa Rahtu, Iasonas Kokkinos, Matthew B. Blaschko, David Weiss, Ben Taskar, Karen Simonyan, Naomi Saphra, Sammy Mohamed
We show that the collected data can be used to study the relation between part detection and attribute prediction by diagnosing the performance of classifiers that pool information from different parts of an object.
A k-poselet is a deformable part model (DPM) with k parts, where each of the parts is a poselet, aligned to a specific configuration of keypoints based on ground-truth annotations.
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.
Convolutional Neural Networks (CNNs) can provide accurate object classification.
Learning to localize objects with minimal supervision is an important problem in computer vision, since large fully annotated datasets are extremely costly to obtain.
Ranked #35 on Weakly Supervised Object Detection on PASCAL VOC 2007
We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset.
Ranked #25 on Object Detection on PASCAL VOC 2007