Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view.
Despite these differences, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
In particular, we demonstrate that a NeRF representation of a scene can be used to train dense object descriptors.
We propose OpenSeg to address the above issue while still making use of scalable image-level supervision of captions.
In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy.
The results suggest self-training is a promising direction to aggregate labeled and unlabeled training data for learning general feature representations.
3D perception of object shapes from RGB image input is fundamental to semantic scene understanding, grounding image-based perception in our three-dimensional real-world environments.
In this paper, we identify the underlying problem: the binary classifiers in existing proposal methods tend to overfit to the training categories.
With just a small amount of robotic experience, we can further fine-tune the affordance model to achieve better results.
We benchmark these improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.
In this paper, we explore the use of Single Image Texture Translation (SITT) for data augmentation.
On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.
Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 1.64x faster in compute time than the popular EfficientNet models on TPU-v3 hardware.
Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
We then show that for complex real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF.
Furthermore, SpineNet is built with a uniform resource distribution over operations.
Robust multi-object tracking (MOT) is a prerequisite for the safe deployment of self-driving cars.
We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses.
For example, on the COCO object detection dataset, pre-training helps when we use one-fifth of the labeled data, but hurts accuracy when we use all of the labeled data.
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames.
We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
We propose MnasFPN, a mobile-friendly search space for the detection head, and combine it with latency-aware architecture search to produce efficient object detection models.
Importantly, the best policy found on COCO may be transferred unchanged to other detection datasets and models to improve predictive accuracy.
Here we aim to learn a better feature pyramid network architecture for object detection.
However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations are required.
We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss.
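As a hedged illustration (not the authors' released code), the re-weighting can be sketched in a few lines of NumPy: the effective number of samples for a class with n training examples is E_n = (1 - beta^n) / (1 - beta), and the class weight is its inverse. The beta value and the normalization convention below are common choices assumed for the example.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Per-class loss weights from the effective number of samples.

    samples_per_class: array of training-example counts, one entry per class.
    beta: hyperparameter controlling how quickly re-weighting saturates.
    """
    samples_per_class = np.asarray(samples_per_class, dtype=np.float64)
    # Effective number of samples: E_n = (1 - beta^n) / (1 - beta)
    effective_num = (1.0 - np.power(beta, samples_per_class)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes (one common convention)
    return weights * len(samples_per_class) / weights.sum()

# Example: a head class with 10,000 samples gets a much smaller weight
# than a tail class with only 10 samples.
print(class_balanced_weights([10000, 1000, 10]))
```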
Dropout's lack of success for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated, so information can still flow through convolutional networks despite dropout.
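A hypothetical sketch of that intuition, contrasting unit-wise dropout with a block-structured mask (in the spirit of structured dropout on feature maps). The mask sizes and block placement below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
drop_prob = 0.3

# Unit-wise dropout: zeros are scattered, so neighbouring (spatially
# correlated) units still carry nearly the same information.
unit_mask = (rng.random((H, W)) > drop_prob).astype(float)

# Block-wise dropout: zero out a whole contiguous region, removing
# correlated neighbours together and forcing the network to use
# evidence from elsewhere in the feature map.
block_mask = np.ones((H, W))
block = 3
top = rng.integers(0, H - block)
left = rng.integers(0, W - block)
block_mask[top:top + block, left:left + block] = 0.0

print(unit_mask, block_mask, sep="\n\n")
```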
Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
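A minimal NumPy sketch of the binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where the modulating factor (1 - p_t)^gamma down-weights well-classified examples; gamma = 2 and alpha = 0.25 are the settings commonly reported for this loss, used here as defaults.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted foreground probabilities in (0, 1).
    y: binary ground-truth labels (1 = object, 0 = background).
    """
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)**gamma is ~0 for easy examples and ~1 for hard ones,
    # so the huge pool of easy negatives contributes little to the loss.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy negative (p = 0.01) is down-weighted far below a hard one (p = 0.6).
print(focal_loss(np.array([0.01, 0.6]), np.array([0, 0])))
```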
Metric learning algorithms produce distance metrics that capture the important relationships among data.
Feature pyramids are a basic component in recognition systems for detecting objects at different scales.
To address these challenges, we test three modifications to the standard Fast R-CNN object detector: (1) skip connections that give the detector access to features at multiple network layers, (2) a foveal structure to exploit object context at multiple object resolutions, and (3) an integral loss function and corresponding network adjustment that improve localization.
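To make the foveal structure concrete, a hypothetical sketch of generating concentric context boxes around one proposal; the scale factors below follow the commonly cited 1x/1.5x/2x/4x setup, and the function name is made up for illustration. Each returned box would be pooled and classified jointly to give the detector object context at multiple resolutions.

```python
def foveal_boxes(box, scales=(1.0, 1.5, 2.0, 4.0)):
    """Concentric context boxes around one proposal.

    box: (x1, y1, x2, y2) proposal coordinates.
    Returns one enlarged box per scale, all centered on the proposal,
    so features can be pooled with increasing amounts of surrounding context.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    return [
        (cx - s * w / 2.0, cy - s * h / 2.0, cx + s * w / 2.0, cy + s * h / 2.0)
        for s in scales
    ]

# Example proposal box; the four crops share a center and grow outward.
print(foveal_boxes((100, 100, 150, 180)))
```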
In this work we propose to augment feedforward nets for object segmentation with a novel top-down refinement approach.
Most approaches predict the location of a query image by matching to ground-level images with known locations (e.g., street-view data).
In this paper we describe the Microsoft COCO Caption dataset and evaluation server.
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.
On the other hand, there is no shortage of visual and geographic data that densely covers the Earth; we examine overhead imagery and land-cover survey data, but the relationship between these data and ground-level query photographs is complex.