Additionally, IRTF can generate pseudo input regions for the REC task, providing a uniform way for REC and REG to share an identical representation space.
Then, by using a SIM(3)-invariant shape descriptor, we gracefully decouple the shape and pose of an object, thus supporting latent shape optimization of target objects in arbitrary poses.
Masked autoencoders have become popular training paradigms for self-supervised visual representation learning.
Can a robot autonomously learn to design and construct a bridge from varying-sized blocks without a blueprint?
With the estimated distance map, the agent can simultaneously explore the environment and navigate to the target objects using a simple human-designed strategy.
We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer.
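A minimal sketch of the idea behind masked prediction against an online tokenizer: the student's predictions on masked patches are matched to the (stop-gradient) teacher's soft token distribution via cross-entropy. All names, temperatures, and shapes here are illustrative assumptions, not iBOT's actual implementation.

```python
import numpy as np

def masked_token_loss(student_logits, teacher_logits, mask,
                      tau_s=0.1, tau_t=0.05):
    """Cross-entropy between student predictions and the online
    tokenizer's (teacher's) soft distribution, on masked patches only."""
    def softmax(x, tau):
        z = x / tau
        z = z - z.max(axis=-1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    teacher = softmax(teacher_logits, tau_t)   # online tokenizer output
    student = softmax(student_logits, tau_s)
    ce = -(teacher * np.log(student + 1e-12)).sum(axis=-1)  # per-patch CE
    return (ce * mask).sum() / max(mask.sum(), 1.0)  # average over masked

rng = np.random.default_rng(0)
loss = masked_token_loss(rng.standard_normal((196, 512)),
                         rng.standard_normal((196, 512)),
                         (rng.random(196) < 0.4).astype(float))
print(loss > 0)  # True: cross-entropy is non-negative
```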
To solve this problem, we propose to maximize the mutual information between the input and the class predictions.
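The mutual-information objective can be sketched as I(X; Y) = H(E[p]) − E[H(p)]: the marginal entropy term encourages balanced class usage, while the conditional entropy term encourages confident per-input predictions. This is a generic illustration of the objective, not the paper's exact estimator.

```python
import numpy as np

def mutual_information(probs, eps=1e-12):
    """Estimate I(X; Y) from a batch of class-probability rows:
    entropy of the mean prediction minus mean per-row entropy."""
    marginal = probs.mean(axis=0)                                   # p(y)
    h_marginal = -np.sum(marginal * np.log(marginal + eps))         # H(E[p])
    h_conditional = -np.mean(np.sum(probs * np.log(probs + eps),
                                    axis=1))                        # E[H(p)]
    return h_marginal - h_conditional  # quantity to maximize

# Confident, class-balanced predictions score high; uniform ones score ~0.
confident = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
print(mutual_information(confident) > mutual_information(uncertain))  # True
```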
The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces.
Separating 3D point clouds into individual instances is an important task for 3D vision.
In this task, the robot needs to first design a feasible bridge architecture for arbitrarily wide cliffs and then manipulate the blocks reliably to construct a stable bridge according to the proposed design.
In a unified framework, we jointly predict the feasible 6-DoF grasp poses, instance semantic segmentation, and collision information.
Besides instance segmentation, our method yields state-of-the-art results in object detection (from our mask byproduct) and panoptic segmentation.
In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the currently learned policy, and an M-step that simultaneously updates the agent's low- and high-level policies to minimize the newly proposed option-occupancy measure between the expert and the agent.
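The alternation can be sketched as a toy EM loop over tabular policies: the E-step samples a latent option for each expert transition under the current model, and the M-step refits both policy levels to the option-labeled data by counting. Everything here (function name, count-based policies) is a hypothetical simplification, not the paper's implementation.

```python
import random

def em_option_imitation(expert_trajs, n_options, n_iters=10, seed=0):
    """Schematic EM loop for option-based imitation.
    high: counts for (state, option); low: counts for (option, state, action)."""
    rng = random.Random(seed)
    high, low = {}, {}
    for _ in range(n_iters):
        # E-step: sample an option for each expert (state, action) pair,
        # weighted by the current high- and low-level policy counts.
        labeled = []
        for state, action in expert_trajs:
            weights = [high.get((state, z), 1.0) *
                       low.get((z, state, action), 1.0)
                       for z in range(n_options)]
            r, z = rng.random() * sum(weights), 0
            while r > weights[z]:
                r -= weights[z]
                z += 1
            labeled.append((z, state, action))
        # M-step: refit both levels from the option-labeled transitions.
        high, low = {}, {}
        for z, state, action in labeled:
            high[(state, z)] = high.get((state, z), 0.0) + 1.0
            low[(z, state, action)] = low.get((z, state, action), 0.0) + 1.0
    return high, low
```

Real implementations replace the count tables with neural policies and the M-step counting with gradient updates on the option-occupancy objective.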
We propose Scale-aware AutoAug to learn data augmentation policies for object detection.
Referring image segmentation aims to segment the objects referred to by a natural language expression.
In our method, however, a fixed sparse set of $N$ learned object proposals is provided to the object recognition head to perform classification and localization.
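The core data structure can be sketched as follows: instead of enumerating dense anchors, the detector holds $N$ proposal boxes and features as learnable parameters. The pooling and head below are toy stand-ins (a real head uses RoIAlign and dynamic instance interaction), and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100  # number of learned proposals (a hyperparameter)
# Each proposal is a normalized box (cx, cy, w, h), initialized to cover
# the whole image; in training these would be updated by gradients.
proposal_boxes = np.tile(np.array([0.5, 0.5, 1.0, 1.0]), (N, 1))
proposal_feats = rng.standard_normal((N, 256))  # one feature per proposal

def forward(image_feats):
    """Toy forward pass: pool one feature vector from the image and
    score every learned proposal against it."""
    pooled = image_feats.mean(axis=(0, 1))   # (C,) global mean pooling
    logits = proposal_feats @ pooled         # (N,) per-proposal scores
    return proposal_boxes, logits

boxes, logits = forward(rng.standard_normal((32, 32, 256)))
print(boxes.shape, logits.shape)  # (100, 4) (100,)
```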
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (less than 1% slower) but demonstrates consistently superior performance when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation, and outperforms state-of-the-art methods by a large margin.
Importantly, we take one step further by dynamically learning the mask head of the object segmenter such that the mask head is conditioned on the location.
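A location-conditioned mask head can be sketched like this: a controller at each instance's location predicts a flat parameter vector, which is unpacked into the weights of a small per-instance network applied to shared mask features. This uses 1×1 convolutions (plain matrix products) for simplicity; shapes and names are illustrative assumptions.

```python
import numpy as np

def dynamic_mask_head(mask_feats, params, channels=8):
    """Apply a tiny per-instance mask head whose weights come from
    `params` (predicted per instance, conditioned on its location).
    Two 1x1-conv layers: C -> channels -> 1, with a ReLU in between."""
    c = mask_feats.shape[-1]
    w1 = params[: c * channels].reshape(c, channels)
    b1 = params[c * channels: c * channels + channels]
    w2 = params[c * channels + channels: c * channels + 2 * channels]
    b2 = params[-1]
    hidden = np.maximum(mask_feats @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2                          # (H, W) mask logits

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 16, 4))             # shared mask features
n_params = 4 * 8 + 8 + 8 + 1                         # sizes of w1, b1, w2, b2
mask = dynamic_mask_head(feats, rng.standard_normal(n_params))
print(mask.shape)  # (16, 16)
```

Because the filter weights differ per instance, each predicted head segments only its own object from the shared feature map.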
While almost all state-of-the-art object detectors utilize predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, their performance and generalization ability are also limited by the design of those anchors.
We present a new, embarrassingly simple approach to instance segmentation in images.
In this paper, we first analyse the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground depth and background depth using separate optimization objectives and depth decoders.
Different functional areas of the human brain play different roles in brain activity, which has not been paid sufficient research attention in the brain-computer interface (BCI) field.
In FoveaBox, an instance is assigned to adjacent feature levels to make the model more accurate. We demonstrate its effectiveness on standard benchmarks and report extensive experimental analysis.
In this paper, we begin by investigating current feature pyramid solutions, and then reformulate the feature pyramid construction as a feature reconfiguration process.
As a new classification platform, deep learning has recently received increasing attention from researchers and has been successfully applied to many domains.
To address (a), we design the reverse connection, which enables the network to detect objects on multi-levels of CNNs.
Almost all of the current top-performing object detection networks employ region proposals to guide the search for object instances.