Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
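As a rough illustration (not the paper's implementation), the sketch below shows one way a gate could pick a pooling granularity per feature window and emit the corresponding number of queries; the module name, candidate granularities, and the Gumbel-softmax selection are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGrainedGate(nn.Module):
    """Toy region-adaptive query assignment.

    For a feature window, a lightweight gate scores candidate granularities
    (1x1, 2x2, 4x4 queries) and the window is summarized accordingly.
    """
    def __init__(self, dim, granularities=(1, 2, 4)):
        super().__init__()
        self.granularities = granularities
        self.gate = nn.Linear(dim, len(granularities))

    def forward(self, x):                            # x: (B, C, H, W) window
        summary = x.mean(dim=(2, 3))                 # (B, C) window descriptor
        probs = F.gumbel_softmax(self.gate(summary), hard=True)  # one-hot (B, K)
        candidates = []
        for k, g in enumerate(self.granularities):
            q = F.adaptive_avg_pool2d(x, g).flatten(2).transpose(1, 2)  # (B, g*g, C)
            candidates.append((probs[:, k, None, None] * q, q.shape[1]))
        # Zero-pad to a common length so the selection stays differentiable;
        # only the chosen granularity contributes non-zero queries.
        max_n = max(n for _, n in candidates)
        return sum(F.pad(q, (0, 0, 0, max_n - n)) for q, n in candidates)
```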
In this work, we focus on capturing the data-inherent uncertainty (a.k.a. aleatoric uncertainty) in segmentation, typically when ambiguities exist in input images.
In this work, we focus on learning a VLP model with sequential chunks of image-text pair data.
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability, which achieves promising accuracy for zero-shot classification.
In this paper, we study the problem of one-shot skeleton-based action recognition, which poses unique challenges in learning transferable representation from base classes to novel classes, particularly for fine-grained actions.
In this paper, we aim to address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations.
Multi-modal medical image completion has been extensively applied to alleviate the missing modality issue in a wealth of multi-modal diagnostic tasks.
Our insight is to utilize mutual information to measure the relation between seen and unseen classes in a restricted label space; maximizing this mutual information promotes the transfer of semantic knowledge.
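To make the quantity concrete, here is a minimal computation of mutual information from a joint seen/unseen assignment table; the table itself and how it is built from model predictions are assumptions, not details from the paper.

```python
import numpy as np

def mutual_information(joint):
    """MI of a joint table P(seen=i, unseen=j); maximizing this quantity
    w.r.t. model parameters is what encourages knowledge transfer."""
    joint = joint / joint.sum()
    p_seen = joint.sum(axis=1, keepdims=True)        # marginal over seen classes
    p_unseen = joint.sum(axis=0, keepdims=True)      # marginal over unseen classes
    nz = joint > 0                                   # skip zero cells in the log
    return float((joint[nz] * np.log(joint[nz] / (p_seen @ p_unseen)[nz])).sum())
```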
Real-Time Bidding (RTB) is an important mechanism in modern online advertising systems.
The framework consists of two closely linked modules: 1) a lamina detector for identifying and locating each lamina pair on ultrasound coronal images, and 2) a spinal curvature estimator for calculating the scoliotic angles based on the chain of detected laminae.
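A toy sketch of the second stage, assuming the detector outputs an ordered chain of lamina-pair midpoints; the Cobb-like angle computation below is illustrative, not the paper's estimator.

```python
import numpy as np

def scoliotic_angle(centers):
    """Cobb-like angle from a chain of detected lamina-pair midpoints.

    `centers` is an (N, 2) array of (x, y) midpoints ordered from the top to
    the bottom of the spine. Local tilt is estimated by finite differences,
    and the angle between the most opposed tilts is reported.
    """
    centers = np.asarray(centers, dtype=float)
    d = centers[1:] - centers[:-1]                      # local direction vectors
    tilts = np.degrees(np.arctan2(d[:, 0], d[:, 1]))    # tilt vs. vertical axis
    return float(tilts.max() - tilts.min())             # angle between extremes

# Example: a mildly S-shaped chain of 6 lamina midpoints
chain = [(0, 0), (2, 10), (5, 20), (6, 30), (5, 40), (3, 50)]
print(f"estimated scoliotic angle: {scoliotic_angle(chain):.1f} deg")
```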
Continual learning is an important problem for achieving human-level intelligence in real-world applications as an agent must continuously accumulate knowledge in response to streaming data/tasks.
We aim to tackle the problem of point-based interactive segmentation, in which two key challenges are to infer the user's intention correctly and to propagate the user-provided annotations to unlabeled regions efficiently.
Weakly supervised nuclei segmentation is a critical problem for pathological image analysis and greatly benefits the community due to the significant reduction of labeling cost.
In this work, we introduce a new budget-aware few-shot learning problem that not only aims to learn novel object categories, but also needs to select informative examples to annotate in order to achieve data efficiency.
We develop a decoding-and-assembling paradigm for end-to-end scene graph generation.
Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad scope of vision-language tasks after fine-tuning.
To address those challenges, we adopt a primitive-based representation for 3D objects, and propose a two-stage graph network for primitive-based 3D object estimation, which consists of a sequential proposal module and a graph reasoning module.
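A minimal sketch of the two-stage idea, with hypothetical dimensions and a single round of message passing standing in for the graph reasoning module:

```python
import torch
import torch.nn as nn

class ProposalAndReasoning(nn.Module):
    """Two-stage sketch: a recurrent cell proposes primitive parameters one at
    a time, then one round of message passing refines them jointly."""
    def __init__(self, feat_dim=256, prim_dim=10, n_prims=8):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, feat_dim)
        self.head = nn.Linear(feat_dim, prim_dim)    # e.g. cuboid parameters
        self.msg = nn.Linear(prim_dim, prim_dim)
        self.n_prims = n_prims

    def forward(self, feat):                         # feat: (B, feat_dim)
        h, prims = torch.zeros_like(feat), []
        for _ in range(self.n_prims):                # sequential proposals
            h = self.rnn(feat, h)
            prims.append(self.head(h))
        p = torch.stack(prims, dim=1)                # (B, N, prim_dim)
        ctx = p.mean(dim=1, keepdim=True)            # fully-connected context
        return p + self.msg(ctx - p)                 # joint refinement
```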
The task of skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variation in human motion.
Incremental learning of semantic segmentation has emerged as a promising strategy for visual scene interpretation in the open-world setting.
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario.
Learning segmentation from noisy labels is an important task for medical image analysis due to the difficulty in acquiring high-quality annotations.
Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications.
Moreover, we introduce a weak annotation scheme with a hybrid label design for volumetric images, which improves model learning without increasing the overall annotation cost.
Scene graph generation is an important visual understanding task with a broad range of vision applications.
We address the problem of class incremental learning, which is a core step towards achieving adaptive vision intelligence.
Motivated by our discovery, we propose a unified distribution alignment strategy for long-tail visual recognition.
We introduce GNeRF, a framework that marries Generative Adversarial Networks (GANs) with Neural Radiance Field (NeRF) reconstruction for complex scenarios with unknown and even randomly initialized camera poses.
Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding.
Our numerical studies confirm the conquer estimator as a practical and reliable approach to large-scale inference for quantile regression.
We are the first to exploit confidence during refinement to improve semantic matching accuracy, and we develop an end-to-end self-supervised adversarial learning procedure for the entire matching network.
To be clear, in this paper we refer to unsupervised learning as learning without task-specific human annotations, pairs, or any form of weak supervision.
Semi-supervised learning has attracted much attention in medical image segmentation due to challenges in acquiring pixel-wise image annotations, which is a crucial step for building high-performance deep learning methods.
In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation.
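For reference, a minimal prototype-based prediction step (masked average pooling plus cosine similarity), which is the standard recipe such frameworks build on; tensor shapes and names are assumptions.

```python
import torch

def prototype_predict(support_feat, support_mask, query_feat):
    """Prototype-based few-shot segmentation sketch.

    The class prototype is the masked average of support features; query
    pixels are scored by cosine similarity to the prototype.
    support_feat: (C, H, W); support_mask: (H, W) in {0, 1};
    query_feat: (C, H, W) -> per-pixel similarity map (H, W).
    """
    proto = (support_feat * support_mask).sum(dim=(1, 2)) / support_mask.sum().clamp_min(1)
    proto = proto / proto.norm().clamp_min(1e-8)
    q = query_feat / query_feat.norm(dim=0, keepdim=True).clamp_min(1e-8)
    return torch.einsum("c,chw->hw", proto, q)       # cosine similarity per pixel
```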
Despite recent success of deep network-based Reinforcement Learning (RL), it remains elusive to achieve human-level efficiency in learning novel tasks.
We present a context-aware object detection method based on a retrieve-and-transform scene layout model.
To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Reasoning about human-object interactions is a core problem in human-centric scene understanding, and detecting such relations poses a unique challenge to vision systems due to large variations in human-object configurations, multiple co-occurring relation instances, and subtle visual differences between relation categories.
We instantiate our strategy by designing an end-to-end learnable deep network, named Dynamic Context Correspondence Network (DCCNet).
A promising strategy is to model the feature context by a fully-connected graph neural network (GNN), which augments traditional convolutional features with an estimated non-local context representation.
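A minimal PyTorch sketch of such a non-local context block, in which every spatial site attends to all others and the aggregated context augments the convolutional features; layer names are illustrative.

```python
import torch
import torch.nn as nn

class NonLocalContext(nn.Module):
    """Minimal non-local block over a fully-connected spatial graph."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim // 2, 1)
        self.k = nn.Conv2d(dim, dim // 2, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)     # (B, HW, C/2)
        k = self.k(x).flatten(2)                     # (B, C/2, HW)
        v = self.v(x).flatten(2).transpose(1, 2)     # (B, HW, C)
        attn = torch.softmax(q @ k / (C // 2) ** 0.5, dim=-1)  # (B, HW, HW)
        ctx = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return x + ctx                               # residual context augmentation
```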
We consider a fixed-price mechanism design setting where a seller sells one item via a social network, but initially the seller can communicate directly only with her neighbours.
We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images.
In this work, we take a transformation-based approach that predicts a 2D non-rigid spatial transform and warps the shape mask onto the target object.
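A small sketch of the warping step, assuming the network regresses a dense offset field in normalized coordinates (the transform prediction itself is omitted):

```python
import torch
import torch.nn.functional as F

def warp_mask(mask, flow):
    """Warp a shape mask with a predicted non-rigid 2D transform.

    mask: (B, 1, H, W) source shape mask; flow: (B, H, W, 2) predicted offsets
    in normalized [-1, 1] coordinates. The offsets are added to an identity
    sampling grid and the mask is resampled onto the target.
    """
    B, _, H, W = mask.shape
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    identity = torch.stack((gx, gy), dim=-1).expand(B, H, W, 2)
    grid = identity + flow                           # non-rigid deformation field
    return F.grid_sample(mask, grid, align_corners=True)
```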
In particular, while some of them aim at segmenting the image into regions, such as object or surface instances, others aim at inferring the semantic labels of given regions, or their support relationships.
On the other hand, we find that the attention of different subjects consistently focuses on a single face in each frame of videos involving multiple faces.
In this context, existing methods typically propose candidate objects, usually as bounding boxes, and directly predict a binary mask within each such proposal.
With increasing demand for efficient image and video analysis, test-time cost of scene parsing becomes critical for many large-scale or time-sensitive vision applications.
We apply a constrained mean-field algorithm to estimate the pixel-level labels, and use the estimated labels to update the parameters of the CNN in an iterative EM framework.
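An illustrative skeleton of that EM loop, with the constrained mean-field E-step reduced to a label-constrained softmax for brevity; helper names, shapes, and the use of image-level labels are assumptions.

```python
import torch

def em_train(model, loader, optimizer, m_steps=3):
    for images, image_labels in loader:              # image-level labels only
        # E-step: estimate pixel labels under the constraint that only
        # classes present at the image level may appear.
        with torch.no_grad():
            logits = model(images)                   # (B, K, H, W)
            present = image_labels[:, :, None, None] # (B, K, 1, 1), 0/1 presence
            q = torch.softmax(logits, dim=1) * present
            q = q / q.sum(dim=1, keepdim=True).clamp_min(1e-8)
            pseudo = q.argmax(dim=1)                 # (B, H, W) estimated labels
        # M-step: update the CNN parameters against the estimated labels.
        for _ in range(m_steps):
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(images), pseudo)
            loss.backward()
            optimizer.step()
```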
In particular, we introduce a deep structured network that jointly predicts the objectness scores and the bounding box locations of multiple object candidates.
Despite much progress, state-of-the-art techniques suffer from two drawbacks: (i) they rely on the assumption that intensity edges coincide with depth discontinuities, which, unfortunately, is only true in controlled environments; and (ii) they typically exploit the availability of high-resolution training depth maps, which can often not be acquired in practice due to the sensors' limitations.
To exploit the correlations between objects, we build a fully-connected CRF on the candidates, which explicitly incorporates both geometric layout relations across object classes and similarity relations across multiple images.
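To pin down the model class, here is a toy energy function for such a fully-connected CRF over candidates; the actual potentials used in the paper are not reproduced.

```python
import numpy as np

def crf_energy(labels, unary, geo_pair, sim_pair):
    """Energy of a fully-connected CRF over object candidates.

    labels:   (N,) candidate label assignment
    unary:    (N, K) per-candidate class costs
    geo_pair: (N, N, K, K) geometric layout costs between class pairs
    sim_pair: (N, N) similarity costs encouraging matched candidates
              (e.g. across images) to take the same label
    """
    e = unary[np.arange(len(labels)), labels].sum()
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            e += geo_pair[i, j, labels[i], labels[j]]
            e += sim_pair[i, j] * (labels[i] != labels[j])
    return e
```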
We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments.
We tackle the problem of single image depth estimation, which, without additional knowledge, suffers from many ambiguities.
To scale up our method, we adopt an active inference strategy to improve the efficiency, which adaptively selects object subgraphs in the object-augmented dense CRF.
We address the problem of joint detection and segmentation of multiple object instances in an image, a key step towards scene understanding.
We propose a structured Hough voting method for detecting objects with heavy occlusion in indoor environments.
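As background, a toy (plain, unstructured) Hough voting accumulator; the structured variant in the paper additionally organizes the votes, which is not shown here.

```python
import numpy as np

def hough_votes(parts, offsets, shape):
    """Accumulate object-center votes from detected parts.

    parts:   (M, 2) detected part locations (y, x)
    offsets: (M, 2) learned part-to-center displacements (dy, dx)
    shape:   (H, W) accumulator size; peaks indicate object centers
    """
    acc = np.zeros(shape)
    for (y, x), (dy, dx) in zip(parts, offsets):
        cy, cx = int(round(y + dy)), int(round(x + dx))
        if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
            acc[cy, cx] += 1
    return acc
```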