The phrase-region alignment task aims to improve cross-modal alignment by exploiting the similarities between noun phrases and object labels in the linguistic space.
To address those challenges, we adopt a primitive-based representation for 3D objects and propose a two-stage graph network for primitive-based 3D object estimation, consisting of a sequential proposal module and a graph reasoning module.
The task of skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variations of human motion.
Incremental learning of semantic segmentation has emerged as a promising strategy for visual scene interpretation in the open-world setting.
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario.
Learning segmentation from noisy labels is an important task for medical image analysis due to the difficulty in acquiring high-quality annotations.
Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications.
Moreover, we introduce a weak annotation scheme with a hybrid label design for volumetric images, which improves model learning without increasing the overall annotation cost.
Scene graph generation is an important visual understanding task with a broad range of vision applications.
We address the problem of class incremental learning, which is a core step towards achieving adaptive vision intelligence.
Motivated by our discovery, we propose a unified distribution alignment strategy for long-tail visual recognition.
We introduce GNeRF, a framework that marries Generative Adversarial Networks (GAN) with Neural Radiance Field (NeRF) reconstruction for complex scenarios with unknown and even randomly initialized camera poses.
Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding.
Our numerical studies confirm the conquer estimator as a practical and reliable approach to large-scale inference for quantile regression.
We are the first to exploit confidence during refinement to improve semantic matching accuracy, and we develop an end-to-end self-supervised adversarial learning procedure for the entire matching network.
To be clear, in this paper we refer to unsupervised learning as learning without task-specific human annotations, pairs, or any form of weak supervision.
Semi-supervised learning has attracted much attention in medical image segmentation due to challenges in acquiring pixel-wise image annotations, which is a crucial step for building high-performance deep learning methods.
In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation.
Despite recent success of deep network-based Reinforcement Learning (RL), it remains elusive to achieve human-level efficiency in learning novel tasks.
We present a context-aware object detection method based on a retrieve-and-transform scene layout model.
To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.
Reasoning about human-object interactions is a core problem in human-centric scene understanding. Detecting such relations poses a unique challenge to vision systems due to large variations in human-object configurations, multiple co-occurring relation instances, and subtle visual differences between relation categories.
We instantiate our strategy by designing an end-to-end learnable deep network, named Dynamic Context Correspondence Network (DCCNet).
A promising strategy is to model the feature context by a fully-connected graph neural network (GNN), which augments traditional convolutional features with an estimated non-local context representation.
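To give a flavour of what such non-local context aggregation computes, here is a minimal NumPy sketch, not the actual network: the function name and the single round of softmax-weighted message passing over a fully-connected graph of spatial positions are illustrative assumptions.

```python
import numpy as np

def nonlocal_context(feats):
    """One round of message passing on a fully-connected graph of
    spatial positions: every position attends to every other via a
    softmax over pairwise feature affinities.

    feats: (N, C) array, one C-dim conv feature per position.
    Returns (N, 2C): each feature concatenated with its context.
    """
    logits = feats @ feats.T                      # (N, N) pairwise affinities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # row-wise softmax weights
    context = w @ feats                           # (N, C) non-local context
    return np.concatenate([feats, context], axis=1)

feats = np.random.rand(6, 4)
out = nonlocal_context(feats)
print(out.shape)  # (6, 8)
```

In the augmented representation, the first half of each row is the original convolutional feature and the second half is its estimated non-local context.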
We consider a fixed-price mechanism design setting in which a seller sells one item via a social network, but can initially communicate only with her direct neighbours.
We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images.
In this work, we take a transformation based approach that predicts a 2D non-rigid spatial transform and warps the shape mask onto the target object.
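As a toy illustration of warping a shape mask by a predicted transform, the sketch below simplifies the warp to a 2x3 affine matrix with nearest-neighbour backward mapping (the method itself predicts a non-rigid transform); `warp_mask` and the example transform are assumptions for this sketch.

```python
import numpy as np

def warp_mask(mask, theta):
    """Warp a binary mask with a 2x3 affine transform via backward
    mapping: each output pixel samples the input at theta @ [x, y, 1],
    with nearest-neighbour rounding and zero padding outside."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = theta @ coords                      # (2, h*w) source coordinates
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=mask.dtype)
    out[valid] = mask[sy[valid], sx[valid]]
    return out.reshape(h, w)

mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:3, 1:3] = 1                        # a 2x2 square
shift = np.array([[1.0, 0.0, -1.0],       # backward map: sample x-1,
                  [0.0, 1.0, 0.0]])       # so the mask shifts right by 1
print(warp_mask(mask, shift))
```

A learned non-rigid warp would replace the fixed affine matrix with a per-pixel predicted displacement field, but the backward-mapping machinery is the same.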
In particular, while some of them aim at segmenting the image into regions, such as object or surface instances, others aim at inferring the semantic labels of given regions, or their support relationships.
On the other hand, we find that the attention of different subjects consistently focuses on a single face in each frame of videos involving multiple faces.
In this context, existing methods typically propose candidate objects, usually as bounding boxes, and directly predict a binary mask within each such proposal.
With increasing demand for efficient image and video analysis, test-time cost of scene parsing becomes critical for many large-scale or time-sensitive vision applications.
We apply a constrained mean-field algorithm to estimate the pixel-level labels, and use the estimated labels to update the parameters of the CNN in an iterative EM framework.
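A minimal sketch of the E-step in such a scheme, with assumed names and a 1-D pixel chain standing in for the image lattice (the constraints on the mean-field updates are omitted here):

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def mean_field_labels(unary, n_iters=10, pair_weight=1.0):
    """E-step sketch: mean-field inference over a 1-D chain of pixels
    with a Potts-style smoothness term.

    unary: (N, K) log-scores from the current segmentation model.
    Returns (N, K) soft label marginals q.
    """
    q = softmax(unary)
    for _ in range(n_iters):
        # Each pixel receives a message from its chain neighbours
        # rewarding agreement in the label distribution.
        msg = np.zeros_like(q)
        msg[1:] += pair_weight * q[:-1]
        msg[:-1] += pair_weight * q[1:]
        q = softmax(unary + msg)
    return q

# Three "pixels": confident class 0, ambiguous, confident class 1.
unary = np.array([[2.0, 0.0], [0.1, 0.0], [0.0, 2.0]])
q = mean_field_labels(unary)
print(q.argmax(axis=1))  # [0 0 1]
```

In the full EM loop, the argmax of `q` would serve as pseudo-labels for the M-step that updates the CNN parameters, and the two steps alternate until convergence.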
In particular, we introduce a deep structured network that jointly predicts the objectness scores and the bounding box locations of multiple object candidates.
Despite much progress, state-of-the-art techniques suffer from two drawbacks: (i) they rely on the assumption that intensity edges coincide with depth discontinuities, which, unfortunately, is only true in controlled environments; and (ii) they typically exploit the availability of high-resolution training depth maps, which often cannot be acquired in practice due to sensor limitations.
To exploit the correlations between objects, we build a fully-connected CRF on the candidates, which explicitly incorporates both geometric layout relations across object classes and similarity relations across multiple images.
We design a system to describe an image with emotions, and present a model that automatically generates captions with positive or negative sentiments.
We tackle the problem of single image depth estimation, which, without additional knowledge, suffers from many ambiguities.
To scale up our method, we adopt an active inference strategy to improve the efficiency, which adaptively selects object subgraphs in the object-augmented dense CRF.
We address the problem of joint detection and segmentation of multiple object instances in an image, a key step towards scene understanding.