The IDR module is designed to reconstruct the remaining details from the residual measurement vector, and the MRU is employed to update the residual measurement vector and feed it into the next IDR module.
Cross-view geo-localization (CVGL), which aims to estimate the geographical location of a ground-level camera by matching against an enormous set of geo-tagged aerial (e.g., satellite) images, remains extremely challenging due to the drastic appearance differences across views.
Unsupervised domain adaptation (UDA) aims to adapt models learned from a well-annotated source domain to a target domain, where only unlabeled samples are given.
Recent non-local self-attention methods have proven to be effective in capturing long-range dependencies for semantic segmentation.
It is therefore interesting to study how these two tasks can be coupled to benefit each other.
Without any external training data, our proposed Denoised NL can achieve state-of-the-art performance of 83.5% and 46.69% mIoU on Cityscapes and ADE20K, respectively.
Unsupervised learning of depth from indoor monocular videos is challenging as the artificial environment contains many textureless regions.
Detecting out-of-distribution (OOD) data has become a critical component in ensuring the safe deployment of machine learning models in the real world.
First, we present a domain composition method that represents one certain domain by a linear combination of a set of basis representations (i.e., a representation bank).
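A minimal numpy sketch of this idea: a sample's domain representation is formed as a convex combination of entries in a learned representation bank. The function name, softmax weighting, and temperature parameter here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def compose_domain(feature, bank, temperature=1.0):
    """Represent a sample's domain as a convex combination of basis
    representations (a hypothetical 'representation bank').

    feature: (d,) sample feature vector
    bank:    (k, d) matrix of k learned basis representations
    """
    # Similarity between the sample and each basis entry.
    scores = bank @ feature / temperature           # (k,)
    # Softmax yields non-negative coefficients summing to 1.
    coeffs = np.exp(scores - scores.max())
    coeffs /= coeffs.sum()
    # The domain representation is the weighted sum of basis vectors.
    return coeffs @ bank                            # (d,)

rng = np.random.default_rng(0)
bank = rng.standard_normal((4, 8))
feat = rng.standard_normal(8)
domain_repr = compose_domain(feat, bank)
```

Because the coefficients are a softmax, the composed representation always lies in the convex hull of the bank entries.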
We first propose an outlier masking technique that considers the occluded or dynamic pixels as statistical outliers in the photometric error map.
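A toy numpy sketch of such outlier masking, assuming a simple quantile threshold on the photometric error map (the quantile cutoff is an illustrative assumption, not the paper's exact criterion):

```python
import numpy as np

def outlier_mask(photo_error, quantile=0.9):
    """Mask pixels whose photometric error is a statistical outlier,
    treating them as likely occluded or dynamic regions.

    photo_error: (H, W) per-pixel photometric error map
    Returns a boolean mask that is True for inlier (trusted) pixels.
    """
    threshold = np.quantile(photo_error, quantile)
    return photo_error <= threshold

# Toy example: a mostly-static error map with a few large-error pixels.
err = np.full((4, 4), 0.1)
err[0, 0] = err[3, 3] = 5.0    # simulated dynamic-object pixels
mask = outlier_mask(err, quantile=0.8)
loss = err[mask].mean()        # photometric loss over trusted pixels only
```

Pixels flagged as outliers are simply excluded from the loss, so occluded or moving regions no longer corrupt the depth-training signal.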
3D semantic scene completion and 2D semantic segmentation are two tightly correlated tasks that are both essential for indoor scene understanding, because they predict the same semantic classes, using positively correlated high-level features.
We integrate this ranking scheme with two frequency models and a GPT-2 styled language model, along with the acceptance model, to yield 27.80% and 37.64% increases in TOP1 and TOP5 accuracy, respectively.
Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image.
Detecting out-of-distribution (OOD) inputs is a central challenge for safely deploying machine learning models in the real world.
The Detail Branch processes frames at the original resolution to preserve detailed visual cues, while the Context Branch employs a down-sampling strategy to capture long-range context.
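A minimal numpy sketch of such a two-branch scheme: the detail path keeps full resolution while the context path works on a downsampled copy and is upsampled back before fusion. Average pooling, nearest-neighbour upsampling, and plain averaging as the fusion step are simplifying assumptions standing in for the actual learned layers.

```python
import numpy as np

def two_branch_features(frame, down=4):
    """Toy detail/context two-branch scheme.

    frame: (H, W) single-channel frame; H and W divisible by `down`
    """
    h, w = frame.shape
    # Detail branch: identity at full resolution (stand-in for conv layers).
    detail = frame
    # Context branch: average-pool by `down`, then nearest-neighbour upsample.
    pooled = frame.reshape(h // down, down, w // down, down).mean(axis=(1, 3))
    context = pooled.repeat(down, axis=0).repeat(down, axis=1)
    # Fuse the two branches.
    return 0.5 * (detail + context)

frame = np.arange(64, dtype=float).reshape(8, 8)
fused = two_branch_features(frame, down=4)
```

The downsampled path sees a far larger effective receptive field per operation, which is what lets the context branch capture long-range dependencies cheaply.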
Radiotherapy is a treatment where radiation is used to eliminate cancer cells.
Deep Neural Network (DNN) based super-resolution algorithms have greatly improved the quality of the generated images.
We present a new method for scene agnostic camera localization using dense scene matching (DSM), where a cost volume is constructed between a query image and a scene.
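A toy numpy sketch of the cost-volume idea, assuming cosine similarity between per-pixel query features and per-point scene features (the actual DSM construction and feature extractor are more involved; this only illustrates the matching structure):

```python
import numpy as np

def cost_volume(query_feats, scene_feats):
    """Dense-matching cost volume: cosine similarity between every
    query-pixel feature and every scene-point feature.

    query_feats: (N, d) features for N query pixels
    scene_feats: (M, d) features for M scene points
    Returns an (N, M) cost volume.
    """
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = scene_feats / np.linalg.norm(scene_feats, axis=1, keepdims=True)
    return q @ s.T

rng = np.random.default_rng(1)
q = rng.standard_normal((5, 16))
s = rng.standard_normal((7, 16))
cv = cost_volume(q, s)
best_match = cv.argmax(axis=1)   # most similar scene point per query pixel
```

The resulting 2D-3D correspondences (query pixel to best-matching scene point) are what a downstream PnP-style solver would consume to recover the camera pose.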
Secondly, we propose an AR mapping pipeline which takes the input from the scanning device and produces accurate AR Maps.
In this paper, we propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics while keeping the efficiency high.
In addition, we design a more effective fusion module for our fusion scheme.
Despite the state-of-the-art performance achieved by Convolutional Neural Networks (CNNs) for automatic segmentation of OARs, existing methods do not provide uncertainty estimation of the segmentation results for treatment planning, and their accuracy is still limited by several factors, including the low contrast of soft tissues in CT, highly imbalanced sizes of OARs, and large inter-slice spacing.
Spotting objects that are visually adapted to their surroundings is challenging for both humans and AI.
In practice, an initial semantic segmentation (SS) of a single sweep point cloud can be achieved by any appealing network and then flows into the semantic scene completion (SSC) module as the input.
The accuracy of deep convolutional neural networks (CNNs) generally improves when fueled with high resolution images.
A large number of existing methods have shown the attention mechanism to be helpful in solving the occlusion problem.
Also, we propose a scale attention module implicitly emphasizing the most salient feature maps among multiple scales so that the CNN is adaptive to the size of an object.
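A minimal numpy sketch of scale attention: feature maps from several scales are weighted by a softmax over their pooled responses, so the most salient scale dominates the fusion. Global average pooling as the saliency score is an illustrative assumption standing in for the module's learned attention layers.

```python
import numpy as np

def scale_attention(feature_maps):
    """Weight multi-scale feature maps by a softmax over their
    global-average-pooled responses, then fuse them.

    feature_maps: (S, C, H, W) maps from S scales at a common resolution
    """
    # One saliency score per scale via global average pooling.
    scores = feature_maps.mean(axis=(1, 2, 3))            # (S,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over scales
    # Weighted sum across scales.
    return np.tensordot(weights, feature_maps, axes=1)    # (C, H, W)

# Three scales; the middle one has the strongest response.
maps = np.stack([np.full((2, 4, 4), v) for v in (0.1, 1.0, 0.2)])
fused = scale_attention(maps)
```

In the toy example the fused output is pulled toward the high-response scale, which is the mechanism that lets the network adapt to object size.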
In most scenarios, one might obtain annotations of a single or a few organs from one training set, and obtain annotations of the other organs from another set of training images.
We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point-cloud.
Perceptual learning approaches like perceptual loss are empirically powerful for such tasks but they usually rely on the pre-trained classification network to provide features, which are not necessarily optimal in terms of visual perception of image transformation.
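A toy numpy sketch of the perceptual-loss idea referenced here: images are compared in a feature space rather than pixel space. The random linear projection below is only a stand-in for a pretrained network's feature extractor, which is the part this line argues is not necessarily optimal.

```python
import numpy as np

def perceptual_loss(img_a, img_b, feature_fn):
    """Compare two images by the distance between their features.
    `feature_fn` stands in for a pretrained network's feature extractor."""
    fa, fb = feature_fn(img_a), feature_fn(img_b)
    return float(np.mean((fa - fb) ** 2))

rng = np.random.default_rng(2)
proj = rng.standard_normal((16, 64))   # hypothetical "network" weights

def extractor(img):
    # Stand-in feature extractor: a fixed linear projection.
    return proj @ img.ravel()

a = rng.standard_normal((8, 8))
b = a + 0.01 * rng.standard_normal((8, 8))
loss = perceptual_loss(a, b, extractor)
```

Swapping `extractor` for features trained specifically for perception of image transformations, rather than for classification, is precisely the kind of change this line motivates.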
Unsupervised learning of depth and ego-motion from unlabelled monocular videos has recently drawn great attention, as it avoids the expensive ground-truth annotations required by supervised approaches.
However, numerous studies have shown that global information is crucial for image restoration tasks such as image demosaicing and enhancement.
Previous work on cross-lingual sequence labeling tasks either requires parallel data or bridges the two languages through word-by-word matching.
In this paper, we propose an end-to-end deep neural network for solving the problem of imbalanced large and small organ segmentation in head and neck (HaN) CT images.
Visual tracking is fragile in difficult scenarios: appearance ambiguity and variation, as well as occlusion, can easily degrade most visual trackers to some extent.
Modern deep convolutional neural networks (CNNs) for image classification and object detection are often trained offline on large static datasets.
How to effectively learn the temporal variation of target appearance and exclude interference from cluttered backgrounds, while maintaining real-time response, is an essential problem in visual object tracking.
This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details.
Pedestrian re-identification is a difficult problem due to the large variations in a person's appearance caused by different poses and viewpoints, illumination changes, and occlusions.
The aim of this study is to provide an automatic computational framework to assist clinicians in diagnosing Focal Liver Lesions (FLLs) in Contrast-Enhanced Ultrasound (CEUS).
This paper addresses a newly emerging task in vision and multimedia research: recognizing human actions from still images.
However, the task in tracking is to search for a specific object, rather than an object category as in detection.