The IDR module is designed to reconstruct the remaining details from the residual measurement vector, and MRU is employed to update the residual measurement vector and feed it into the next IDR module.
Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields.
Creating a taxonomy of interests is expensive and labor-intensive: not only must we identify nodes and interconnect them, but, to use the taxonomy, we must also connect the nodes to relevant entities such as users, pins, and queries.
During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations.
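The contrastive alignment step above can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE-style loss over paired image and point-cloud embeddings; the function name, batch shape, and temperature are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def info_nce_loss(img_emb, pc_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired image / point-cloud embeddings.

    img_emb, pc_emb: (N, D) arrays; row i of each comes from the same RGB-D pair.
    """
    # L2-normalize so the similarity matrix holds cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    pc = pc_emb / np.linalg.norm(pc_emb, axis=1, keepdims=True)
    logits = img @ pc.T / temperature  # (N, N); diagonal = matched pairs

    def xent_diagonal(l):
        # Cross-entropy with the diagonal (true matches) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average image->point-cloud and point-cloud->image directions
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

With matched embeddings the loss is low; shuffling one side so pairs no longer correspond raises it, which is the behavior contrastive pre-training exploits.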
Specifically, cheap scene graph supervision data can be easily obtained by parsing image language descriptions into semantic graphs.
We designed RepVGG-lite as the backbone network in our architecture; it is more discriminative than other general-purpose networks on the place recognition task.
Change detection (CD) aims to decouple object changes (i.e., objects missing or appearing) from background changes (i.e., environmental variations such as lighting and seasonal changes) in two images captured of the same scene over a long time span, and has critical applications in disaster management, urban development, etc.
However, as pre-trained models are scaling up, fully fine-tuning them on text-video retrieval datasets has a high risk of overfitting.
In the proposed system, RS facilitates the exploitation of the shared interests of the users in VR streaming, and IRS creates additional propagation channels to support the transmission of high-resolution 360-degree videos.
Knowledge distillation is an effective approach to learn compact models (students) with the supervision of large and strong models (teachers).
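The soft-target term at the heart of knowledge distillation can be sketched in a few lines of NumPy; the temperature value and the Hinton-style `T**2` scaling are conventional choices, assumed here rather than taken from this particular paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T**2 so gradients keep a consistent magnitude."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (T ** 2) * np.mean(kl)
```

In practice this term is combined with the ordinary cross-entropy on ground-truth labels; the loss is zero when the student exactly matches the teacher's logits.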
Visual relocalization has been a widely discussed problem in 3D vision: given a pre-constructed 3D visual map, the 6 DoF (Degrees-of-Freedom) pose of a query image is estimated.
The ultimate aim of image restoration tasks such as denoising is to find an exact correspondence between the noisy and clean image domains.
Change detection (CD) aims to detect change regions within an image pair captured at different times, playing a significant role in diverse real-world applications.
Such a design decomposes HOI set prediction into two sequential phases: interaction proposal generation is performed first, followed by transforming the non-parametric interaction proposals into HOI predictions via a structure-aware Transformer.
Alternatively, the presence of heterophilous neighbors can be exploited for feature learning, so a hybrid message-passing approach is devised that aggregates homophilous neighbors and diversifies heterophilous neighbors based on edge classification.
Geolocation is a fundamental component of route planning and navigation for unmanned vehicles, but GNSS-based geolocation fails under denial-of-service conditions.
Unsupervised domain adaptation (UDA) aims to adapt models learned from a well-annotated source domain to a target domain, where only unlabeled samples are given.
To further improve SBM, an Integration-and-Distribution Module (IDM) is introduced to enhance frame-level representations.
Recent non-local self-attention methods have proven to be effective in capturing long-range dependencies for semantic segmentation.
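The non-local self-attention operation referred to above can be sketched in NumPy: every spatial position attends to every other position, so each output pixel aggregates context from the whole feature map. The function name and residual formulation are illustrative assumptions:

```python
import numpy as np

def non_local_attention(x):
    """Minimal non-local (self-attention) block on a feature map.

    x: (H, W, C). Output has the same shape; each position's feature is a
    residual sum of its input and an attention-weighted mix of all positions.
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                     # treat pixels as tokens
    attn = flat @ flat.T / np.sqrt(c)              # pairwise affinities
    attn = attn - attn.max(axis=1, keepdims=True)  # stable softmax
    attn = np.exp(attn)
    attn = attn / attn.sum(axis=1, keepdims=True)  # rows sum to 1
    out = attn @ flat                              # context-aggregated features
    return x + out.reshape(h, w, c)                # residual connection
```

Real implementations insert learned query/key/value projections before the affinity computation; they are omitted here to keep the long-range aggregation itself visible.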
It is therefore interesting to study how these two tasks can be coupled to benefit each other.
Without any external training data, our proposed Denoised NL can achieve state-of-the-art performance of 83.5% and 46.69% mIoU on Cityscapes and ADE20K, respectively.
Unsupervised learning of depth from indoor monocular videos is challenging as the artificial environment contains many textureless regions.
Detecting out-of-distribution (OOD) data has become a critical component in ensuring the safe deployment of machine learning models in the real world.
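A common baseline for the OOD detection problem described here is the maximum softmax probability (MSP) score: in-distribution inputs tend to receive confident predictions, OOD inputs flatter ones. The sketch below is a generic baseline, not this paper's method, and the threshold value is an illustrative assumption (in practice it is tuned on validation data):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability per input: high = confident,
    typically in-distribution; low = uncertain, possibly OOD."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

def flag_ood(logits, threshold=0.5):
    """Flag inputs whose MSP falls below a chosen threshold."""
    return msp_score(logits) < threshold
```

A sharply peaked logit vector scores near 1.0 and passes; a flat one scores near 1/num_classes and is flagged.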
First, we present a domain composition method that represents a given domain by a linear combination of a set of basis representations (i.e., a representation bank).
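The domain-composition idea can be sketched as expressing a domain feature in the span of a representation bank; here a least-squares fit stands in for the learned combination coefficients, which is an assumption for illustration rather than the paper's actual mechanism:

```python
import numpy as np

def compose_domain(feature, bank):
    """Represent a domain feature as a linear combination of basis
    representations (the rows of `bank`).

    feature: (D,) domain feature; bank: (K, D) representation bank.
    Returns (coefficients, reconstruction).
    """
    # Solve bank.T @ coeffs ~= feature in the least-squares sense
    coeffs, *_ = np.linalg.lstsq(bank.T, feature, rcond=None)  # (K,)
    return coeffs, coeffs @ bank
```

When the feature lies in the span of the bank, the reconstruction is exact; otherwise the residual measures how poorly the bank covers that domain.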
We first propose an outlier masking technique that considers the occluded or dynamic pixels as statistical outliers in the photometric error map.
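A minimal version of such outlier masking can be written with a percentile threshold on the photometric error map; the cutoff percentile is an illustrative assumption, and the paper's actual statistical criterion may differ:

```python
import numpy as np

def outlier_mask(photometric_error, pct=95):
    """Mask pixels whose photometric error is a statistical outlier
    (above the given percentile), treating them as likely occluded or
    dynamic so they can be excluded from the training loss.

    Returns a boolean map: True = keep the pixel in the loss.
    """
    thresh = np.percentile(photometric_error, pct)
    return photometric_error <= thresh
```

Pixels on moving objects or at occlusion boundaries produce large photometric residuals, so they land above the threshold and are dropped from the loss.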
3D semantic scene completion and 2D semantic segmentation are two tightly correlated tasks that are both essential for indoor scene understanding, because they predict the same semantic classes, using positively correlated high-level features.
We integrate this ranking scheme with two frequency models and a GPT-2 styled language model, along with the acceptance model, to yield 27.80% and 37.64% increases in TOP1 and TOP5 accuracy, respectively.
Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image.
Detecting out-of-distribution (OOD) inputs is a central challenge for safely deploying machine learning models in the real world.
Detail Branch processes frames at original resolution to preserve the detailed visual clues, and Context Branch with a down-sampling strategy is employed to capture long-range contexts.
Radiotherapy is a treatment where radiation is used to eliminate cancer cells.
Deep Neural Network (DNN) based super-resolution algorithms have greatly improved the quality of the generated images.
We present a new method for scene agnostic camera localization using dense scene matching (DSM), where a cost volume is constructed between a query image and a scene.
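The cost volume at the core of this matching step can be sketched as a correlation table between per-pixel query features and scene features; the cosine-similarity scoring below is a common choice assumed for illustration, not necessarily the paper's exact formulation:

```python
import numpy as np

def cost_volume(query_feat, scene_feat):
    """Correlation cost volume between query pixels and scene points.

    query_feat: (P, C) features of P query-image pixels; scene_feat: (S, C)
    features of S scene points. Entry (i, j) is the cosine similarity
    scoring how well pixel i matches scene point j.
    """
    q = query_feat / np.linalg.norm(query_feat, axis=1, keepdims=True)
    s = scene_feat / np.linalg.norm(scene_feat, axis=1, keepdims=True)
    return q @ s.T  # (P, S)
```

Downstream, such a volume is typically regularized and converted into 2D-3D correspondences from which the camera pose is solved.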
Secondly, we propose an AR mapping pipeline which takes the input from the scanning device and produces accurate AR Maps.
In this paper, we propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics while keeping the efficiency high.
In addition, we design a more effective fusion module for our fusion scheme.
Despite the state-of-the-art performance achieved by Convolutional Neural Networks (CNNs) for automatic segmentation of OARs, existing methods do not provide uncertainty estimates of the segmentation results for treatment planning, and their accuracy is still limited by several factors, including the low contrast of soft tissues in CT, the highly imbalanced sizes of OARs, and large inter-slice spacing.
Grant-free multiple access (GFMA) is a promising paradigm to efficiently support uplink access of Internet of Things (IoT) devices.
Spotting objects that are visually adapted to their surroundings is challenging for both humans and AI.
In practice, an initial semantic segmentation (SS) of a single-sweep point cloud can be obtained from any suitable network and then flows into the semantic scene completion (SSC) module as the input.
The accuracy of deep convolutional neural networks (CNNs) generally improves when fueled with high resolution images.
The attention mechanism has been shown by a large number of existing methods to be helpful in solving the occlusion problem.
Also, we propose a scale attention module implicitly emphasizing the most salient feature maps among multiple scales so that the CNN is adaptive to the size of an object.
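A crude sketch of such a scale-attention module is a softmax weighting over the scales' feature maps; using the global average activation as the saliency score is an illustrative assumption, since the paper's scoring function is not given here:

```python
import numpy as np

def scale_attention(feats):
    """Softmax attention over scales: weight each scale's feature map by a
    saliency score (here, its global average activation) so the most
    salient scale dominates the fused output.

    feats: (S, H, W, C) feature maps from S scales, already resized alike.
    Returns the fused (H, W, C) map.
    """
    scores = feats.mean(axis=(1, 2, 3))          # one saliency score per scale
    w = np.exp(scores - scores.max())            # stable softmax over scales
    w = w / w.sum()
    return np.tensordot(w, feats, axes=(0, 0))   # weighted fusion, (H, W, C)
```

With one scale far more active than the others, its softmax weight approaches 1 and the fused map essentially selects that scale, which is the "implicit emphasis" behavior described above.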
In most scenarios, one might obtain annotations of a single organ or a few organs from one training set, and obtain annotations of the other organs from another set of training images.
We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point-cloud.
Perceptual learning approaches like perceptual loss are empirically powerful for such tasks but they usually rely on the pre-trained classification network to provide features, which are not necessarily optimal in terms of visual perception of image transformation.
Unsupervised learning of depth and ego-motion from unlabelled monocular videos has recently drawn great attention, as it avoids the expensive ground truth required by supervised approaches.
However, plenty of studies have shown that global information is crucial for image restoration tasks like image demosaicing and enhancing.
Previous work on cross-lingual sequence labeling tasks either requires parallel data or bridges the two languages through word-by-word matching.
In this paper, we propose an end-to-end deep neural network for solving the problem of imbalanced large and small organ segmentation in head and neck (HaN) CT images.
Visual tracking is fragile in some difficult scenarios; for instance, appearance ambiguity, appearance variation, and occlusion can easily degrade most visual trackers to some extent.
Modern deep convolutional neural networks (CNNs) for image classification and object detection are often trained offline on large static datasets.
How to effectively learn the temporal variation of target appearance and exclude the interference of cluttered backgrounds, while maintaining real-time response, is an essential problem in visual object tracking.
This paper proposes a Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details.
Pedestrian re-identification is a difficult problem due to the large variations in a person's appearance caused by different poses and viewpoints, illumination changes, and occlusions.
The aim of this study is to provide an automatic computational framework to assist clinicians in diagnosing Focal Liver Lesions (FLLs) in Contrast-Enhanced Ultrasound (CEUS).
This paper addresses a newly arising task in vision and multimedia research: recognizing human actions from still images.
However, the task in tracking is to search for a specific object, rather than an object category as in detection.