From the stored propagated features, we propose to learn multi-scale temporal contexts, and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder.
Instance segmentation is a challenging task aiming at classifying and segmenting all object instances of specific classes.
Therefore, we assume that task-relevant information not shared between views cannot be ignored, and we theoretically prove that the minimal sufficient representation in contrastive learning is not sufficient for the downstream tasks, which causes performance degradation.
By inserting the proposed cross-stage mechanism in existing spatial and temporal transformer blocks, we build a separable transformer network for video learning based on ViT structure, in which self-attentions and features are progressively aggregated from one block to the next.
Our method contains two training stages based on model-agnostic meta learning (MAML), each of which consists of a contrastive branch and a meta branch.
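The paper's contrastive and meta branches are not reproduced here, but the MAML structure it builds on (an inner adaptation step per task, then an outer update of the shared initialization) can be illustrated with a minimal first-order sketch on hypothetical toy 1-D tasks with loss (θ − a)²; all names and the task family are illustrative assumptions, not the paper's method.

```python
def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05):
    # First-order MAML sketch on toy 1-D tasks with loss L_a(t) = (t - a)^2.
    # Inner loop: one gradient step per task from the shared init `theta`.
    # Outer loop: update `theta` using gradients of the post-adaptation loss.
    meta_grad = 0.0
    for a in tasks:
        adapted = theta - inner_lr * 2 * (theta - a)   # inner (task-specific) update
        meta_grad += 2 * (adapted - a)                 # gradient of post-adaptation loss
    return theta - outer_lr * meta_grad / len(tasks)

# Repeated meta-steps drive the initialization toward a point that adapts
# well to all tasks (here, the task mean).
theta = 5.0
for _ in range(300):
    theta = maml_step(theta, [-1.0, 1.0])
```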
In this paper, we propose a Geometry Uncertainty Projection Network (GUP Net) to tackle the error amplification problem at both inference and training stages.
However, spatial correlations and temporal correlations capture different kinds of contextual information: scene structure and temporal reasoning, respectively.
Detecting and localizing objects in the real 3D space, which plays a crucial role in scene understanding, is particularly challenging given only a monocular image due to the geometric information loss during imagery projection.
Specifically, we propose a phoneme-based distribution regularization (PbDr) for speech enhancement, which incorporates frame-wise phoneme information into the speech enhancement network in a conditional manner.
This paper proposes MCSSL, a self-supervised learning approach for building custom object detection models in multi-camera networks.
We develop a conceptually simple, flexible, and effective framework (named T-Net) for two-view correspondence learning.
In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net.
A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes.
Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles.
In this paper, we consider the problem of the scattering of in-plane waves at an interface between a homogeneous medium and a metamaterial.
In this paper, we tackle the above limitation by proposing a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics to boost the re-identification performance.
no code implementations • 4 Dec 2019 • Joyce Fang, Martin Ellis, Bin Li, Siyao Liu, Yasaman Hosseinkashi, Michael Revow, Albert Sadovnikov, Ziyuan Liu, Peng Cheng, Sachin Ashok, David Zhao, Ross Cutler, Yan Lu, Johannes Gehrke
Bandwidth estimation and congestion control for real-time communications (i.e., audio and video conferencing) remain difficult problems, despite many years of research.
In this paper, we study the problem of 3D object detection from stereo images, in which the key challenge is how to effectively utilize stereo information.
Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller.
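The standard way to transfer this knowledge is to train the student to match the teacher's temperature-softened output distribution; a minimal sketch of that loss (following Hinton et al.'s formulation, with illustrative function names) is:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl
```

In practice this term is combined with the usual cross-entropy on ground-truth labels, weighted by a mixing coefficient.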
Specifically, for each training image, we first generate attention maps to represent the object's discriminative parts by weakly supervised learning.
Ranked #7 on Fine-Grained Image Classification on CUB-200-2011
Most existing methods are computationally expensive and cannot satisfy real-time requirements.
We present an instance segmentation scheme based on pixel affinity information, which encodes whether two pixels belong to the same instance.
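The scheme's learned affinities are not reproduced here, but the underlying idea can be sketched: given pairwise affinities (1 if two pixels share an instance, 0 otherwise), instance masks are recoverable by grouping affine neighbors, e.g. with union-find over a 4-connected grid. Function names and the flat label-map input are illustrative assumptions.

```python
def affinity(labels, p, q):
    # Toy affinity: 1 if the two pixels carry the same instance id, else 0.
    # In the paper's setting this would be predicted by a network.
    return 1 if labels[p] == labels[q] else 0

def group_by_affinity(labels, width, height):
    # Union-find over 4-connected neighbours with affinity 1,
    # recovering instance groups from pairwise affinities.
    parent = list(range(width * height))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for y in range(height):
        for x in range(width):
            p = y * width + x
            if x + 1 < width and affinity(labels, p, p + 1):
                union(p, p + 1)
            if y + 1 < height and affinity(labels, p, p + width):
                union(p, p + width)
    return [find(p) for p in range(width * height)]
```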
We propose MonoGRNet for amodal 3D object detection from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
Ranked #19 on Monocular 3D Object Detection on KITTI Cars Moderate
In this paper, we address the problem of reconstructing an object's surface from a single image using generative networks.
In addition, we propose attention regularization and attention dropout to weakly supervise the generation of attention maps.
In this paper, we improve the learning of local feature descriptors by optimizing the performance of descriptor matching, which is a common stage that follows descriptor extraction in local feature based pipelines, and can be formulated as nearest neighbor retrieval.
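The nearest-neighbor retrieval stage referred to here can be sketched as brute-force matching with Lowe's ratio test, a standard acceptance criterion in local-feature pipelines (this is a generic sketch, not the paper's optimization; function names are illustrative):

```python
def l2(a, b):
    # Euclidean distance between two descriptors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def match_descriptors(desc_a, desc_b, ratio=0.8):
    # For each query descriptor, retrieve its nearest neighbour in desc_b
    # and accept the match only if it is sufficiently closer than the
    # second nearest (Lowe's ratio test).
    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((l2(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) > 1 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches
```

Optimizing descriptors for this retrieval step, rather than for a surrogate loss alone, is what ties descriptor learning to downstream matching quality.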
The RoI-based sub-region attention map and aspect ratio attention map are selectively pooled from the banks, and then used to refine the original RoI features for RoI classification.
This paper proposes an efficient content-adaptive screen image scaling scheme for real-time screen applications such as remote desktop and screen sharing.