This second paper presents a literature review of key enabling technologies of digital twins, with an emphasis on uncertainty quantification, optimization methods, open source datasets and tools, major findings, challenges, and future directions.
In part two of this review, the roles of uncertainty quantification and optimization are discussed, a battery digital twin is demonstrated, and further perspectives on the future of digital twins are shared.
We present a novel paradigm for building an animatable 3D human representation from a monocular video input, such that it can be rendered in unseen poses and from unseen views.
However, we find that existing graph-based methods for visible-infrared person re-identification (VI-ReID) suffer from poor generalization because of two issues: 1) the train-test modality balance gap, which is an inherent property of the VI-ReID task.
Since the current frame is not available in video frame synthesis, NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel.
Ranked #1 on Video Frame Interpolation on X4K1000FPS
Besides estimating the probability distribution, our entropy model also generates the quantization step sizes in a spatial-channel-wise manner.
We propose a robust context fusion network to tackle video instance segmentation (VIS) in an online fashion, predicting instance segmentation frame by frame using a few preceding frames.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement.
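As background for the snippet above, CSVQ builds on plain vector quantization; a minimal generic VQ step (an illustrative sketch, not the paper's progressive multi-scale scheme) looks like this:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Generic vector quantization: map each feature vector to its nearest
    codeword and return the indices plus the quantized reconstruction."""
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = np.argmin(d, axis=1)
    return idx, codebook[idx]

# Toy 2-D features and a 3-codeword codebook (illustrative values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = np.array([[0.9, 1.2], [0.1, -0.2]])
idx, zq = vector_quantize(z, codebook)
print(idx.tolist())  # → [1, 0]
```

Cross-scale schemes like CSVQ apply this idea repeatedly across feature scales, refining the reconstruction at each step.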
In this paper we propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation.
We leverage the cycle consistency to discriminate the semantic consensus, thus advancing the primary task.
We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks evolve to be discriminative iteratively.
Deep neural networks often suffer from a data distribution shift between training and testing, and the batch statistics are observed to reflect this shift.
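One common way this observation is exploited (a generic sketch, not necessarily this paper's method) is to recompute normalization statistics from the test batch itself instead of reusing stale training statistics:

```python
import numpy as np

def batchnorm_forward(x, running_mean, running_var, gamma, beta, eps=1e-5):
    """Standard BatchNorm inference using stored (training) statistics."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

def adapted_batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Test-time adaptation: recompute mean/variance from the test batch,
    so the normalization tracks the shifted test distribution."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Simulated covariate shift: test features are scaled and offset.
rng = np.random.default_rng(0)
train_stats = (np.zeros(4), np.ones(4))  # mean/var seen during training
x_test = rng.normal(loc=3.0, scale=2.0, size=(256, 4))

gamma, beta = np.ones(4), np.zeros(4)
y_stale = batchnorm_forward(x_test, *train_stats, gamma, beta)
y_adapted = adapted_batchnorm_forward(x_test, gamma, beta)

# Adapted outputs are re-centred near zero; stale statistics leave the shift in.
print(abs(y_adapted.mean()) < 0.1, abs(y_stale.mean()) > 1.0)
```

The gap between the two outputs is exactly the shift the batch statistics "reflect" in the snippet above.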
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency.
Improving the generalization capability of Deep Neural Networks (DNNs) is critical for their practical use and has been a longstanding challenge.
The temporal features often contain noisy and uncorrelated information that may interfere with the restoration of the current frame.
It significantly improves the performance of several classic contrastive learning models in downstream tasks.
This paper presents ActiveMLP, a general MLP-like backbone for computer vision.
Ranked #36 on Object Detection on COCO minival
Distribution forecast can quantify forecast uncertainty and provide various forecast scenarios with their corresponding estimated probabilities.
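A standard way to obtain such a distribution forecast (a generic sketch, not necessarily this paper's model) is to fit one predictor per quantile with the pinball loss; each predicted quantile is then a scenario with a known probability:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss; its minimizer is the q-quantile of y_true."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Synthetic forecast target (illustrative: Normal(10, 2)).
rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=20000)

qhat = {}
for q in (0.1, 0.5, 0.9):
    # Brute-force the best constant forecast for this quantile.
    candidates = np.linspace(0.0, 20.0, 2001)
    losses = [pinball_loss(y, c, q) for c in candidates]
    qhat[q] = candidates[np.argmin(losses)]

# The recovered quantiles bracket the median and quantify the spread.
print(qhat[0.1] < qhat[0.5] < qhat[0.9], abs(qhat[0.5] - 10.0) < 0.2)
```

In practice the constant forecast is replaced by a learned model minimizing the same loss per quantile.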
For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance.
Deep-learning-based methods have shown advantages over traditional ones in audio coding, but limited attention has been paid to real-time communications (RTC).
In this paper, we propose a simple yet effective recursive least-squares estimator-aided online learning approach for few-shot online adaptation without requiring offline training.
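The recursive least-squares (RLS) estimator named above has a classic closed-form update; a minimal generic sketch of one RLS step (illustrative, not the paper's full adaptation pipeline):

```python
import numpy as np

def rls_update(theta, P, x, y, lam=0.99):
    """One recursive least-squares step: update parameters theta and the
    inverse-covariance matrix P with a new sample (x, y).
    lam is the forgetting factor (1.0 = no forgetting)."""
    x = x.reshape(-1, 1)
    Px = P @ x
    k = Px / (lam + (x.T @ Px).item())   # gain vector
    err = y - (x.T @ theta).item()       # prediction error on the new sample
    theta = theta + k * err
    P = (P - k @ Px.T) / lam
    return theta, P

# Few-shot online adaptation toy: recover w from a handful of streaming samples.
rng = np.random.default_rng(2)
w_true = np.array([[2.0], [-1.0], [0.5]])
theta = np.zeros((3, 1))
P = np.eye(3) * 100.0                    # large initial P = weak prior
for _ in range(20):
    x = rng.normal(size=3)
    y = (x @ w_true).item()
    theta, P = rls_update(theta, P, x, y)
print(np.allclose(theta, w_true, atol=1e-2))
```

Because each update is a rank-one correction, no offline training or stored dataset is needed, which is what makes RLS attractive for online few-shot adaptation.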
Human action detection is an active research topic with wide applications in video surveillance, human-machine interfaces, healthcare monitoring, gaming, dance training, and musical-instrument teaching.
We introduce two modulators, propagation and correction modulators, to separately perform channel-wise re-calibration on the target frame embeddings according to local temporal correlations and reliable references respectively.
Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames.
From the stored propagated features, we propose to learn multi-scale temporal contexts, and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder.
Instance segmentation is a challenging task aiming at classifying and segmenting all object instances of specific classes.
By inserting the proposed cross-stage mechanism in existing spatial and temporal transformer blocks, we build a separable transformer network for video learning based on ViT structure, in which self-attentions and features are progressively aggregated from one block to the next.
Deep-learning-based video compression is a challenging task, and many previous state-of-the-art learning-based video codecs use optical flow to exploit the temporal correlation between successive frames and then compress the residual error.
Therefore, we argue that the task-relevant information that is not shared between views cannot be ignored, and we theoretically prove that the minimal sufficient representation in contrastive learning is not sufficient for the downstream tasks, which causes performance degradation.
Our method contains two training stages based on model-agnostic meta learning (MAML), each of which consists of a contrastive branch and a meta branch.
Ranked #20 on Self-Supervised Action Recognition on UCF101
In this paper, we propose a Geometry Uncertainty Projection Network (GUP Net) to tackle the error amplification problem at both inference and training stages.
However, spatial correlations and temporal correlations capture different contextual information, corresponding to scene content and temporal reasoning, respectively.
Detecting and localizing objects in the real 3D space, which plays a crucial role in scene understanding, is particularly challenging given only a monocular image due to the geometric information loss during imagery projection.
Specifically, we propose a phoneme-based distribution regularization (PbDr) for speech enhancement, which incorporates frame-wise phoneme information into speech enhancement network in a conditional manner.
This paper proposes MCSSL, a self-supervised learning approach for building custom object detection models in multi-camera networks.
We develop a conceptually simple, flexible, and effective framework (named T-Net) for two-view correspondence learning.
In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net.
Ranked #1 on Speech Enhancement on Deep Noise Suppression (DNS) Challenge (PESQ-NB metric)
A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes.
Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles.
In this paper, we consider the problem of the scattering of in-plane waves at an interface between a homogeneous medium and a metamaterial.
In this paper, we tackle the above limitation by proposing a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics to boost the re-identification performance.
no code implementations • 4 Dec 2019 • Joyce Fang, Martin Ellis, Bin Li, Siyao Liu, Yasaman Hosseinkashi, Michael Revow, Albert Sadovnikov, Ziyuan Liu, Peng Cheng, Sachin Ashok, David Zhao, Ross Cutler, Yan Lu, Johannes Gehrke
Bandwidth estimation and congestion control for real-time communications (i.e., audio and video conferencing) remain a difficult problem, despite many years of research.
In this paper, we study the problem of 3D object detection from stereo images, in which the key challenge is how to effectively utilize stereo information.
Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller.
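The classic distillation objective (Hinton et al.'s formulation, shown here as general background rather than this paper's specific loss) combines hard-label cross-entropy with a KL term between temperature-softened teacher and student distributions:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard-label cross-entropy plus KL(teacher || student) at temperature T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    # T^2 rescales the soft-target gradients, as in the original formulation.
    return np.mean(alpha * ce + (1 - alpha) * (T ** 2) * kl)

# A student that matches the teacher scores lower loss than one that disagrees.
teacher = np.array([[5.0, 1.0, -2.0]])
labels = np.array([0])
aligned = distillation_loss(np.array([[5.0, 1.0, -2.0]]), teacher, labels)
misaligned = distillation_loss(np.array([[-2.0, 1.0, 5.0]]), teacher, labels)
print(aligned < misaligned)  # → True
```

The temperature T controls how much of the teacher's "dark knowledge" about non-target classes the student sees.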
Specifically, for each training image, we first generate attention maps to represent the object's discriminative parts by weakly supervised learning.
Ranked #10 on Fine-Grained Image Classification on CUB-200-2011
Most existing methods are computationally expensive and cannot satisfy real-time requirements.
We present an instance segmentation scheme based on pixel affinity information, which indicates whether two pixels belong to the same instance.
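One generic way to turn such pairwise affinities into instance labels (an illustrative sketch, not necessarily this paper's grouping procedure) is to threshold them and take connected components with union-find:

```python
def make_instances(h, w, affinities, thresh=0.5):
    """Group pixels into instances from pairwise affinities.
    affinities: dict mapping pixel-index pairs (i, j) to affinity in [0, 1]."""
    parent = list(range(h * w))

    def find(i):
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pixel pair whose affinity clears the threshold.
    for (i, j), a in affinities.items():
        if a >= thresh:
            parent[find(i)] = find(j)
    return [find(i) for i in range(h * w)]

# Toy 1x4 "image": pixels 0-1 form one instance, pixels 2-3 another.
aff = {(0, 1): 0.9, (1, 2): 0.1, (2, 3): 0.8}
labels = make_instances(1, 4, aff)
print(labels[0] == labels[1], labels[1] != labels[2], labels[2] == labels[3])
```

Real systems predict dense affinities with a network and typically use more robust graph-partitioning than a single hard threshold.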
We propose MonoGRNet for amodal 3D object detection from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
Ranked #21 on Monocular 3D Object Detection on KITTI Cars Moderate
In this paper, we address the problem of reconstructing an object's surface from a single image using generative networks.
In addition, we propose attention regularization and attention dropout to weakly supervise the generation of attention maps.
In this paper, we improve the learning of local feature descriptors by optimizing the performance of descriptor matching, which is a common stage that follows descriptor extraction in local feature based pipelines, and can be formulated as nearest neighbor retrieval.
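The nearest-neighbor retrieval stage named above is commonly implemented with mutual distance comparison plus Lowe's ratio test; a minimal generic sketch (illustrative, not this paper's matching pipeline):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbor retrieval with Lowe's ratio test: accept a match only
    when the best neighbor is clearly closer than the second best."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :2]  # two nearest neighbors in desc_b
    matches = []
    for i, (j1, j2) in enumerate(nn):
        if d[i, j1] < ratio * d[i, j2]:
            matches.append((i, int(j1)))
    return matches

# Toy 2-D descriptors: each row of `a` has one clearly closest row in `b`.
a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[0.1, 0.0], [5.0, 5.0], [1.0, 0.9]])
print(match_descriptors(a, b))  # → [(0, 0), (1, 2)]
```

Optimizing descriptors directly for this retrieval criterion, rather than for pairwise distances alone, is the gap the snippet above targets.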
The RoI-based sub-region attention map and aspect ratio attention map are selectively pooled from the banks, and then used to refine the original RoI features for RoI classification.
This paper proposes an efficient content-adaptive screen image scaling scheme for real-time screen applications such as remote desktop and screen sharing.