In this paper, we revisit the importance of feature difference for change detection in RSI, and propose a series of operations to fully exploit the difference information: Alignment, Perturbation and Decoupling (APD).
In this paper, we focus on a recently proposed novel task called Audio-Visual Segmentation (AVS), where the fine-grained correspondence between audio stream and image pixels is required to be established.
The alignment model performs temporal and spatial action alignment sequentially at the feature level, leading to more precise measurements of inter-video similarity.
Collecting large-scale datasets is crucial for training deep models. Annotating the data, however, inevitably yields noisy labels, which pose challenges for deep learning algorithms.
Ranked #2 on Image Classification on Clothing1M (using extra training data)
Training deep neural networks (DNNs) with noisy labels is challenging because DNNs can easily memorize inaccurate labels, leading to poor generalization.
Robust autonomous driving requires agents to accurately identify unexpected areas in urban scenes.
In this paper, for the first time, we investigate the contributions of different distance metrics, and propose an adaptive fusion scheme, bringing significant improvements in few-shot classification.
Specifically, for the clean set, we deliberately design a memory-based modulation scheme to dynamically adjust the contribution of each sample in terms of its historical credibility sequence during training, thus alleviating the effect from noisy samples incorrectly grouped into the clean set.
Unsupervised Domain Adaptation (UDA) aims to adapt the model trained on the labeled source domain to an unlabeled target domain.
Ranked #13 on Unsupervised Domain Adaptation on SYNTHIA-to-Cityscapes
Despite considerable effort to improve domain adaptation (DA) under unsupervised or few-shot semi-supervised settings, active learning has recently started to attract more attention because it offers a more practical way to transfer a model with limited annotation resources on target data.
This article is a gentle discussion about the field of reinforcement learning in practice, about opportunities and challenges, touching a broad range of topics, with perspectives and without technical details.
We conduct a comprehensive comparison and detailed analysis on the challenging DAVIS16, DAVIS17 and YouTube-VOS benchmarks, demonstrating that the cyclic mechanism helps to enhance segmentation quality and improve the robustness of VOS systems, and we further provide qualitative comparison and interpretation of how different VOS algorithms work.
We demonstrate that both temporal grains are beneficial to atomic action recognition.
For the clean set, we deliberately design a memory-based modulation scheme to dynamically adjust the contribution of each sample in terms of its historical credibility sequence during training, thus alleviating the effect of potential hard noisy samples in the clean set.
The crux of self-supervised video representation learning is to build general features from unlabeled videos.
The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing action-irrelevant features (e.g., background).
Pedestrian detection in a crowd is a challenging task due to a high number of mutually-occluding human instances, which brings ambiguity and optimization difficulties to the current IoU-based ground truth assignment procedure in classical object detection methods.
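The ambiguity of IoU-based assignment under mutual occlusion can be illustrated with a small sketch. The box coordinates and the helper name `iou` below are illustrative assumptions, not taken from the paper: a candidate box lying between two overlapping pedestrians scores the same IoU against both ground-truth boxes, so an IoU threshold alone cannot say which instance it belongs to.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: two mutually occluding pedestrians and one candidate
# box that sits halfway between them.
gt_left = (0.0, 0.0, 10.0, 20.0)
gt_right = (4.0, 0.0, 14.0, 20.0)
candidate = (2.0, 0.0, 12.0, 20.0)

iou_left = iou(candidate, gt_left)
iou_right = iou(candidate, gt_right)
print(iou_left, iou_right)  # both values are 2/3, so the assignment is ambiguous
```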
In this paper, we address several inadequacies of current video object segmentation pipelines.
The task of spatial-temporal action detection has attracted increasing attention among researchers.
Ranked #3 on Action Detection on UCF Sports (Video-mAP 0.2 metric)
Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization.
To this end, we present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve (Human-centric video analysis in complex Events), for the understanding of human motions, poses, and actions in a variety of realistic events, especially in crowded and complex events.
The TRP-trained network inherently has a low-rank structure and can be approximated with negligible performance loss, thus eliminating the fine-tuning process after low-rank decomposition.
To accelerate DNN inference, low-rank approximation has been widely adopted because of its solid theoretical rationale and efficient implementations.
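As a minimal sketch of the general idea (a plain truncated SVD, not necessarily the decomposition used in the paper), a dense weight matrix can be replaced by two thinner factors. The function name `low_rank_factors`, the layer sizes and the rank are assumptions chosen for illustration, and the weight here is constructed to be exactly rank 8 so that the factorization is lossless.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Split `weight` (out x in) into factors a (out x rank) and b (rank x in)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale the kept left singular vectors
    b = vt[:rank, :]            # kept right singular vectors
    return a, b


rng = np.random.default_rng(0)
# Hypothetical 64 x 128 layer whose weight has rank 8 by construction.
weight = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
a, b = low_rank_factors(weight, rank=8)

x = rng.standard_normal(128)
y_full = weight @ x   # one 64 x 128 product
y_low = a @ (b @ x)   # an 8 x 128 product followed by a 64 x 8 product

params_full = weight.size     # 8192 parameters
params_low = a.size + b.size  # 1536 parameters
```

For a trained network the weights are usually only approximately low-rank, so truncation introduces an error that grows as the rank shrinks.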
We start with a brief introduction to reinforcement learning (RL), covering its success stories, basics, an example, issues, the ICML 2019 Workshop on RL for Real Life, how to use it, study material and an outlook.
Segmenting coronary arteries is challenging, as classic unsupervised methods fail to produce satisfactory results and modern supervised learning (deep learning) requires manual annotation, which is often time-consuming and can sometimes be infeasible.
The task of re-identifying groups of people under different camera views is an important yet less-studied problem. Group re-identification (Re-ID) is a very challenging task since it is not only adversely affected by common issues in traditional single-object Re-ID problems such as viewpoint and human pose variations, but it also suffers from changes in group layout and group membership.
We propose Trained Rank Pruning (TRP), which alternates between low-rank approximation and training.
Depthwise separable convolution has shown great efficiency in network design, but requires a time-consuming training procedure with the full training set available.
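The efficiency of depthwise separable convolution shows up in a simple parameter count. The sketch below compares a standard convolution with its depthwise-plus-pointwise counterpart. The kernel size and channel widths are illustrative assumptions, and biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise (kernel x kernel per input channel) convolution
    followed by a pointwise (1 x 1) convolution, without biases."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)    # 73728
separable = separable_conv_params(3, 64, 128)  # 8768
print(standard, separable, round(standard / separable, 2))
```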
Based on the deeply supervised object detection (DSOD) framework, we propose Tiny-DSOD, dedicated to resource-restricted usage.