To make a deeper understanding of BN, in this work we prove that BN actually introduces a certain level of noise into the sample mean and variance during the training process, while the noise level depends only on the batch size.
To address the scarcity of meteorological prediction data, we constructed the RainBench, a large-scale radar echo dataset specific to the unique precipitation characteristics of inland regions in China for precipitation prediction.
Existing knowledge distillation works for semantic segmentation mainly focus on transferring high-level contextual knowledge from teacher to student.
An intuitive solution is ``coupling'' the CAM with the long-range attention matrix of visual transformers (ViT) We find that the direct ``coupling'', e. g., pixel-wise multiplication of attention and activation, achieves a more global coverage (on the foreground), but unfortunately goes with a great increase of false positives, i. e., background pixels are mistakenly included.
Face clustering is a promising way to scale up face recognition systems using large-scale unlabeled face images.
Since Intersection-over-Union (IoU) based optimization maintains the consistency of the final IoU prediction metric and losses, it has been widely used in both regression and classification branches of single-stage 2D object detectors.
The likelihood maps generated by the SLV module are used to supervise the feature learning of the backbone network, encouraging the network to attend to wider and more diverse areas of the image.
Monocular 3D object detection is an essential task in autonomous driving.
Structural re-parameterization has drawn increasing attention in various computer vision tasks.
The application of cross-dataset training in object detection tasks is complicated because the inconsistency in the category range across datasets transforms fully supervised learning into semi-supervised learning.
To address this problem, we formulate CTR prediction as a continual learning task and propose COLF, a hybrid COntinual Learning Framework for CTR prediction, which has a memory-based modular architecture that is designed to adapt, learn and give predictions continuously when faced with non-stationary drifting click data streams.
Despite the promising progress having been made, the two challenges of multi-view clustering (MVC) are still waiting for better solutions: i) Most existing methods are either not qualified or require additional steps for incomplete multi-view clustering and ii) noise or outliers might significantly degrade the overall clustering performance.
Taking meta features as reference, we propose compositional operations to eliminate irrelevant features of local convolutional features by an addressing process and then to reformulate the convolutional feature maps as a composition of related meta features.
In this paper, we introduce the balanced and hierarchical learning for our detector.
We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that the bias is the major visual reasoning bottleneck.
Graph Neural Networks (GNNs) have become increasingly popular and achieved impressive results in many graph-based applications.
Unsupervised Person Re-identification (U-ReID) with pseudo labeling recently reaches a competitive performance compared to fully-supervised ReID methods based on modern clustering algorithms.
A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics).
Finding a suitable density function is essential for density-based clustering algorithms such as DBSCAN and DPC.
We propose NETS-ImpGAN, a novel deep learning framework that can be trained on incomplete data with missing values in both history and future.
In this paper, we propose an embarrassing simple yet highly effective adversarial domain adaptation (ADA) method for effectively training models for alignment.
Temporal relational data, perhaps the most commonly used data type in industrial machine learning applications, needs labor-intensive feature engineering and data analyzing for giving precise model predictions.
Though 3D object detection from point clouds has achieved rapid progress in recent years, the lack of flexible and high-performance proposal refinement remains a great hurdle for existing state-of-the-art two-stage detectors.
Current geometry-based monocular 3D object detection models can efficiently detect objects by leveraging perspective geometry, but their performance is limited due to the absence of accurate depth information.
The inheritance part is learned with a similarity loss to transfer the existing learned knowledge from the teacher model to the student model, while the exploration part is encouraged to learn representations different from the inherited ones with a dis-similarity loss.
Driven by the success of deep learning, the last decade has seen rapid advances in person re-identification (re-ID).
Click-through rate (CTR) prediction plays an important role in online advertising and recommender systems.
Besides, CHCF integrates criterion learning and user preference learning into a unified framework, which can be trained jointly for the interaction prediction of the target behavior.
In this paper, we propose a DiscRiminative-gEnerative duAl Memory (DREAM) anomaly detection model to take advantage of a few anomalies and solve data imbalance.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
Despite their success for semantic segmentation, convolutional neural networks are ill-equipped for incremental learning, \ie, adapting the original segmentation model as new classes are available but the initial training data is not retained.
Specifically, we introduce Gait recognition as an auxiliary task to drive the Image ReID model to learn cloth-agnostic representations by leveraging personal unique and cloth-independent gait information, we name this framework as GI-ReID.
Ranked #5 on Person Re-Identification on PRCC
The key challenge is how to distinguish the action of interest segments from the background, which is unlabelled even on the video-level.
The CNN encoder is responsible for efficiently extracting discriminative spatial features while the DI decoder is designed to densely model spatial-temporal inherent interaction across frames.
Ranked #1 on Person Re-Identification on DukeMTMC-reID
The past years have witnessed an explosion of deep learning frameworks like PyTorch and TensorFlow since the success of deep neural networks.
In this paper, we propose a novel solution for object-matching based semi-supervised video object segmentation, where the target object masks in the first frame are provided.
Second, different body parts possess different scales, and even the same part in different frames can appear at different locations and scales.
Ranked #2 on Gait Recognition on OUMVLP
These camera-aware proxies enable us to deal with large intra-ID variance and generate more reliable pseudo labels for learning.
To the best of our knowledge, our CADDet is the first work to introduce dynamic routing mechanism in object detection.
The topology of the adjacency graph is a key factor for modeling the correlations of the input skeletons.
Generally, a low-resolution grid is not sufficient to capture the details, while a high-resolution grid dramatically increases the training complexity.
It re-extracts the features of the tracklets in the current frame based on motion predicting, which is the key to solve the problem of features inconsistent.
However, due to the inefficient representation ability of the pre-trained model, many false positives and negatives in local semantic similarity will be introduced and lead to error propagation during the hash code learning.
On one hand, it has a harmful causal effect that misleads the tail prediction biased towards the head.
Ranked #34 on Long-tail Learning on CIFAR-10-LT (ρ=10)
Today, scene graph generation(SGG) task is largely limited in realistic scenarios, mainly due to the extremely long-tailed bias of predicate annotation distribution.
Ranked #4 on Unbiased Scene Graph Generation on Visual Genome
Therefore, it is critical to learn an apparel-invariant person representation under cases like cloth changing or several persons wearing similar clothes.
Most deep-learning-based image classification methods assume that all samples are generated under an independent and identically distributed (IID) setting.
The simulation results show that the grand coalition, where all UAVs join a single coalition, is not always stable due to the profit-maximizing behavior of the UAVs.
Networking and Internet Architecture Signal Processing
Coupled with the rise of Deep Learning, the wealth of data and enhanced computation capabilities of Internet of Vehicles (IoV) components enable effective Artificial Intelligence (AI) based models to be built.
Signal Processing Networking and Internet Architecture
It has been shown that using the first and second order statistics (e. g., mean and variance) to perform Z-score standardization on network activations or weight vectors, such as batch normalization (BN) and weight standardization (WS), can improve the training performance.
Nearest neighbor search aims to obtain the samples in the database with the smallest distances from them to the queries, which is a basic task in a range of fields, including computer vision and data mining.
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e. g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach".
Ranked #1 on Scene Graph Generation on Visual Genome
We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA.
Ranked #23 on Image Captioning on COCO Captions
Our approach performs even comparable to state-of-the-art fully supervised methods in two of the datasets.
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial).
The proposed quantization function can be learned in a lossless and end-to-end manner and works for any weights and activations of neural networks in a simple and uniform way.
In the user representation module, we propose an attentive multi-view learning framework to learn unified representations of users from their heterogeneous behaviors such as search queries, clicked news and browsed webpages.
Model fine-tuning is a widely used transfer learning approach in person Re-identification (ReID) applications, which fine-tuning a pre-trained feature extraction model into the target scenario instead of training a model from scratch.
Since different words and different news articles may have different informativeness for representing news and users, we propose to apply both word- and news-level attention mechanism to help our model attend to important words and news articles.
In the user encoder we learn the representations of users based on their browsed news and apply attention mechanism to select informative news for user representation learning.
Ranked #6 on News Recommendation on MIND
With the gained annotations of the actively selected candidates, the tracklets' pesudo labels are updated by label merging and further used to re-train our re-ID model.
To capture the graph dynamics, we use the graph prediction stream to predict the dynamic graph structures, and the predicted structures are fed into the flow prediction stream.
Person re-identification (Person ReID) is a challenging task due to the large variations in camera viewpoint, lighting, resolution, and human pose.