In contrast, a discriminative classifier only models the conditional distribution of labels given inputs, but benefits from effective optimization owing to its succinct structure.
This study aims to develop a novel computer-aided diagnosis (CAD) scheme for mammographic breast mass classification using semi-supervised learning.
The goal of this paper is to conduct a comprehensive study on the facial sketch synthesis (FSS) problem.
Multi-modal fusion has proven to be an effective method to improve the accuracy and robustness of speaker tracking, especially in complex scenarios.
In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed.
Therefore, we propose a transformer-based Pose-guided Feature Disentangling (PFD) method by utilizing pose information to clearly disentangle semantic components (e.g., human body or joint parts) and selectively match non-occluded parts correspondingly.
Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion.
This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module.
Representation modeling based on user behavior sequences is an important direction in user cognition.
Third, inspired by the theoretical insights, we devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets with several evaluation criteria, closing the small gap between balanced and imbalanced datasets with the same number of examples.
A critical issue with the frame-based model is that it pursues the best frame-level prediction rather than the best event-level prediction.
Universal user representation has attracted much interest recently, as it frees us from the cumbersome work of training a specific model for each downstream application.
In this paper, we propose the Variational Latent-State GPT model (VLS-GPT), which is the first to combine the strengths of the two approaches.
However, because of the heavy computational cost and high GPU memory usage of the vision Transformer, the network cannot be made too deep.
Robustness against word substitutions has a well-defined and widely acceptable form, i.e., using semantically similar words as substitutions, and thus it is considered a fundamental stepping-stone towards broader robustness in natural language processing.
Deep learning has received extensive research interest in developing new medical image processing algorithms, and deep learning based models have been remarkably successful in a variety of medical imaging tasks to support disease detection and diagnosis.
Instead, exploiting multi-view information is a practical way to achieve absolute 3D human pose estimation.
We argue this is due to the lack of rich information in the probability prediction and the overfitting caused by hard labels.
The performance of existing underwater object detection methods degrades seriously when facing the domain shift problem caused by complicated underwater environments.
The modified VTE is termed the Strided Transformer Encoder (STE), which is built upon the outputs of VTE.
Ranked #1 on 3D Human Pose Estimation on HumanEva-I
We present a sufficient condition for the stability property of extremal graph problems that can be solved via Zykov's symmetrisation.
RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons from heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions.
We prove an asymptotically tight bound on the extremal density guaranteeing subdivisions of bounded-degree bipartite graphs with a mild separability condition.
Combinatorics 05C83, 05C35
(2) Since the target data arrive online, the agent should also maintain competence on previous target domains, i.e., adapt without forgetting.
Recent work has found that fine-tuning and joint training, two popular approaches for transfer learning, do not always improve accuracy on downstream tasks.
Then, support vector machine (SVM) models embedded with several feature dimensionality reduction methods are built to predict the likelihood of lesions being malignant.
Deep convolutional neural networks (DCNNs) have dominated as the best performers in machine learning, but can be challenged by adversarial attacks.
Based on the objective coordinate system in the frame of the oblique shock structure, it is found that the nature of the three-dimensional lift-off structure of a shock-induced streamwise vortex is inherently and precisely controlled by a two-stage growth mode of the structure kinetics of a shock bubble interaction (SBI for short).
As an angularly discriminative feature space is important for classifying the human images based on their embedding vectors, in this paper, we propose a novel ranking loss function, named Bi-directional Exponential Angular Triplet Loss, to help learn an angularly separable common feature space by explicitly constraining the included angles between embedding vectors.
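The included angle between two embedding vectors, which the sentence above constrains, can be computed as in the following minimal sketch. It only illustrates the angle itself and is not the Bi-directional Exponential Angular Triplet Loss; the function name `included_angle` is a hypothetical helper introduced here.

```python
import math


def included_angle(u, v):
    """Return the angle in radians between two non-zero vectors u and v.

    Illustrative helper only (an assumption, not the paper's loss function).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)
```

An angular ranking loss would typically ask that this angle be smaller for same-identity pairs than for different-identity pairs; how that constraint is enforced is specific to the paper and is not reproduced here.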
Then the activated dictionary atoms are assembled and passed to the compound dictionary learning and coding layers.
To reduce the solution space, we first model the adversarial perturbation optimization problem as a process of recovering frequency-sparse perturbations with compressed sensing, under the setting that random noise in the low-frequency space is more likely to be adversarial.
This paper aims to build a GUOD with a small underwater dataset with limited types of water quality.
To compensate for the impact of time offset, our method includes two short-term motion interpolation algorithms for the camera and IMU pose estimation.
The convolution operation suffers from a limited receptive field, while global modeling is fundamental to dense prediction tasks, such as semantic segmentation.
3D skeleton-based action recognition, owing to the latent advantages of skeleton, has been an active topic in computer vision.
In this paper, we analyze the limitations of existing symmetric GAN models in asymmetric translation tasks, and propose an AsymmetricGAN model with translation and reconstruction generators of unequal sizes and a different parameter-sharing strategy to adapt to the asymmetric need in both unsupervised and supervised image-to-image translation tasks.
The proposed model consists of a single generator and a discriminator taking a conditional image and the target controllable structure as input.
Ranked #1 on Cross-View Image-to-Image Translation on Dayton (64x64) - ground-to-aerial (LPIPS metric)
State-of-the-art methods in image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data.
Ranked #1 on Facial Expression Translation on AR Face
For this, we train a transformer model using audio feature sequences and their phoneme sequences with lexical stress marks.
Deep learning models have shown their vulnerabilities to universal adversarial perturbations (UAP), which are quasi-imperceptible.
In this paper, we describe in detail the system we submitted to DCASE2019 task 4: sound event detection (SED) in domestic environments.
It is designed to compute the representation of each position by a weighted sum of the features at all positions.
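As a rough illustration of the weighted-sum idea in the sentence above, the following minimal sketch computes each position's output as a weighted sum of the features at all positions. The dot-product similarity, the softmax normalization, and the function name `weighted_sum_all_positions` are assumptions made for illustration; the source does not state how the weights are obtained.

```python
import math


def weighted_sum_all_positions(features):
    """Return one output vector per position.

    Each output is a weighted sum of the feature vectors at all positions.
    Assumption: the weights are a softmax over dot-product similarities.
    """
    outputs = []
    for query in features:
        # Similarity of this position to every position (including itself).
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        # Numerically stable softmax so the weights sum to 1.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the features at all positions.
        dim = len(query)
        out = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        outputs.append(out)
    return outputs
```

Because the weights for each position are non-negative and sum to one, every output is a convex combination of the input features, so each position can draw on information from the whole input rather than a local neighborhood.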
Ranked #5 on Semantic Segmentation on PASCAL VOC 2012 test
In the context of e-payment transaction graphs, the resultant node and edge embeddings can effectively characterize the user-background as well as the financial transaction patterns of individual account holders.
Instead of designing a single model by considering a trade-off between the two sub-targets, we design a teacher model aiming at audio tagging to guide a student model aiming at boundary detection to learn using the unlabeled data.
While several methods have been proposed to address OSDA, none of them takes into account the openness of the target domain, which is measured by the proportion of unknown classes in all target classes.
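The openness measure described above, the proportion of unknown classes among all target classes, can be sketched as follows. The function name `openness` and the convention that a target class counts as unknown when it is absent from the known (source) label set are assumptions for illustration.

```python
def openness(known_classes, target_classes):
    """Proportion of target classes that are unknown.

    Assumption: a target class is "unknown" if it is not in known_classes.
    """
    target = set(target_classes)
    if not target:
        raise ValueError("target_classes must not be empty")
    unknown = target - set(known_classes)
    return len(unknown) / len(target)
```

For example, with known classes `{"cat", "dog"}` and target classes `{"cat", "dog", "bird", "fish"}`, two of the four target classes are unknown, so the openness is 0.5.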
In this paper, a special decision surface for the weakly-supervised sound event detection (SED) and a disentangled feature (DF) for the multi-label problem in polyphonic SED are proposed.
Experimental results show that the DB-ResNet achieves superior segmentation performance with an average Dice score of 82.74% on the dataset.
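For reference, the Dice score of a binary segmentation is commonly computed as twice the overlap divided by the total size of the two masks. The sketch below assumes flat binary masks of equal length and returns 1.0 when both masks are empty; the function name `dice_score` and that empty-mask convention are assumptions, not details taken from the source.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / size_sum
```

An average Dice score such as the one reported above would then be the mean of this value over the evaluated cases, expressed as a percentage.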
The CNN architecture in the first stage is based on the improved UNet segmentation network to establish an initial detection of lung nodules.
Notably, the proposed HCOH can be embedded with supervised labels and it is not limited to a predefined category number.
In computer vision, image datasets used for classification are naturally associated with multiple labels and comprised of multiple views, because each image may contain several objects (e.g., pedestrian, bicycle and tree) and is properly characterized by multiple visual features (e.g., color, texture and shape).
In this paper, we propose a novel supervised online hashing method, termed Balanced Similarity for Online Discrete Hashing (BSODH), to solve the above problems in a unified framework.
Gesture recognition is a hot topic in computer vision and pattern recognition, and it plays a vitally important role in natural human-computer interfaces.
Ranked #1 on Hand Gesture Recognition on Cambridge
In this paper, we make the first attempt towards visual feature translation to break through the barrier of using features across different visual search systems.
A high-accuracy and high-efficiency finite-time Lyapunov exponent (FTLE) calculation method has long been a research hotspot, and adaptive refinement is one class of methods in this field.
Estimation of the frequency and duration of logos in videos is important and challenging in the advertisement industry as a way of estimating the impact of ad purchases.
In heavy rain, rain streaks have various directions and shapes, which can be regarded as the accumulation of multiple rain streak layers.
Ranked #6 on Single Image Deraining on Test100
In principle, CerfGAN contains a novel component, i.e., a multi-class discriminator (MCD), which gives the model an extremely powerful ability to match multiple translation mappings.
The localization of BMC is achieved using a color-transformation-enhanced BMC sample image and the stepwise averaging method (SAM).
Recent works have shown the benefit of integrating Conditional Random Fields (CRFs) models into deep architectures for improving pixel-level prediction tasks.
Then, motion and shape cues are jointly used to generate robust and distinctive spatial-temporal interest points (STIPs): motion-based STIPs and shape-based STIPs.
First, a sequence-based view invariant transform is developed to eliminate the effect of view variations on spatio-temporal locations of skeleton joints.
Ranked #2 on Skeleton Based Action Recognition on UWA3D
In this paper, we propose a hashing scheme, termed Fusion Similarity Hashing (FSH), which explicitly embeds the graph-based fusion similarity across modalities into a common Hamming space.
Extensive experiments on the SmartHome dataset and the large-scale NTU RGB-D dataset demonstrate that our method outperforms most RNN-based methods, which verifies the complementary property between spatial and temporal information and the robustness to noise.
This paper proposes a novel human action recognition method using the decision-level fusion of both skeleton and depth sequences.
Given a large-scale training data set, it is very expensive to embed such ranking tuples in binary code learning.
Experimental results on three public datasets and two proposed datasets demonstrate the superiority of the proposed approach, indicating the effectiveness of body structure and orientation information for improving re-identification performance.