Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than the attention-based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of jointly modeling TTS and ASR in a single network.
Considering that there is a large amount of ASR training data, a straightforward method is to leverage ASR data to enhance ALT training.
To sufficiently exploit such important information for recommendation, it is essential to disentangle the benign popularity bias caused by item quality from the harmful popularity bias caused by conformity.
In this paper, we identify and solve the trigger curse problem in few-shot event detection (FSED) from a causal view.
For the task of metal artifact reduction (MAR), although deep learning (DL)-based methods have achieved promising performance, most of them suffer from two problems: 1) the CT imaging geometry constraint is not fully embedded into the network during training, leaving room for further performance improvement; 2) model interpretability has not received sufficient consideration.
Knowledge graph completion (KGC) has become a focus of attention across the deep learning community owing to its contribution to numerous downstream tasks.
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.
The holistic knowledge is represented as a unified graph-based embedding by aggregating individual knowledge from relational neighborhood samples with graph neural networks; the student network is then learned by distilling the holistic knowledge in a contrastive manner.
In this work, we propose a novel Time-aware Path reasoning for Recommendation (TPRec for short) method, which leverages the potential of temporal information to offer better recommendations with plausible explanations.
In this paper, we report a novel technique, termed single-shot SIM, to overcome these limitations.
In practical applications, outdoor weather and illumination vary (e.g., cloudy or nighttime conditions), which causes a significant drop in the semantic segmentation accuracy of CNNs trained only on daytime data.
It is based on the idea that similar users not only have similar tastes in items, but also have similar treatment effects under recommendations.
This provides a valuable opportunity to develop a universal solution for debiasing, e.g., by learning the debiasing parameters from data.
Learning and analyzing rap lyrics is a significant basis for many web applications, such as music recommendation, automatic music categorization, and music information retrieval, due to the abundance of digital music on the World Wide Web.
To reduce the difficulty in the discovery of causal structure, we relax it to the sparse associative structure and propose a novel sparse associative structure alignment model for domain adaptation.
However, social network information may not be available in many recommender systems, which hinders the application of SamWalker.
To deal with these problems, we propose an efficient and effective collaborative sampling method CoSam, which consists of: (1) a collaborative sampler model that explicitly leverages user-item interaction information in sampling probability and exhibits good properties of normalization, adaption, interaction information awareness, and sampling efficiency; and (2) an integrated sampler-recommender framework, leveraging the sampler model in prediction to offset the bias caused by uneven sampling.
Existing work addresses this issue with Inverse Propensity Weighting (IPW), which decreases the impact of popular items on the training and increases the impact of long-tail items.
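The reweighting idea behind IPW can be illustrated with a minimal sketch. It assumes that an item's propensity is approximated by its share of all logged interactions; the function names `ipw_weights` and `weighted_loss`, the `power` and `clip` parameters, and the toy data are hypothetical and not taken from any specific system.

```python
from collections import Counter


def ipw_weights(interactions, power=1.0, clip=None):
    """Return one weight per (user, item) interaction.

    Assumption: propensity is estimated as the item's empirical share of
    all interactions, raised to `power`. Real systems may estimate
    propensities differently.
    """
    counts = Counter(item for _, item in interactions)
    total = len(interactions)
    weights = []
    for _, item in interactions:
        propensity = (counts[item] / total) ** power
        weight = 1.0 / propensity
        if clip is not None:
            # Optional clipping limits the variance caused by very rare items.
            weight = min(weight, clip)
        weights.append(weight)
    return weights


def weighted_loss(losses, weights):
    """Self-normalized weighted average of per-interaction losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


# Toy log: "hit" is popular, "niche" is a long-tail item.
interactions = [("u1", "hit"), ("u2", "hit"), ("u3", "hit"), ("u1", "niche")]
weights = ipw_weights(interactions)
```

In this toy log the long-tail interaction receives a larger weight than each popular one, so it contributes more to the training loss.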
The original design of Graph Convolution Network (GCN) couples feature transformation and neighborhood aggregation for node representation learning.
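The coupling of the two steps can be made concrete with a small sketch of a single layer. It assumes the commonly used symmetric normalization with self-loops and a ReLU activation; the helper names and the dense list-of-lists representation are illustrative choices, not a reference implementation.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, features, weight):
    """One layer: feature transformation followed by neighborhood aggregation."""
    transformed = matmul(features, weight)                      # H W
    aggregated = matmul(normalize_adjacency(adj), transformed)  # A_norm (H W)
    return [[max(0.0, v) for v in row] for row in aggregated]   # ReLU


# Two connected nodes with one-hot features and an identity weight matrix.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(adj, features, weight)
```

Both operations happen inside the same layer here: the weight matrix transforms the features, and the normalized adjacency matrix mixes them across neighbors.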
This motivates us to provide a systematic survey of existing work on RS biases.
To tackle the difficulty of singing modeling caused by high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and vocoder to improve singing modeling.
Empirical results confirm the efficiency and effectiveness of residual frames as well as the proposed pseudo-3D convolution module.
Two colonoscopic datasets from different centres, i.e., CVC-Clinic and ETIS-Larib, are adopted to evaluate the performance of domain adaptation of our VideoGAN.
A popular and effective approach for implicit recommendation is to treat unobserved data as negative but downweight their confidence.
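A minimal sketch of this weighting scheme follows, assuming a squared-error objective in which observed entries have target 1 with confidence 1 and unobserved entries have target 0 with a uniform, smaller confidence. The function name, the `neg_weight` parameter, and the toy scores are hypothetical.

```python
def weighted_implicit_loss(scores, observed, neg_weight=0.1):
    """Confidence-weighted squared error over all user-item pairs.

    scores:     predicted scores, scores[u][i] for user u and item i
    observed:   set of (u, i) pairs with observed feedback
    neg_weight: assumed uniform confidence for unobserved pairs (< 1)
    """
    loss = 0.0
    for u, row in enumerate(scores):
        for i, score in enumerate(row):
            if (u, i) in observed:
                loss += (1.0 - score) ** 2          # positive, full confidence
            else:
                loss += neg_weight * score ** 2     # treated as negative, downweighted
    return loss


scores = [[0.9, 0.2], [0.4, 0.8]]
observed = {(0, 0), (1, 1)}
loss = weighted_implicit_loss(scores, observed)
unweighted = weighted_implicit_loss(scores, observed, neg_weight=1.0)
```

A uniform `neg_weight` is only the simplest choice; the confidence of each unobserved entry could instead depend on the user or the item.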
Specifically, we propose a cyclically-trained adversarial network to learn a mapping from image space to latent representation space and back such that the latent representation is invariant to a specified factor of variation (e.g., identity).
Reliable facial expression recognition plays a critical role in human-machine interactions.
Deep convolutional neural networks (ConvNets) have recently been shown to attain state-of-the-art performance for action recognition on standard-resolution videos.
In this paper, we propose to build Concept Bank, the largest concept library consisting of 4,876 concepts specifically designed to cover 631 real-world events.