We introduce new datasets from Twitter related to anti-Asian hate sentiment before and during the pandemic.
We propose a novel Mix Training (MT) strategy that encourages the model to discover low-energy keywords from noisy and mixed speech.
A comprehensive study was conducted to compare CN-Celeb-AV with two popular public AVPR benchmark datasets, and the results demonstrate that CN-Celeb-AV better reflects real-world scenarios and can serve as a new benchmark dataset for AVPR research.
SpeechFlow is a powerful factorization model based on information bottleneck (IB), and its effectiveness has been reported by several studies.
Recently, speech enhancement (SE) based on deep speech prior has attracted much attention, such as the variational auto-encoder with non-negative matrix factorization (VAE-NMF) architecture.
In this paper, we argue that this problem is largely attributed to the maximum-likelihood (ML) training criterion of the DNF model, which aims to maximize the likelihood of the observations but does not necessarily improve the Gaussianity of the latent codes.
Various information factors are blended in speech signals, which poses the primary difficulty for most speech information processing tasks.
Domain mismatch often occurs in real applications and causes serious performance degradation in speaker verification systems.
Domain generalization remains a critical problem for speaker recognition, even with the state-of-the-art architectures based on deep neural nets.
These datasets tend to deliver over-optimistic performance and do not meet the needs of research on speaker recognition in unconstrained conditions.
Speech signals are complex composites of various information, including phonetic content, speaker traits, channel effects, etc.
By enforcing the neural model to discriminate between the speakers in the training set, deep speaker embeddings (called `x-vectors`) can be derived from the hidden layers.
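The idea of deriving embeddings from a hidden layer of a speaker-discriminative network can be sketched as follows. This is a minimal illustrative sketch with hypothetical shapes and random weights, not the actual x-vector architecture: the classifier head is used only for training, while the pooled hidden activation serves as the utterance-level speaker embedding.

```python
import numpy as np

# Hypothetical dimensions: 40-dim frame features, 8-dim embedding, 5 speakers.
rng = np.random.default_rng(0)
D_FEAT, D_EMB, N_SPK = 40, 8, 5

W1 = rng.standard_normal((D_FEAT, D_EMB))  # hidden (embedding) layer
W2 = rng.standard_normal((D_EMB, N_SPK))   # speaker classification head

def forward(frames):
    """frames: (T, D_FEAT) array. Returns (speaker posteriors, embedding)."""
    h = np.tanh(frames @ W1)               # frame-level hidden activations
    emb = h.mean(axis=0)                   # temporal pooling -> utterance embedding
    logits = emb @ W2                      # classifier head, discarded at test time
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), emb

utt = rng.standard_normal((100, D_FEAT))   # a 100-frame utterance
posteriors, xvec = forward(utt)
```

At test time only `xvec` is kept; trials are then scored by comparing embeddings, e.g. with cosine distance or PLDA.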
This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian.
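The Gaussian constraint in (2) can be illustrated with a simple regularizer. This is a sketch of the general idea only, not the paper's exact objective: for speaker vectors assumed to follow N(0, I), the negative log-likelihood reduces (up to a constant) to half the mean squared norm, so minimizing it pulls the vectors toward a standard Gaussian.

```python
import numpy as np

# Illustrative Gaussian-constraint term (an assumption, not the paper's
# exact loss): 0.5 * E[||v||^2] is the N(0, I) negative log-likelihood
# of the speaker vectors, up to an additive constant.
def gaussian_constraint(speaker_vectors):
    return 0.5 * float((speaker_vectors ** 2).sum(axis=1).mean())

rng = np.random.default_rng(2)
near_gaussian = rng.standard_normal((1000, 16))        # roughly N(0, I)
far_from_gaussian = 5.0 * rng.standard_normal((1000, 16))  # inflated scale

# Vectors far from N(0, I) incur a larger penalty.
penalty_near = gaussian_constraint(near_gaussian)
penalty_far = gaussian_constraint(far_from_gaussian)
```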
This score reflects the similarity of the two frames in phonetic content, and is used to weigh the contribution of this frame pair in the utterance-based scoring.
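The weighting scheme described above can be sketched as follows. The function names and the choice of cosine similarity (clipped at zero) as the phonetic weight are assumptions for illustration, not the paper's exact formulation: each enrollment/test frame pair contributes a speaker score, weighted by how phonetically similar the two frames are, and the utterance score is the weighted average.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors, guarded against zero norms.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def utterance_score(enroll_spk, test_spk, enroll_phn, test_phn):
    """*_spk: per-frame speaker features; *_phn: per-frame phonetic features."""
    num = den = 0.0
    for i in range(len(enroll_spk)):
        for j in range(len(test_spk)):
            w = max(cos(enroll_phn[i], test_phn[j]), 0.0)  # phonetic similarity weight
            num += w * cos(enroll_spk[i], test_spk[j])     # frame-pair speaker score
            den += w
    return num / max(den, 1e-12)

rng = np.random.default_rng(1)
e_spk, t_spk = rng.standard_normal((5, 16)), rng.standard_normal((6, 16))
e_phn, t_phn = rng.standard_normal((5, 10)), rng.standard_normal((6, 10))
score = utterance_score(e_spk, t_spk, e_phn, t_phn)
```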
Various informative factors are mixed in speech signals, leading to great difficulty when decoding any of the factors.
Trivial events are ubiquitous in human-to-human conversations, e.g., coughing, laughing, and sniffing.
Recent studies have shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional and time-delay deep neural network (CT-DNN) model.
The experiments demonstrated that the feature-based system outperformed the i-vector system by a large margin, particularly under language mismatch between enrollment and test.
This principle has recently been applied to several prototype studies on speaker verification (SV), where the feature extractor and the classifier are learned jointly with an objective function consistent with the evaluation metric.
In this paper, we demonstrate that the speaker factor is also a short-time spectral pattern and can be largely identified from just a few frames using a simple deep neural network (DNN).
Recently deep neural networks (DNNs) have been used to learn speaker features.
Pure acoustic neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID).
Deep neural models, particularly the LSTM-RNN model, have shown great potential for language identification (LID).
PLDA is a popular normalization approach for the i-vector model, and it has delivered state-of-the-art performance in speaker verification.
In this paper, we propose a decision-making approach based on multiple scores derived from a set of cohort GMMs (cohort scores).
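One simple way to use cohort scores in a decision, sketched here as an assumption rather than the paper's exact rule, is to normalize the target model's score by the mean and standard deviation of the cohort scores and then apply a threshold to the normalized value.

```python
import math

# Illustrative cohort-based normalization (not the paper's exact decision
# rule): the target score is judged relative to the distribution of scores
# produced by a set of cohort models.
def cohort_normalized_score(target_score, cohort_scores):
    m = sum(cohort_scores) / len(cohort_scores)
    v = sum((s - m) ** 2 for s in cohort_scores) / len(cohort_scores)
    return (target_score - m) / (math.sqrt(v) + 1e-12)

def accept(target_score, cohort_scores, threshold=1.0):
    return cohort_normalized_score(target_score, cohort_scores) > threshold

cohort = [-1.2, -0.8, -1.0, -0.9]   # hypothetical cohort scores
well_separated = accept(0.5, cohort)    # far above the cohort distribution
indistinct = accept(-1.0, cohort)       # within the cohort distribution
```

Normalizing against the cohort makes the threshold less sensitive to per-trial score shifts than thresholding the raw score directly.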
We present the AP16-OL7 database, which was released as the training and test data for the oriental language recognition (OLR) challenge at APSIPA 2016.
This paper presents a unified model to perform language and speaker recognition simultaneously.
This paper presents a combination approach to the SUSR tasks with two phonetic-aware systems: one is the DNN-based i-vector system and the other is our recently proposed subregion-based GMM-UBM system.
Although highly correlated, speech and speaker recognition have been regarded as two independent tasks and studied by two communities.
The popular i-vector model represents speakers as low-dimensional continuous vectors (i-vectors), and hence it is a way of continuous speaker embedding.
Probabilistic linear discriminant analysis (PLDA) is a popular normalization approach for the i-vector model, and has delivered state-of-the-art performance in speaker recognition.
A deep learning approach has recently been proposed to derive speaker identities (d-vectors) with a deep neural network (DNN).
Recent research shows that deep neural networks (DNNs) can be used to extract deep speaker vectors (d-vectors) that preserve speaker characteristics and can be used in speaker verification.
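Once d-vectors are extracted, verification can reduce to a simple back-end comparison. A minimal sketch, assuming the extractor is given and using cosine scoring with a hypothetical threshold (PLDA is a common alternative back-end):

```python
import numpy as np

def cosine_score(enroll_dvec, test_dvec):
    # Cosine similarity between length-normalized d-vectors.
    e = enroll_dvec / np.linalg.norm(enroll_dvec)
    t = test_dvec / np.linalg.norm(test_dvec)
    return float(e @ t)

def verify(enroll_dvec, test_dvec, threshold=0.7):
    # Accept the trial if the score exceeds the threshold
    # (0.7 is an arbitrary illustrative value).
    return cosine_score(enroll_dvec, test_dvec) >= threshold

enroll = np.array([1.0, 2.0, 3.0])
same_direction = verify(enroll, 2.0 * enroll)          # score is exactly 1.0
different = verify(enroll, np.array([-3.0, 0.5, -1.0]))  # negative score
```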