Speaker verification is the task of verifying the identity of a person from characteristics of the voice.
These systems are explored for non-native spoken English data in this paper.
The first approach is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor.
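One way to read "soft selection" is as a weighted average in which each frame contributes in proportion to its speech probability, rather than being kept or dropped outright. The sketch below illustrates that reading only; the function name `soft_vad_mean`, the use of a weighted mean, and the toy inputs are assumptions made for illustration and are not taken from the system described above.

```python
from typing import List, Sequence


def soft_vad_mean(
    frames: Sequence[Sequence[float]],
    speech_probs: Sequence[float],
) -> List[float]:
    """Pool frame-level features with speech probabilities as soft weights.

    Hypothetical helper: a hard VAD would use weights of exactly 0 or 1,
    whereas here any value in [0, 1] is allowed.
    """
    if len(frames) != len(speech_probs):
        raise ValueError("need exactly one speech probability per frame")
    total = sum(speech_probs)
    if total <= 0.0:
        raise ValueError("speech probabilities sum to zero")

    dim = len(frames[0])
    pooled = [0.0] * dim
    for frame, prob in zip(frames, speech_probs):
        for d in range(dim):
            pooled[d] += prob * frame[d]
    # Normalise by the total speech mass so the result is a weighted mean.
    return [value / total for value in pooled]


# Toy input: the third frame is treated as non-speech and is ignored.
frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
speech_probs = [1.0, 1.0, 0.0]
print(soft_vad_mean(frames, speech_probs))  # [2.0, 3.0]
```

With probabilities restricted to 0 and 1 this reduces to ordinary hard frame selection followed by mean pooling, which is why the weighted form can be seen as its soft counterpart.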
Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks.
By training the neural model to discriminate between the speakers in the training set, deep speaker embeddings (called x-vectors) can be derived from the hidden layers.
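The idea of reading an embedding from a hidden layer of a speaker classifier can be sketched as follows. This is a toy illustration under stated assumptions, not an x-vector implementation: the class `TinySpeakerNet`, its layer sizes, the mean pooling over frames, and the random untrained weights are all hypothetical, and a real system would train the weights with a speaker classification loss.

```python
import random
from typing import List, Sequence


def linear(x: Sequence[float], weights, bias) -> List[float]:
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a variable number of frames into one fixed-size vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


class TinySpeakerNet:
    """Hypothetical two-layer speaker classifier with random (untrained) weights."""

    def __init__(self, feat_dim: int, emb_dim: int, n_speakers: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_emb = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(emb_dim)]
        self.b_emb = [0.0] * emb_dim
        self.w_cls = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(n_speakers)]
        self.b_cls = [0.0] * n_speakers

    def embed(self, frames: Sequence[Sequence[float]]) -> List[float]:
        # The hidden-layer output is what gets used as the speaker embedding.
        return linear(mean_pool(frames), self.w_emb, self.b_emb)

    def speaker_logits(self, frames: Sequence[Sequence[float]]) -> List[float]:
        # The output layer is needed only while training on speaker labels.
        return linear(relu(self.embed(frames)), self.w_cls, self.b_cls)


net = TinySpeakerNet(feat_dim=3, emb_dim=4, n_speakers=5)
utterance = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.2], [0.2, 0.3, 0.4]]
print(len(net.embed(utterance)), len(net.speaker_logits(utterance)))  # 4 5
```

The point of the sketch is the split between `speaker_logits`, which exists to discriminate training speakers, and `embed`, which is kept at test time and yields a fixed-size vector regardless of utterance length.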
Learning a good speaker embedding is important for many automatic speaker recognition tasks, including verification, identification and diarization.
Both improvements are based on triplets, because the training stage and the evaluation stage of the baseline x-vector system focus on different aims.
In training our speaker verification framework, we consider both triplet loss minimization and the adversarial gradient of the ASR network to obtain more discriminative and text-independent speaker embedding vectors.
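A minimal sketch of the two training signals is given below. The triplet loss follows the standard margin form; the squared Euclidean distance, the margin value, the helper `combine_gradients`, and the weight `adv_weight` are assumptions for illustration. In particular, subtracting a scaled ASR gradient is only one plausible reading of "adversarial gradient" (in the spirit of gradient reversal) and is not confirmed by the text above.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,
) -> float:
    """Margin-based triplet loss on speaker embeddings.

    The loss is zero once the same-speaker pair is closer than the
    different-speaker pair by at least `margin`.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def combine_gradients(
    triplet_grad: Sequence[float],
    asr_grad: Sequence[float],
    adv_weight: float = 0.1,
) -> List[float]:
    """Hypothetical update direction for the embedding network.

    The ASR gradient enters with a negative sign, so a descent step lowers
    the triplet loss while making the embedding less useful to the ASR branch.
    """
    return [g_t - adv_weight * g_a for g_t, g_a in zip(triplet_grad, asr_grad)]


# Easy triplet: the negative is already far away, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))  # 0.0
# Hard triplet: positive and negative are equally distant from the anchor.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0]))  # 0.2
```

Under this reading, the triplet term pushes embeddings of the same speaker together and those of different speakers apart, while the reversed ASR term discourages the embedding from encoding what was said.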
The experimental evaluation compares voices converted by the proposed method, which does not use the target speaker's voice data, with those produced by standard VC, which does use that data.