Speaker verification is the task of verifying a person's identity from characteristics of their voice.
Learning a good speaker embedding is important for many automatic speaker recognition tasks, including verification, identification and diarization.
Both improvements are based on triplet loss, because the training stage and the evaluation stage of the baseline x-vector system focus on different aims.
In training our speaker verification framework, we consider both the triplet loss minimization and adversarial gradient of the ASR network to obtain more discriminative and text-independent speaker embedding vectors.
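As a rough illustration of the triplet loss term mentioned above, the following is a minimal sketch in plain Python. It assumes a squared Euclidean distance and a margin of 0.2; neither choice is stated in the source, and the function names are hypothetical. The adversarial ASR gradient is not modeled here.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    anchor and positive are embeddings of the same speaker; negative is an
    embedding of a different speaker. The distance metric and the margin
    value are assumptions made for illustration only.
    """
    d_pos = squared_euclidean(anchor, positive)
    d_neg = squared_euclidean(anchor, negative)
    # The loss is zero once the negative is farther away than the positive
    # by at least the margin.
    return max(0.0, d_pos - d_neg + margin)
```

The loss vanishes when the different-speaker embedding is already farther from the anchor than the same-speaker embedding by at least the margin, so only triples that violate the margin contribute a gradient.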
The experimental evaluation compares voices converted by the proposed method, which does not use the target speaker's voice data, with those converted by standard VC, which does use that data.
In particular, we achieve a 3000-fold speedup over real time in frame posterior computation and a 25-fold speedup in training the i-vector extractor compared to the CPU baseline from the Kaldi toolkit.
Furthermore, we apply deep length normalization by augmenting the loss function with ring loss.
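A minimal sketch of how a ring loss term can be added to a primary loss is given below. It assumes the commonly used form of ring loss, a penalty of weight/(2m) times the sum of squared differences between each embedding's L2 norm and a target norm R. In practice R is a learned parameter; here it is passed in as a fixed number. The weight value and the function names are hypothetical and not taken from the source.

```python
import math


def l2_norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def ring_loss(embeddings, target_norm, weight=0.01):
    """Penalty that pulls every embedding norm toward target_norm.

    Computes weight / (2 * m) * sum((||e||_2 - target_norm) ** 2) over the
    m embeddings in the batch. The default weight is an assumed value.
    """
    m = len(embeddings)
    penalty = sum((l2_norm(e) - target_norm) ** 2 for e in embeddings)
    return weight * penalty / (2 * m)


def augmented_loss(primary_loss, embeddings, target_norm, weight=0.01):
    """Primary loss (for example, softmax cross-entropy) plus ring loss."""
    return primary_loss + ring_loss(embeddings, target_norm, weight)
```

Because the penalty is zero only when all embeddings have norm equal to the target, minimizing it encourages length-normalized embeddings during training rather than as a separate post-processing step.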
Finally, an a posteriori SNR-weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity.