no code implementations • 16 Sep 2024 • Nikhil Raghav, Avisek Gupta, Md Sahidullah, Swagatam Das
Spectral clustering has proven effective in grouping speech representations for speaker diarization tasks, although post-processing the affinity matrix remains difficult due to the need for careful tuning before constructing the Laplacian.
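The pipeline the abstract alludes to can be sketched in a few lines: build a cosine affinity over speaker embeddings, post-process it (the tuning-sensitive step), form the graph Laplacian, and k-means the bottom eigenvectors. This is a generic illustration, not the authors' method; the row-wise thresholding and the `p_keep` knob are assumptions for the sketch.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(embeddings, n_speakers, p_keep=0.3):
    """Plain spectral clustering of speaker embeddings (illustrative sketch)."""
    # Cosine affinity between L2-normalised embeddings.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = X @ X.T
    # Affinity post-processing: keep only the strongest p_keep fraction per
    # row (the step that needs careful tuning), then symmetrise.
    k = max(1, int(p_keep * A.shape[0]))
    row_thresh = np.sort(A, axis=1)[:, -k][:, None]
    A = np.where(A >= row_thresh, A, 0.0)
    A = np.maximum(A, A.T)
    # Unnormalised graph Laplacian L = D - A; its bottom eigenvectors embed
    # the segments so that k-means can separate the speakers.
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    _, labels = kmeans2(vecs[:, :n_speakers], n_speakers, minit='++')
    return labels
```
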
no code implementations • 16 Sep 2024 • Nikhil Raghav, Subhajit Saha, Md Sahidullah, Swagatam Das
In this report, we describe the speaker diarization (SD) and language diarization (LD) systems developed by our team for the Second DISPLACE Challenge, 2024.
no code implementations • 12 Sep 2024 • Shakeel A. Sheikh, Yacouba Kaloga, Md Sahidullah, Ina Kodrasi
Additionally, not all speech segments from PD patients exhibit clear dysarthric symptoms, introducing label noise that can negatively affect the performance and generalizability of current approaches.
no code implementations • 16 Aug 2024 • Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi
ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions.
no code implementations • 25 Jun 2024 • Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen
Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class.
no code implementations • 21 Mar 2024 • Nikhil Raghav, Md Sahidullah
Clustering speaker embeddings is crucial in speaker diarization, but it has not received as much attention as other components.
1 code implementation • 21 Mar 2024 • Subhajit Saha, Md Sahidullah, Swagatam Das
In contrast to existing methods that fine-tune SSL models and employ additional deep neural networks for downstream tasks, we exploit classical machine learning algorithms, such as logistic regression and shallow neural networks, on the SSL embeddings extracted from the pre-trained model.
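The classical back-end idea is simple enough to sketch: once SSL embeddings are extracted (assumed precomputed here), a plain logistic regression separates bonafide from spoofed speech. This is a minimal numpy stand-in for the kind of shallow classifier the abstract mentions, not the paper's exact configuration.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=500):
    """Logistic regression on fixed (e.g. SSL) embeddings.

    X: (n, d) embedding matrix; y: (n,) 0/1 labels (bonafide/spoof).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid scores
        grad_w = X.T @ (p - y) / len(y)          # cross-entropy gradient
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    # Decision at score threshold 0 (probability 0.5).
    return (X @ w + b > 0).astype(int)
```
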
Ranked #2 on Voice Anti-spoofing on ASVspoof 2019 - LA
1 code implementation • 23 Feb 2024 • Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen
One promising approach is to align vocal-tract parameters between adults and children through children-specific data augmentation, referred to here as ChildAugment.
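The flavour of vocal-tract-oriented augmentation can be illustrated with a simple frequency-axis warp of a magnitude spectrogram: warping upward mimics the shorter vocal tract of a child. The linear warp below is a crude, generic simplification for illustration only, not the ChildAugment recipe.

```python
import numpy as np

def vtlp_warp(spec, alpha):
    """Warp the frequency axis of a magnitude spectrogram by factor alpha.

    spec: (n_bins, n_frames) magnitude spectrogram.
    alpha > 1 shifts spectral content upward (child-like), alpha < 1 downward.
    """
    n_bins = spec.shape[0]
    # Each output bin reads from input bin f / alpha, with linear interpolation.
    src = np.clip(np.arange(n_bins) / alpha, 0, n_bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = src - lo
    return (1 - frac)[:, None] * spec[lo] + frac[:, None] * spec[hi]
```
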
no code implementations • 20 Jan 2024 • Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen
To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase.
no code implementations • 13 Jun 2023 • Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen
The first dataset, used for addressing short-term ageing (up to a 10-year time difference between enrollment and test) under uncontrolled conditions, is VoxCeleb.
no code implementations • 1 Jun 2023 • Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
The adoption of advanced deep learning architectures in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets.
no code implementations • 31 May 2023 • Hye-jin Shim, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen
Shortcut learning, or the `Clever Hans effect', refers to situations where a learning agent (e.g., a deep neural network) learns spurious correlations present in the data, resulting in biased models.
1 code implementation • 30 May 2023 • Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung
Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge.
no code implementations • 2 Mar 2023 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
Even though deep speaker models have demonstrated impressive accuracy in speaker verification tasks, this often comes at the expense of increased model size and computation time, presenting challenges for deployment in resource-constrained environments.
no code implementations • 21 Feb 2023 • Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech, resulting in an overall improvement of 4.48% in F1 over the single-context MB StutterNet.
no code implementations • 10 Feb 2023 • Spandan Dey, Md Sahidullah, Goutam Saha
Our experiments demonstrate that the proposed domain diversification is more promising than commonly used simple augmentation methods.
no code implementations • 14 Jan 2023 • Premjeet Singh, Md Sahidullah, Goutam Saha
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER).
no code implementations • 30 Nov 2022 • Spandan Dey, Md Sahidullah, Goutam Saha
In this work, we have conducted one of the very first attempts to present a comprehensive review of the Indian spoken language recognition research field.
no code implementations • 29 Nov 2022 • Premjeet Singh, Shefali Waldekar, Md Sahidullah, Goutam Saha
This work analyzes the constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER).
no code implementations • 2 Nov 2022 • Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Delgado, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, Luis Buera
This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge.
no code implementations • 20 Jul 2022 • Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
In this paper, we present end-to-end and speech embedding based systems trained in a self-supervised fashion to participate in the ACM Multimedia 2022 ComParE Challenge, specifically the stuttering sub-challenge.
1 code implementation • 30 Apr 2022 • Alexey Sholokhov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen
Speaker recognition on household devices, such as smart speakers, features several challenges: (i) robustness across a vast number of heterogeneous domains (households), (ii) short utterances, (iii) possibly absent speaker labels of the enrollment data (passive enrollment), and (iv) presence of unknown persons (guests).
no code implementations • 4 Apr 2022 • Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
By automatic detection and identification of stuttering, speech pathologists can track the progression of disfluencies of persons who stutter (PWS).
no code implementations • 4 Apr 2022 • Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
The adoption of advanced deep learning (DL) architecture in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets.
no code implementations • 21 Mar 2022 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
In this paper, we address the challenge of enhancing the spoofing robustness of the automatic speaker verification (ASV) system without the presence of a separate countermeasure module.
no code implementations • 10 Feb 2022 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
We consider different kinds of channel-dependent (CD) nonlinear compression methods optimized in a data-driven manner.
no code implementations • 21 Oct 2021 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs).
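The core idea is easy to sketch: average periodograms over several orthogonal DPSS (Slepian) tapers instead of using a single windowed DFT, which lowers the variance of the spectrum estimate that feeds the mel filterbank. The taper count and time-bandwidth product below are assumed values for illustration, not the paper's settings.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(frame, n_tapers=6, nw=4.0):
    """Multi-taper power spectrum estimate of one speech frame.

    Averages the periodograms obtained with orthogonal DPSS tapers,
    yielding a lower-variance estimate than a single windowed DFT.
    """
    tapers = dpss(len(frame), nw, n_tapers)                # (n_tapers, N)
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return spectra.mean(axis=0)                            # average over tapers
```

The resulting spectrum slots directly into a standard MFCC pipeline in place of the single-window periodogram, before mel filtering and the DCT.
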
no code implementations • 24 Sep 2021 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
We address far-field speaker verification with a deep neural network (DNN) based speaker embedding extractor, where the mismatch between enrollment and test data often comes from convolutive effects (e.g., room reverberation) and noise.
no code implementations • 24 Sep 2021 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted to other tasks, including speaker verification.
no code implementations • 1 Sep 2021 • Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado
In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection.
1 code implementation • 1 Sep 2021 • Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi
The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures.
no code implementations • 8 Jul 2021 • Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary pauses and repetition of sounds.
1 code implementation • 11 Jun 2021 • Tomi Kinnunen, Andreas Nautsch, Md Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee
Whether for results summarization or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity.
no code implementations • 25 May 2021 • Nirmalya Sen, Md Sahidullah, Hemant Patil, Shyamal Kumar Das Mandal, Sreenivasa Krothapalli Rao, Tapan Kumar Basu
This work presents a detailed experimental review and analysis of the GMM-SVM based speaker recognition system in the presence of duration variability.
no code implementations • 12 May 2021 • Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni
Compared to the existing work, which depends on the ASR module, our method relies solely on the acoustic signal.
Automatic Speech Recognition (ASR) +1
no code implementations • 11 May 2021 • Premjeet Singh, Goutam Saha, Md Sahidullah
We also investigate layer-wise scattering coefficients to analyse the importance of time shift and deformation stable scalogram and modulation spectrum coefficients for SER.
no code implementations • 10 May 2021 • Spandan Dey, Goutam Saha, Md Sahidullah
In this paper, we conduct one of the very first studies for cross-corpora performance evaluation in the spoken language identification (LID) problem.
no code implementations • 26 Mar 2021 • Bhusan Chettri, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen
Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample or as a spoofing attack (e.g., a synthetic or replayed sample).
no code implementations • 20 Feb 2021 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification.
no code implementations • 11 Feb 2021 • Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee
The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV).
no code implementations • 8 Feb 2021 • Premjeet Singh, Goutam Saha, Md Sahidullah
In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER).
no code implementations • 3 Feb 2021 • Achintya Kumar Sarkar, Md Sahidullah, Zheng-Hua Tan
In this paper, we propose a novel method that trains pass-phrase specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV).
no code implementations • 25 Jan 2021 • A Kishore Kumar, Shefali Waldekar, Goutam Saha, Md Sahidullah
This report presents the system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge.
no code implementations • 30 Jul 2020 • Xuechen Liu, Md Sahidullah, Tomi Kinnunen
Modern automatic speaker verification relies largely on deep neural networks (DNNs) trained on mel-frequency cepstral coefficient (MFCC) features.
no code implementations • 26 Jul 2020 • Md Sahidullah, Achintya Kumar Sarkar, Ville Vestman, Xuechen Liu, Romain Serizel, Tomi Kinnunen, Zheng-Hua Tan, Emmanuel Vincent
Our primary submission to the challenge is the fusion of seven subsystems, which yields a normalized minimum detection cost function (minDCF) of 0.072 and an equal error rate (EER) of 2.14% on the evaluation set.
no code implementations • 21 Jul 2020 • Susanta Sarangi, Md Sahidullah, Goutam Saha
Then, we propose a new method for computing the filter frequency responses by using principal component analysis (PCA).
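A data-driven filter design of this kind can be loosely sketched with standard PCA: the top principal directions of a collection of magnitude spectra serve as filter frequency responses. This is a generic illustration of the idea, not the paper's exact procedure.

```python
import numpy as np

def pca_filterbank(spectra, n_filters=20):
    """Derive filter frequency responses from data via PCA.

    spectra: (n_frames, n_bins) magnitude spectra of training speech.
    Returns an (n_filters, n_bins) array of responses: the top principal
    directions of the bin-by-bin covariance.
    """
    X = spectra - spectra.mean(axis=0)      # centre each frequency bin
    cov = X.T @ X / (len(X) - 1)            # bin-by-bin covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    return vecs[:, ::-1][:, :n_filters].T   # top components as filters
```

Because they are eigenvectors of a symmetric matrix, the resulting responses are orthonormal, unlike the overlapping triangular filters of a mel filterbank.
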
no code implementations • 12 Jul 2020 • Tomi Kinnunen, Héctor Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds
Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs.
no code implementations • 10 Nov 2019 • Brij Mohan Lal Srivastava, Nathalie Vauquier, Md Sahidullah, Aurélien Bellet, Marc Tommasi, Emmanuel Vincent
In this paper, we investigate anonymization methods based on voice conversion.
Automatic Speech Recognition (ASR) +4
no code implementations • 6 Nov 2019 • Md Sahidullah, Jose Patino, Samuele Cornell, Ruiqing Yin, Sunit Sivasankaran, Hervé Bredin, Pavel Korshunov, Alessio Brutti, Romain Serizel, Emmanuel Vincent, Nicholas Evans, Sébastien Marcel, Stefano Squartini, Claude Barras
This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team.
no code implementations • 5 Nov 2019 • Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, Zhen-Hua Ling
Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques.
no code implementations • 3 Jun 2019 • Ville Vestman, Tomi Kinnunen, Rosa González Hautamäki, Md Sahidullah
Our goal is to gain insights into how well similarity rankings transfer from the attacker's ASV system to the attacked ASV system, whether the attackers are able to improve their attacks by mimicking, and how the properties of the attackers' voices change due to mimicking.
no code implementations • 16 Apr 2019 • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, Trung Ngo Trong, Md Sahidullah, Fan Lu, Yun Tang, Ming Tu, Kah Kuan Teh, Huy Dat Tran, Kuruvachan K. George, Ivan Kukanov, Florent Desnous, Jichen Yang, Emre Yilmaz, Longting Xu, Jean-Francois Bonastre, Cheng-Lin Xu, Zhi Hao Lim, Eng Siong Chng, Shivesh Ranjan, John H. L. Hansen, Massimiliano Todisco, Nicholas Evans
The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE).
no code implementations • 29 Jan 2019 • Arnab Poddar, Md Sahidullah, Goutam Saha
We have used the proposed quality measures as side information for combining ASV systems based on Gaussian mixture model-universal background model (GMM-UBM) and i-vector.
no code implementations • 4 Jan 2019 • Md Sahidullah, Hector Delgado, Massimiliano Todisco, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Kong-Aik Lee
Over the past few years significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker verification (ASV).
no code implementations • 3 Dec 2018 • Arnab Poddar, Md Sahidullah, Goutam Saha
In experiments with the NIST SRE 2008 corpus, we show that the inclusion of the proposed quality metric yields considerable improvement in speaker verification performance.
no code implementations • 9 Nov 2018 • Tomi Kinnunen, Rosa González Hautamäki, Ville Vestman, Md Sahidullah
We consider technology-assisted mimicry attacks in the context of automatic speaker verification (ASV).
1 code implementation • 25 Apr 2018 • Tomi Kinnunen, Kong Aik Lee, Hector Delgado, Nicholas Evans, Massimiliano Todisco, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds
The two challenge editions in 2015 and 2017 involved the assessment of spoofing countermeasures (CMs) in isolation from ASV using an equal error rate (EER) metric.
no code implementations • 22 Dec 2016 • Monisankha Pal, Dipjyoti Paul, Md Sahidullah, Goutam Saha
Most of the existing studies on voice conversion (VC) are conducted in acoustically matched conditions between source and target signal.