no code implementations • 8 Feb 2022 • Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, Panayiotis Georgiou
By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embeddings competitive with existing audio representation models in only a few hours of training time.
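The contrastive alignment described above can be illustrated with a symmetric InfoNCE-style objective over paired audio and text embeddings. This is a minimal sketch assuming CLIP-style batch contrastive training; the function name, matrix shapes, and temperature value are illustrative and not taken from the CALM paper.

```python
import numpy as np

def info_nce_loss(audio_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss aligning paired audio and text
    embeddings. Row i of each matrix is assumed to be a matched pair;
    all other rows in the batch act as negatives (a sketch, not the
    paper's exact objective)."""
    # L2-normalise so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (N, N) similarity matrix
    diag = np.arange(len(a))                # matched pairs on the diagonal

    def xent(l):
        # cross-entropy of the diagonal (positive) entries
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # average the audio->text and text->audio directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while unrelated pairs yield a loss near log(batch size), which is what makes the objective usable for bootstrapping audio embeddings from a frozen text encoder.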
no code implementations • 9 Oct 2021 • Ognjen Rudovic, Akanksha Bindal, Vineet Garg, Pramod Simha, Pranay Dighe, Sachin Kajarekar
When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device.
no code implementations • 18 Jun 2021 • Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jeffrey Bigham
Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work.
no code implementations • 24 Feb 2021 • Colin Lea, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, Jeffrey P. Bigham
The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns.
no code implementations • 20 Oct 2020 • Pranay Dighe, Erik Marchi, Srikanth Vishnubhotla, Sachin Kajarekar, Devang Naik
But in the case of a false trigger, transcribing the audio using ASR itself is strongly undesirable.
no code implementations • 3 Aug 2020 • Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright, Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar
The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face.
no code implementations • 25 Apr 2020 • Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz
One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications.
no code implementations • 31 Jan 2020 • Vasudha Kowtha, Vikramjit Mitra, Chris Bartels, Erik Marchi, Sue Booker, William Caruso, Sachin Kajarekar, Devang Naik
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
no code implementations • 26 Jan 2020 • Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John Bridle
We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch is trained to label the input sequence with the correct speaker label.
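The multi-task setup above combines a phonetic CTC loss with a speaker cross-entropy loss over shared encoder features. The sketch below implements the standard CTC forward (alpha) dynamic program and a weighted sum of the two losses; the function names, the weighting scheme, and the loss weight `w` are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def _logsumexp(xs):
    m = max(xs)
    if m == -np.inf:
        return -np.inf
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_loss(frame_logits, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC, via the
    standard forward (alpha) recursion over the label sequence
    extended with blanks. frame_logits: (T, C); labels: list of ints."""
    logp = log_softmax(frame_logits, axis=-1)
    ext = [blank]
    for l in labels:
        ext += [l, blank]                   # interleaved blanks: length 2L+1
    T, S = len(frame_logits), len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = logp[0, blank]
    if S > 1:
        alpha[0, 1] = logp[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]                     # stay
            if s > 0:
                terms.append(alpha[t - 1, s - 1])         # advance one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])         # skip a blank
            alpha[t, s] = logp[t, ext[s]] + _logsumexp(terms)
    # valid paths end on the last label or the trailing blank
    return -_logsumexp([alpha[T - 1, S - 1], alpha[T - 1, S - 2]])

def multitask_loss(frame_logits, phone_labels, spk_logits, spk_id, w=0.5):
    """Weighted sum of the phonetic CTC loss and the speaker
    cross-entropy loss (hypothetical equal weighting)."""
    spk_logp = log_softmax(spk_logits)
    return w * ctc_loss(frame_logits, phone_labels) - (1 - w) * spk_logp[spk_id]
```

With uniform frame posteriors over 3 classes and a single-symbol label over 2 frames, the three valid alignments sum to probability 1/3, so the CTC term is log 3 — a useful sanity check when wiring up such a head.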