no code implementations • 31 Aug 2023 • Alexandre Bittar, Paul Dixon, Mohammad Samragh, Kumari Nishu, Devang Naik
Using a vision-inspired keyword spotting framework, we propose an architecture with input-dependent dynamic depth capable of processing streaming audio.
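The abstract only names the technique, but input-dependent dynamic depth is commonly realized as an early-exit stack: a lightweight head at each layer decides whether the current chunk is already classified confidently enough to stop. Below is a minimal, hypothetical PyTorch sketch of that pattern; the class name, the per-layer exit heads, and the confidence threshold are illustrative assumptions, not the paper's architecture.

```python
# Minimal early-exit sketch of input-dependent dynamic depth for streaming
# keyword spotting. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class DynamicDepthKWS(nn.Module):
    def __init__(self, dim=64, num_layers=6, num_keywords=10, exit_threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        # One classifier head per layer so the model can exit at any depth.
        self.heads = nn.ModuleList(nn.Linear(dim, num_keywords) for _ in range(num_layers))
        self.exit_threshold = exit_threshold

    def forward(self, x):
        # x: (1, time, dim) features for one streaming chunk (batch of one).
        for layer, head in zip(self.layers, self.heads):
            x = layer(x)
            logits = head(x.mean(dim=1))      # pool over time
            # Exit early once the current depth is already confident, so
            # easy inputs consume fewer layers than hard ones.
            if logits.softmax(dim=-1).max() >= self.exit_threshold:
                return logits
        return logits

model = DynamicDepthKWS()
chunk = torch.randn(1, 50, 64)   # 50 frames of 64-dim features
print(model(chunk).shape)        # torch.Size([1, 10])
```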
no code implementations • 12 Aug 2023 • Kumari Nishu, Minsik Cho, Paul Dixon, Devang Naik
Spotting user-defined/flexible keywords represented as text typically relies on an expensive text encoder analyzed jointly with an audio encoder in a shared embedding space, an approach that can suffer from heterogeneous modality representations (i.e., a large mismatch) and increased complexity.
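For context, here is a minimal sketch of the joint audio-text embedding setup the abstract describes as the costly baseline: a text encoder embeds the enrolled keyword, an audio encoder embeds incoming speech, and cosine similarity in the shared space decides a match. The encoder internals, names, and detection threshold are assumptions for illustration, not the paper's method.

```python
# Sketch of text-enrolled (flexible) keyword spotting in a shared embedding
# space. Both encoders are placeholders, not the paper's architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    def __init__(self, feat_dim=40, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        _, h = self.rnn(feats)
        return F.normalize(self.proj(h[-1]), dim=-1)

class TextEncoder(nn.Module):
    def __init__(self, vocab=128, embed_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, char_ids):               # char_ids: (batch, chars)
        return F.normalize(self.proj(self.emb(char_ids).mean(dim=1)), dim=-1)

audio_enc, text_enc = AudioEncoder(), TextEncoder()
keyword = torch.tensor([[ord(c) for c in "hey siri"]])    # crude character ids
speech = torch.randn(1, 100, 40)                          # ~1 s of features
score = (audio_enc(speech) * text_enc(keyword)).sum(-1)   # cosine similarity
print(score > 0.5)   # illustrative detection threshold
```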
no code implementations • 27 May 2020 • Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, Sachin Kajareker
We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout.
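Modality dropout here means randomly silencing one input stream during training so the model cannot over-rely on either modality. A minimal sketch under that reading follows; the drop probabilities and tensor shapes are illustrative assumptions.

```python
# Sketch of modality dropout for audiovisual training: per training step,
# zero out the audio or the video stream at random. Probabilities and
# feature shapes are illustrative, not from the paper.
import torch

def modality_dropout(audio, video, p_drop_audio=0.15, p_drop_video=0.15, training=True):
    """Randomly silence one modality per batch during training."""
    if training:
        r = torch.rand(1).item()
        if r < p_drop_audio:
            audio = torch.zeros_like(audio)    # audio-dropped step
        elif r < p_drop_audio + p_drop_video:
            video = torch.zeros_like(video)    # video-dropped step
    return audio, video

audio = torch.randn(8, 100, 40)    # (batch, frames, audio features)
video = torch.randn(8, 100, 512)   # (batch, frames, visual features)
audio, video = modality_dropout(audio, video)
```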
no code implementations • 15 May 2019 • Ahmed Hussen Abdelaziz, Barry-John Theobald, Justin Binder, Gabriele Fanelli, Paul Dixon, Nicholas Apostoloff, Thibaut Weise, Sachin Kajareker
We conclude that visual speech synthesis can significantly benefit from the powerful representation of speech in the ASR acoustic models.
Automatic Speech Recognition (ASR) +3
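A minimal sketch of the idea of reusing an ASR acoustic model's representation to drive visual speech synthesis: frozen ASR layers provide per-frame speech features, and a small decoder regresses facial animation parameters. The LSTM stand-in, the blendshape count, and all names are assumptions, not the paper's model.

```python
# Sketch: drive facial animation from a (frozen) ASR acoustic model's
# internal representation. Layer sizes and the 51-blendshape output are
# illustrative assumptions.
import torch
import torch.nn as nn

class ASRAcousticModel(nn.Module):
    """Stand-in for a pretrained ASR encoder; kept frozen in practice."""
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.net = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, feats):                  # (batch, time, feat_dim)
        out, _ = self.net(feats)
        return out                             # (batch, time, hidden)

class FaceDecoder(nn.Module):
    def __init__(self, hidden=256, num_blendshapes=51):
        super().__init__()
        self.head = nn.Linear(hidden, num_blendshapes)

    def forward(self, asr_features):
        return self.head(asr_features)         # per-frame animation params

asr = ASRAcousticModel().eval()
for p in asr.parameters():                     # keep the ASR encoder frozen
    p.requires_grad_(False)
decoder = FaceDecoder()

speech = torch.randn(1, 200, 40)               # ~2 s of acoustic features
with torch.no_grad():
    rep = asr(speech)
anim = decoder(rep)                            # (1, 200, 51) blendshapes
```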