Traditionally, research in automatic speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance.
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment.
The anchored segment refers to the wake-word portion of an audio stream; it carries valuable speaker information that can be used to suppress interfering speech and background noise.
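To make the idea concrete, here is a minimal, hypothetical sketch (not the method described above): a speaker vector is pooled from the anchored wake-word frames and used to down-weight stream frames that do not resemble the anchor speaker. The function names, feature dimensions, and the cosine-similarity soft mask are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def anchor_embedding(anchor_feats: torch.Tensor) -> torch.Tensor:
    """Average-pool LFBE frames of the wake-word segment into one vector.

    anchor_feats: (T_anchor, feat_dim) log filter-bank energies.
    """
    return anchor_feats.mean(dim=0)

def suppress_interference(stream_feats: torch.Tensor,
                          anchor_vec: torch.Tensor) -> torch.Tensor:
    """Scale each frame by its similarity to the anchor speaker.

    stream_feats: (T, feat_dim) features of the full audio stream.
    Returns re-weighted features of the same shape.
    """
    sims = F.cosine_similarity(stream_feats, anchor_vec.unsqueeze(0), dim=-1)
    weights = torch.sigmoid(sims)          # soft mask in (0, 1)
    return stream_feats * weights.unsqueeze(-1)

# Example: 20 anchor frames and 300 stream frames of 64-dim LFBE (random placeholders).
anchor = torch.randn(20, 64)
stream = torch.randn(300, 64)
filtered = suppress_interference(stream, anchor_embedding(anchor))
print(filtered.shape)  # torch.Size([300, 64])
```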
For real-world speech recognition applications, noise robustness is still a challenge.
We show that, with enough data, the LSTM model is indeed as capable of learning whisper characteristics from LFBE features alone as a simpler MLP model that uses both LFBE and features engineered to separate whispered from normal speech.
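The comparison can be sketched as follows. This is not the configuration used in the work above; the layer sizes, feature dimensions, and the choice of engineered features are assumptions made only to illustrate the two model families.

```python
import torch
import torch.nn as nn

class LstmWhisperDetector(nn.Module):
    """Binary whisper/normal classifier from LFBE frames alone."""
    def __init__(self, lfbe_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(lfbe_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, lfbe: torch.Tensor) -> torch.Tensor:
        # lfbe: (batch, time, lfbe_dim); classify from the final hidden state.
        _, (h_n, _) = self.lstm(lfbe)
        return self.out(h_n[-1]).squeeze(-1)   # logits, shape (batch,)

class MlpWhisperDetector(nn.Module):
    """Baseline taking pooled LFBE plus hand-engineered whisper features."""
    def __init__(self, lfbe_dim: int = 64, engineered_dim: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lfbe_dim + engineered_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, lfbe_mean: torch.Tensor, engineered: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([lfbe_mean, engineered], dim=-1)).squeeze(-1)

# Shape check only: 8 utterances, 200 frames of 64-dim LFBE, 10 engineered features.
lfbe = torch.randn(8, 200, 64)
print(LstmWhisperDetector()(lfbe).shape)                                # torch.Size([8])
print(MlpWhisperDetector()(lfbe.mean(dim=1), torch.randn(8, 10)).shape) # torch.Size([8])
```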
In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants.
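As a rough illustration of such a classifier (not the authors' architecture), the sketch below fuses an utterance-level acoustic embedding with a few hypothetical ASR-derived scalar features and outputs the probability that the query is device-directed. All dimensions and feature choices are placeholders.

```python
import torch
import torch.nn as nn

class DeviceDirectednessClassifier(nn.Module):
    def __init__(self, acoustic_dim: int = 128, asr_feat_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + asr_feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, acoustic_emb: torch.Tensor, asr_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate modalities, then map to P(device-directed).
        logit = self.net(torch.cat([acoustic_emb, asr_feats], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)

# Example batch: 4 utterances with assumed 128-dim acoustic embeddings and
# 4 ASR features (e.g. a confidence score); values here are random placeholders.
probs = DeviceDirectednessClassifier()(torch.randn(4, 128), torch.rand(4, 4))
print(probs.shape)  # torch.Size([4])
```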