In speech processing, keyword spotting deals with the identification of keywords in utterances.
(Image credit: Simon Grest)
Keyword Spotting (KWS) enables speech-based user interaction on smart devices.
While end-to-end learning has become a trend in deep learning, the model architecture is often designed to incorporate domain knowledge.
In addition, we release the implementation of the proposed and the baseline models including an end-to-end pipeline for training models and evaluating them on mobile devices.
Using Intel's Loihi neuromorphic research chip and ABR's Nengo Deep Learning toolkit, we analyze the inference speed, dynamic power consumption, and energy cost per inference of a two-layer neural network keyword spotter trained to recognize a single phrase.
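The three quantities named above are related by simple arithmetic: energy per inference is dynamic power divided by inference throughput. The sketch below only illustrates that relation. The function name and the numbers in the usage line are placeholders chosen for the example, not measurements from Loihi or any other hardware.

```python
def energy_per_inference_joules(dynamic_power_watts, inferences_per_second):
    """Energy cost of one inference: power (J/s) divided by throughput (1/s)."""
    if inferences_per_second <= 0:
        raise ValueError("inferences_per_second must be positive")
    return dynamic_power_watts / inferences_per_second


# Placeholder values for illustration only (not measured data).
example_energy = energy_per_inference_joules(
    dynamic_power_watts=0.1, inferences_per_second=50.0
)
```

Under this relation, a device that draws more power can still cost less energy per inference if its throughput is proportionally higher.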
We explore the application of end-to-end stateless temporal modeling to small-footprint keyword spotting as opposed to recurrent networks that model long-term temporal dependencies using internal states.
The problem of keyword spotting, i.e., identifying keywords in a real-time audio stream, is mainly solved by applying a neural network over successive sliding windows.
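The sliding-window scheme can be sketched in a few lines. This is a minimal illustration, not any paper's implementation: `energy_score` is a hypothetical stand-in for the neural network, and the window size, hop size, and threshold are assumed values.

```python
import numpy as np


def sliding_windows(signal, window_size, hop_size):
    """Yield (start_index, window) pairs over a 1-D signal."""
    for start in range(0, len(signal) - window_size + 1, hop_size):
        yield start, signal[start:start + window_size]


def energy_score(window):
    """Hypothetical stand-in for a neural network: mean signal energy."""
    return float(np.mean(window ** 2))


def spot_keywords(signal, window_size, hop_size, score_fn, threshold):
    """Return the start index of every window whose score exceeds the threshold."""
    return [
        start
        for start, window in sliding_windows(signal, window_size, hop_size)
        if score_fn(window) > threshold
    ]


# Synthetic one-second "stream" at an assumed 16 kHz sample rate,
# with a burst of activity between samples 8000 and 12000.
signal = np.zeros(16000)
signal[8000:12000] = 1.0

detections = spot_keywords(
    signal, window_size=4000, hop_size=2000, score_fn=energy_score, threshold=0.9
)
```

Here only the window that fully covers the burst passes the threshold. In a real system `score_fn` would be a trained classifier, and overlapping windows trade extra computation for lower detection latency.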
Overall, our robust, cross-device implementation for keyword spotting realizes a new paradigm for serving neural network applications, and one of our slim models reduces latency by 66% with a minimal decrease in accuracy of 4 percentage points, from 94% to 90%.
We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently released Google Speech Commands Dataset as our benchmark.
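Dilated convolutions are attractive for this task because they widen the receptive field without adding parameters. The sketch below computes the receptive field of a stack of stride-1 convolutions. The kernel size and dilation schedule in the usage lines are assumptions for illustration, not the configuration of the models described here.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input steps) of stacked stride-1 dilated convolutions.

    Each layer with dilation d extends the field by (kernel_size - 1) * d.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


# Assumed schedules: four layers with kernel size 3.
plain = receptive_field(3, [1, 1, 1, 1])    # no dilation
dilated = receptive_field(3, [1, 2, 4, 8])  # exponentially growing dilation
```

With the same four layers and the same number of weights, the dilated stack covers 31 input steps versus 9 for the undilated one.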