In this paper, we focus on the task of small-footprint keyword spotting under the far-field scenario.
Despite the recent successes of deep neural networks, it remains challenging to achieve high precision keyword spotting task (KWS) on resource-constrained devices.
To improve the robustness, a speech enhancement front-end is involved.
In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality KWS system.
We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components.
Mainly for the sake of solving the lack of keyword-specific data, we propose one Keyword Spotting (KWS) system using Deep Neural Network (DNN) and Connectionist Temporal Classifier (CTC) on power-constrained small-footprint mobile devices, taking full advantage of general corpus from continuous speech recognition which is of great amount.
Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields $67. 6\%$ relative reduction compared to baseline feed-forward DNN in Area Under the Curve (AUC) measure.
Keyword spotting (KWS) constitutes a major component of human-technology interfaces.