Speech-MLP: a simple MLP architecture for speech processing

29 Sep 2021  ·  Chao Xing, Dong Wang, LiRong Dai, Qun Liu, Anderson Avila

Overparameterized transformer-based architectures have shown remarkable performance in recent years, achieving state-of-the-art results in speech processing tasks such as speech recognition, speech synthesis, keyword spotting, and speech enhancement, among others. The main assumption is that, with the underlying self-attention mechanism, transformers can capture long-range temporal dependencies in speech signals. In this paper, we propose a multi-layer perceptron (MLP) architecture, namely speech-MLP, for extracting information from speech signals. The model splits feature channels into non-overlapping chunks and processes each chunk individually. The processed chunks are then merged and further processed to consolidate the output. By varying the number of chunks and the size of the contextual window, speech-MLP learns multiscale local temporal dependencies. The proposed model is evaluated on two tasks: keyword spotting and speech enhancement. In our experiments, we use two benchmark datasets for keyword spotting (Google Speech Commands V2-35 and LibriWords) and the VoiceBank dataset for speech enhancement. In all experiments, speech-MLP surpasses transformer-based solutions, achieving state-of-the-art performance with fewer parameters and a simpler training scheme. These results indicate that more complex models such as transformers are often unnecessary for speech processing tasks, and hence should not be considered the default option, since simpler and more compact models can offer optimal performance.
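
The chunk-wise processing described in the abstract can be sketched in a few lines. The block below is a minimal, hypothetical PyTorch rendering of that description, not the authors' implementation: the module name ChunkedTemporalMLP, the choice of a depth-wise 1-D convolution as the per-chunk local temporal mixer, and all parameter values are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ChunkedTemporalMLP(nn.Module):
    """Hypothetical sketch of the chunk-split-and-merge idea: channels are
    split into non-overlapping chunks, each chunk is mixed over a local
    temporal context, and the chunks are merged and consolidated."""

    def __init__(self, channels: int, num_chunks: int, context: int = 3):
        super().__init__()
        assert channels % num_chunks == 0, "channels must split evenly into chunks"
        chunk_dim = channels // num_chunks
        self.num_chunks = num_chunks
        # One local temporal mixer per chunk; a depth-wise 1-D convolution
        # stands in for the per-chunk MLP over a small context window
        # (context is assumed odd so the sequence length is preserved).
        self.chunk_mixers = nn.ModuleList([
            nn.Conv1d(chunk_dim, chunk_dim, kernel_size=context,
                      padding=context // 2, groups=chunk_dim)
            for _ in range(num_chunks)
        ])
        # Consolidate the merged chunks across all channels.
        self.merge = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        chunks = x.chunk(self.num_chunks, dim=-1)
        mixed = [m(c.transpose(1, 2)).transpose(1, 2)   # mix each chunk over time
                 for m, c in zip(self.chunk_mixers, chunks)]
        return self.merge(torch.cat(mixed, dim=-1))     # merge and consolidate


# Usage example (assumed shapes): 80 feature channels split into 4 chunks.
block = ChunkedTemporalMLP(channels=80, num_chunks=4, context=3)
out = block(torch.randn(2, 100, 80))   # (batch=2, time=100, channels=80)
```

Varying num_chunks and context across blocks is one plausible way to realize the multiscale local temporal modeling the abstract refers to.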
