Convolutional neural networks (CNNs) with convolutional and pooling
operations along the frequency axis have been proposed to attain invariance to
frequency shifts of features. However, this is at odds with the fact that
acoustic features vary in frequency.
In this paper, we contend that
convolution along the time axis is more effective. We also propose the addition
of an intermap pooling (IMP) layer to deep CNNs. In this layer, the filters in
each group extract common but spectrally variant features; the layer then pools
the feature maps within each group. As a result, the proposed IMP CNN can achieve
insensitivity to spectral variations characteristic of different speakers and
utterances. The effectiveness of the IMP CNN architecture is demonstrated on
several large-vocabulary continuous speech recognition (LVCSR) tasks. Even
without speaker adaptation techniques, the architecture achieved a word error
rate (WER) of 12.7% on the Switchboard (SWB) portion of the Hub5'2000
evaluation set, which is competitive with other state-of-the-art methods.
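As a minimal sketch, the intermap pooling operation described above can be
illustrated in plain Python. The function name, the list-based representation of
feature maps, and the choice of elementwise max as the pooling operator are
illustrative assumptions, not taken verbatim from the paper:

```python
def intermap_pool(feature_maps, group_size):
    # feature_maps: a list of 2D feature maps (each a list of rows of floats),
    # e.g. the outputs of a convolutional layer. Maps are partitioned into
    # consecutive groups of `group_size`; within each group an elementwise max
    # produces one pooled map, so spectrally shifted variants of the same
    # feature detected by different filters activate the same output map.
    assert len(feature_maps) % group_size == 0, "maps must divide into groups"
    pooled = []
    for g in range(0, len(feature_maps), group_size):
        group = feature_maps[g:g + group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([[max(m[r][c] for m in group) for c in range(cols)]
                       for r in range(rows)])
    return pooled

# Four 2x2 maps pooled in groups of two -> two output maps.
maps = [[[1, 0], [0, 0]],
        [[0, 2], [0, 0]],
        [[3, 0], [0, 0]],
        [[0, 0], [0, 4]]]
out = intermap_pool(maps, group_size=2)
# out[0] == [[1, 2], [0, 0]]; out[1] == [[3, 0], [0, 4]]
```

Because the max is taken across maps rather than across time or frequency
positions, the output is insensitive to which filter in a group fired, which is
how the layer absorbs spectral variation across speakers and utterances.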