Connectionist temporal classification (CTC) is a popular sequence prediction
approach for automatic speech recognition that is typically used with models
based on recurrent neural networks (RNNs). We explore whether deep
convolutional neural networks (CNNs) can be used effectively instead of RNNs as
the "encoder" in CTC. CNNs lack an explicit representation of the entire
sequence, but have the advantage that they are much faster to train. We present
an exploration of CNNs as encoders for CTC models, in the context of
character-based (lexicon-free) automatic speech recognition. In particular, we
explore a range of one-dimensional convolutional layers, which are particularly
efficient. We compare the performance of our CNN-based models against typical
RNN-based models in terms of training time, decoding time, model size and word
error rate (WER) on the Switchboard Eval2000 corpus. We find that our CNN-based
models are close in performance to LSTMs, while not matching them, and are much
faster to train and decode.
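
For readers unfamiliar with the evaluation metric, the following is a minimal
sketch of how WER is conventionally computed: the word-level edit distance
(substitutions, insertions and deletions) between hypothesis and reference,
divided by the number of reference words. The function name
`word_error_rate` and the whitespace tokenization are assumptions made for
illustration; this is not the scoring tool used for the Eval2000 results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    text normalization of the kind an official scoring pipeline may apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions are counted, WER under this definition can exceed 1.0 when
the hypothesis is much longer than the reference.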