End-to-End Speech Recognition Using Connectionist Temporal Classification
Speech recognition on large-vocabulary and noisy corpora is challenging for computers. Recent advances have enabled speech recognition systems to be trained end-to-end instead of relying on complex recognition pipelines, a powerful way to train neural networks that reduces the complexity of the overall system. This thesis describes the development of such an end-to-end trained speech recognition system. It utilizes the connectionist temporal classification (CTC) cost function to evaluate the alignment between an audio signal and a predicted transcription. Multiple variants of this system with different feature representations, increasing network depth, and changing recurrent neural network (RNN) cell types are evaluated. Results show that the use of convolutional input layers is advantageous compared to dense ones. They further suggest that the number of recurrent layers has a significant impact on the results, with the more complex long short-term memory (LSTM) cells outperforming the faster basic RNN cells when it comes to learning from long input sequences. The initial version of the speech recognition system achieves a word error rate (WER) of 27.3%, while the various upgrades reduce it to 12.6%. All systems are trained end-to-end; therefore, no external language model is used.
- Marc Dangschat
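
To illustrate the CTC cost function mentioned in the abstract, the following sketch computes the CTC probability of a labelling via the standard forward (alpha) recursion over a blank-extended label sequence. The function name, variable names, and toy inputs are hypothetical illustrations, not taken from the thesis.

```python
# Illustrative sketch of the CTC forward algorithm: computes P(label | x)
# by summing over all frame-level alignments that collapse to the label,
# using the alpha recursion over a blank-extended label sequence.
# Names and the toy example below are hypothetical, not from the thesis.

BLANK = 0  # index of the CTC blank symbol

def ctc_forward_prob(probs, label):
    """probs[t][k]: per-frame softmax output for symbol k at frame t.
    label: target symbol sequence (without blanks).
    Returns P(label | x) under the CTC alignment model."""
    ext = [BLANK]  # blank-extended label: blank, l1, blank, l2, ..., blank
    for sym in label:
        ext += [sym, BLANK]
    T, S = len(probs), len(ext)
    # alpha[s]: total probability of alignment prefixes ending in ext[s]
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                       # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]              # advance by one position
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]              # skip the blank in between
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # a valid alignment ends on the last symbol or the trailing blank
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)

# Toy example: 2 frames, vocabulary {blank, 'a'}, uniform outputs.
# The alignments (a, blank), (blank, a), (a, a) all collapse to "a",
# so the CTC probability is 3 * 0.25 = 0.75.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_forward_prob(uniform, [1]))  # 0.75
```

Practical implementations work in log space (or use a framework-provided loss such as PyTorch's `nn.CTCLoss`) for numerical stability, since these products of probabilities underflow on realistic sequence lengths.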