Combining Residual Networks with LSTMs for Lipreading

12 Mar 2017 · Themos Stafylakis, Georgios Tzimiropoulos

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
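To make the three-stage design concrete, here is a minimal shape walk-through in plain Python. It is a sketch, not the authors' implementation: the abstract names only the three components and the 500 target words. Everything else is an assumption chosen for illustration, including the 112x112 grayscale input, the 5x7x7 convolution with 64 output channels, the pooling window, the per-frame feature size `feat_dim`, the LSTM width `hidden`, and the idea that one score per word is produced from the sequence. The helper names `conv_out` and `trace_shapes` are hypothetical.

```python
NUM_WORDS = 500  # from the abstract: 500 target words


def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(frames, height=112, width=112, feat_dim=512, hidden=256):
    """Return (stage, shape) pairs for one clip under the assumed settings."""
    shapes = [("input clip (frames, channels, H, W)", (frames, 1, height, width))]

    # Stage 1 (assumed settings): one spatiotemporal (3D) convolution that
    # keeps the frame count and halves the spatial size, then spatial pooling.
    t = conv_out(frames, kernel=5, stride=1, padding=2)
    h = conv_out(height, kernel=7, stride=2, padding=3)
    w = conv_out(width, kernel=7, stride=2, padding=3)
    h = conv_out(h, kernel=3, stride=2, padding=1)
    w = conv_out(w, kernel=3, stride=2, padding=1)
    shapes.append(("3D conv + pool (frames, channels, H, W)", (t, 64, h, w)))

    # Stage 2: a residual network applied to each time step, giving one
    # feature vector per frame (feat_dim is an assumed size).
    shapes.append(("ResNet per time step (frames, features)", (t, feat_dim)))

    # Stage 3: a bidirectional LSTM; forward and backward hidden states are
    # concatenated, so the feature size is 2 * hidden.
    shapes.append(("Bi-LSTM (frames, features)", (t, 2 * hidden)))

    # Output: one score per target word for the whole clip.
    shapes.append(("word scores", (NUM_WORDS,)))
    return shapes


if __name__ == "__main__":
    # 30 frames is an arbitrary illustrative clip length.
    for stage, shape in trace_shapes(frames=30):
        print(f"{stage}: {shape}")
```

The point of the trace is the division of labour: the 3D convolution captures short-range motion across neighbouring frames, the residual network summarises each time step, and the bidirectional LSTM models the whole sequence in both directions before a word is chosen.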


Datasets


Task: Lipreading
Dataset: Lip Reading in the Wild
Model: 3D Conv + ResNet-34 + Bi-LSTM
Metric Name: Top-1 Accuracy
Metric Value: 83.00
Global Rank: #14
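As a reminder of what the reported metric measures, the sketch below computes top-1 accuracy on a toy example and derives the prior state-of-the-art figure implied by the abstract's 83.0 and 6.8. The function name `top1_accuracy` and the toy scores are hypothetical. Only the two numbers 83.0 and 6.8 come from the source.

```python
def top1_accuracy(scores, labels):
    """Percentage of clips whose highest-scoring word is the true word."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this clip's row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data: 4 clips, 3 candidate words (the benchmark itself has 500 words).
toy_scores = [
    [0.1, 0.7, 0.2],  # predicts word 1
    [0.8, 0.1, 0.1],  # predicts word 0
    [0.3, 0.3, 0.4],  # predicts word 2
    [0.2, 0.5, 0.3],  # predicts word 1
]
toy_labels = [1, 0, 2, 0]  # the last clip is misclassified

toy_accuracy = top1_accuracy(toy_scores, toy_labels)

# Prior state of the art implied by the abstract: reported accuracy minus
# the stated absolute improvement.
reported_accuracy = 83.0
absolute_gain = 6.8
implied_previous_best = round(reported_accuracy - absolute_gain, 1)

if __name__ == "__main__":
    print(toy_accuracy)
    print(implied_previous_best)
```

An absolute improvement is a difference in percentage points, so the implied earlier result is a subtraction rather than a ratio.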
