Another Point of View on Visual Speech Recognition

Standard Visual Speech Recognition (VSR) systems process raw images directly, without any a priori link between pixel data and facial traits. Extracting facial landmarks from the images and repurposing them as graph nodes filters the pixel information effectively; their evolution over time can then be modeled by a Graph Convolutional Network. However, graph-based VSR is still in its infancy: the selection of points and their correlations remain ill-defined and are often bound to a priori knowledge and handcrafted techniques. In this paper, we investigate the graph-based approach to VSR and its ability to learn correlations between points beyond the mouth region. We also study the contribution that each facial region brings to system accuracy, showing that more scattered but better-connected graphs can be both computationally light and accurate.
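The paper's actual model is not reproduced here; as an illustration of the core idea, below is a minimal NumPy sketch of facial landmarks used as graph nodes with one standard GCN propagation step (Kipf & Welling normalization). The node count, edge list, and weights are all hypothetical, and the handcrafted edge list is exactly the kind of a priori choice the abstract questions.

```python
import numpy as np

# Hypothetical example: 5 facial landmarks as graph nodes, each carrying
# its (x, y) coordinates as input features.
num_nodes, feat_dim = 5, 2
coords = np.random.rand(num_nodes, feat_dim)  # landmark positions (one frame)

# Handcrafted adjacency: connect landmarks in a ring. In graph-based VSR
# this edge selection is often a priori, which is what the paper examines.
A = np.zeros((num_nodes, num_nodes))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    A[i, j] = A[j, i] = 1.0

# Standard GCN propagation: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)
A_hat = A + np.eye(num_nodes)                        # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt             # symmetric normalization

W = np.random.rand(feat_dim, 8)                      # learnable weights (random here)
H = np.maximum(A_norm @ coords @ W, 0)               # one GCN layer with ReLU

print(H.shape)  # (5, 8): per-landmark embeddings for this frame
```

In a full VSR pipeline, such per-frame node embeddings would then be aggregated over time to model the landmarks' temporal evolution.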

Results from the Paper


Task                      | Dataset | Model                 | Metric         | Value | Global Rank
Landmark-based Lipreading | LRW     | Another Point of View | Top-1 Accuracy | 62.7  | #3
