C2SLR: Consistency-Enhanced Continuous Sign Language Recognition

CVPR 2022  ·  Ronglai Zuo, Brian Mak ·

The backbone of most deep-learning-based continuous sign language recognition (CSLR) models consists of a visual module, a sequential module, and an alignment module. However, such CSLR backbones are hard to be trained sufficiently with a single connectionist temporal classification loss. In this work, we propose two auxiliary constraints to enhance the CSLR backbones from the perspective of consistency. The first constraint aims to enhance the visual module, which easily suffers from the insufficient training problem. Specifically, since sign languages convey information mainly with signers' faces and hands, we insert a keypoint-guided spatial attention module into the visual module to enforce it to focus on informative regions, i.e., spatial attention consistency. Nevertheless, only enhancing the visual module may not fully exploit the power of the backbone. Motivated by that both the output features of the visual and sequential modules represent the same sentence, we further impose a sentence embedding consistency constraint between them to enhance the representation power of both the features. Experimental results over three representative backbones validate the effectiveness of the two constraints. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on three benchmarks, PHOENIX-2014, PHOENIX-2014-T, and CSL.

PDF Abstract
Task Dataset Model Metric Name Metric Value Global Rank Benchmark
Sign Language Recognition RWTH-PHOENIX-Weather 2014 C2SLR Word Error Rate (WER) 20.4 # 5
Sign Language Recognition RWTH-PHOENIX-Weather 2014 T C2SLR Word Error Rate (WER) 20.4 # 3