In this paper, we propose novel stochastic modeling of various components of a continuous sign language recognition (CSLR) system that is based on the transformer encoder and connectionist temporal classification (CTC).
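As background, CTC is what lets such a model learn from sentence-level gloss labels without frame-level alignment. Below is a minimal sketch of a transformer-encoder CSLR head trained with CTC in PyTorch; the dimensions, vocabulary size, and module layout are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative transformer encoder over pre-extracted frame features,
# followed by a linear gloss classifier (+1 output for the CTC blank).
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)
classifier = nn.Linear(512, 1000 + 1)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

frame_feats = torch.randn(2, 100, 512)              # (batch, time, feat_dim), dummy
logits = classifier(encoder(frame_feats))           # (batch, time, vocab+1)
log_probs = logits.log_softmax(-1).transpose(0, 1)  # (time, batch, classes) for CTCLoss

glosses = torch.randint(1, 1001, (2, 12))           # dummy gloss label sequences
input_lens = torch.full((2,), 100, dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)
loss = ctc_loss(log_probs, glosses, input_lens, target_lens)
```

CTC marginalizes over all monotonic alignments between the 100 frame-level predictions and the 12 target glosses, which is what removes the need for frame-level annotation.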
Sign languages are visual languages that convey information through signers' handshapes, facial expressions, body movements, and so forth.
Most lip-to-speech (LTS) synthesis models are trained and evaluated under the assumption that the audio-video pairs in the dataset are perfectly synchronized.
We refer to the CSLR model trained with the above auxiliary tasks as consistency-enhanced CSLR; it performs well on signer-dependent datasets, in which all signers appear during both training and testing.
RGB videos, however, are raw signals with substantial visual redundancy, which can lead the encoder to overlook information that is key to sign language understanding.
The backbone of most deep-learning-based continuous sign language recognition (CSLR) models consists of a visual module, a sequential module, and an alignment module.
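A skeletal version of this decomposition is sketched below; the concrete layers (a small CNN as the visual module, a BiLSTM as the sequential module, a CTC-trained linear layer as the alignment module) are common choices assumed for illustration, not any specific model's design.

```python
import torch.nn as nn

class CSLRBackbone(nn.Module):
    """Generic CSLR skeleton: visual -> sequential -> alignment."""
    def __init__(self, vocab_size=1000, hidden=512):
        super().__init__()
        # Visual module: per-frame feature extraction (tiny CNN stand-in).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, hidden))
        # Sequential module: temporal modeling over the frame features.
        self.sequential = nn.LSTM(hidden, hidden // 2, bidirectional=True,
                                  batch_first=True)
        # Alignment module: per-frame gloss logits, trained with CTC (+1 blank).
        self.alignment = nn.Linear(hidden, vocab_size + 1)

    def forward(self, frames):                  # (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.visual(frames.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.sequential(feats)
        return self.alignment(seq)              # (batch, time, vocab+1)
```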
LASER, one of the current state-of-the-art multilingual document embedding models, is based on a bidirectional LSTM neural machine translation model.
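LASER derives a fixed-size sentence embedding by max-pooling over the BiLSTM encoder's hidden states; the sketch below illustrates that pooling scheme, with the vocabulary and layer sizes chosen arbitrarily rather than taken from the released model.

```python
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    """LASER-style encoder: BiLSTM over token embeddings, max-pooled
    into a fixed-size sentence vector (all dimensions illustrative)."""
    def __init__(self, vocab_size=50000, emb_dim=320, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                              batch_first=True)

    def forward(self, token_ids):               # (batch, seq_len)
        states, _ = self.bilstm(self.embed(token_ids))  # (batch, seq, 2*hidden)
        return states.max(dim=1).values         # max-pool -> (batch, 2*hidden)
```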
Speaker similarity is high for native speech from native speakers.
This paper further adds a distance constraint to the training objective function of NV, requiring the two embeddings of a parallel document to be as close as possible.
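In spirit, such a constraint adds a penalty on the distance between the two embeddings to the base objective. The sketch below shows one plausible form; the L2 penalty, the `lambda_dist` weight, and the function names are assumptions for illustration, since NV's exact formulation is not given here.

```python
import torch.nn.functional as F

def distance_penalty(emb_src, emb_tgt):
    # Mean squared L2 distance between the two embeddings of a parallel document.
    return F.mse_loss(emb_src, emb_tgt)

def constrained_loss(base_loss, emb_src, emb_tgt, lambda_dist=1.0):
    # Base training objective plus the weighted distance term
    # (lambda_dist trades off the two; the name is illustrative).
    return base_loss + lambda_dist * distance_penalty(emb_src, emb_tgt)
```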
In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector.
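For concreteness, this is how such a representation is typically computed, shown here with scikit-learn's TfidfVectorizer on three toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "sign language recognition with transformers",
    "multilingual document embedding with bidirectional LSTMs",
    "recognition of continuous sign language",
]

# Learn the vocabulary and compute the TF-IDF weighted bag-of-words matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # sparse, (num documents, vocabulary size)

print(tfidf.shape)
print(vectorizer.get_feature_names_out())   # the learned vocabulary
```

Each row is one document's TF-IDF vector, so word order is discarded and only (weighted) term counts remain, which is exactly the bag-of-words assumption.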