The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
Recent 2D-to-3D human pose estimation works tend to utilize the graph structure formed by the topology of the human skeleton.
This is achieved by first pretraining the network with the proposed pixel-to-prototype contrastive loss over multiple datasets regardless of their taxonomy labels, and then fine-tuning the pretrained model on a specific dataset as usual.
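A pixel-to-prototype contrastive loss of this flavor can be sketched as a cross-entropy over cosine similarities between pixel embeddings and per-class prototype vectors. This is a minimal illustration, not the paper's implementation; the function name, the temperature `tau`, and the prototype bank are assumptions.

```python
import numpy as np

def pixel_to_prototype_loss(pixel_embs, labels, prototypes, tau=0.1):
    """Hypothetical sketch: cross-entropy over cosine similarities
    between each pixel embedding and a bank of class prototypes.

    pixel_embs: (N, D) pixel feature vectors
    labels:     (N,)   class index of each pixel
    prototypes: (C, D) one prototype vector per class
    """
    # L2-normalize so dot products are cosine similarities
    p = pixel_embs / np.linalg.norm(pixel_embs, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = p @ c.T / tau  # (N, C) similarity to every prototype
    # numerically stable log-softmax over prototypes
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # negative log-likelihood of each pixel's own class prototype
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the loss only needs pixel labels and a shared embedding space, it can in principle be computed across datasets with incompatible taxonomies, which is the property the pretraining step relies on.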
We propose a benchmark and a suite of evaluation metrics, some of which reflect the effect of detection on the downstream fingerspelling recognition task.
We derive a universal correction to the ground-state entanglement entropy, which is equal to the logarithm of the total quantum dimension of a set of superselection sectors localized on the domain wall.
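Schematically, such a correction can be written in standard anyon-model notation (the symbols below are illustrative, not the paper's):

```latex
S = \alpha L - \log \mathcal{D}_{\text{wall}},
\qquad
\mathcal{D}_{\text{wall}} = \sqrt{\sum_{a} d_a^{2}},
```

where $\alpha L$ is the non-universal area-law term, the sum runs over the superselection sectors $a$ localized on the domain wall, and $d_a$ is the quantum dimension of sector $a$, so that $\mathcal{D}_{\text{wall}}$ is their total quantum dimension.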
Strongly Correlated Electrons · High Energy Physics - Theory · Quantum Physics
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs.
Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution.
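A common way such systems represent a span is to combine the embeddings of its boundary tokens with a summary of its interior. The sketch below is one standard recipe (endpoint concatenation plus an interior mean), shown for illustration; the function name and the exact combination are assumptions, not taken from the paper.

```python
import numpy as np

def span_representation(token_embs, start, end):
    """Illustrative span encoder: concatenate the start-token and
    end-token embeddings with the mean of the span's tokens.

    token_embs: (T, D) contextual token embeddings
    start, end: inclusive token indices of the span
    Returns a (3 * D,) vector.
    """
    interior = token_embs[start:end + 1].mean(axis=0)  # span summary
    return np.concatenate([token_embs[start], token_embs[end], interior])
```

Representations of this shape can then be fed to a task head, e.g. a classifier that scores candidate answer spans in question answering or candidate mentions in coreference resolution.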
However, these methods are difficult to optimize, and the searched networks are often hardware-unfriendly.
In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media.
Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems.
In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce the complexity of neural-network-based acoustic event detection (AED) models.
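The low-rank-plus-quantization idea can be sketched on a single weight matrix: factor it via truncated SVD into two thin matrices, then uniformly quantize the factors to int8. This is a minimal sketch of the general recipe, not the paper's training procedure (which additionally trains with quantization in the loop); the function and variable names are assumptions.

```python
import numpy as np

def compress_layer(W, rank, num_bits=8):
    """Sketch: truncated-SVD factorization of W (m x n) into
    A (m x rank) @ B (rank x n), followed by uniform symmetric
    quantization of both factors."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into A
    B = Vt[:rank]

    def quantize(M):
        # symmetric uniform quantizer to signed num_bits integers
        scale = np.abs(M).max() / (2 ** (num_bits - 1) - 1)
        return np.round(M / scale).astype(np.int8), scale

    (Aq, sa), (Bq, sb) = quantize(A), quantize(B)
    # dequantized reconstruction, to check the approximation error
    W_hat = (Aq * sa) @ (Bq * sb)
    return W_hat, (Aq, sa), (Bq, sb)
```

Storage drops from `m * n` weights to `rank * (m + n)` int8 values plus two scales, which is where the complexity reduction for on-device AED models comes from.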
This paper presents our work on training acoustic event detection (AED) models using an unlabeled dataset.
Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision.
As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions.