Unpaired data has been shown to benefit low-resource automatic speech recognition~(ASR); it can be incorporated into the design of hybrid models through multi-task training or language-model-dependent pre-training.
The proposed approach exploits both the complementarity of the audio and visual modalities and long-term contextual dependencies via a Transformer-based fusion module and a flexible masking strategy.
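A minimal PyTorch sketch of such a fusion module follows; the class name, feature dimensions, and the frame-level masking scheme are illustrative assumptions rather than the exact published design.

```python
import torch
import torch.nn as nn


class AVFusion(nn.Module):
    """Hypothetical Transformer-based audio-visual fusion module.

    Audio and visual frame sequences are projected to a shared dimension,
    concatenated along time, and jointly encoded so that self-attention
    can capture long-range cross-modal context.
    """

    def __init__(self, d_audio=512, d_visual=256, d_model=256,
                 nhead=4, num_layers=4):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_v = nn.Linear(d_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.mask_emb = nn.Parameter(torch.zeros(d_model))  # learned mask token

    def forward(self, audio, visual, p_mask=0.3):
        # audio: (B, T_a, d_audio), visual: (B, T_v, d_visual)
        a = self.proj_a(audio)
        v = self.proj_v(visual)
        x = torch.cat([a, v], dim=1)  # fuse by concatenating along time
        if self.training and p_mask > 0:
            # "Flexible" masking (assumed scheme): randomly replace frames
            # from either modality with a learned mask embedding, forcing
            # the encoder to fill them in from cross-modal context.
            mask = torch.rand(x.shape[:2], device=x.device) < p_mask
            x = torch.where(mask.unsqueeze(-1), self.mask_emb, x)
        return self.encoder(x)


# Usage: 100 audio frames and 25 video frames for a batch of 2.
fused = AVFusion()(torch.randn(2, 100, 512), torch.randn(2, 25, 256))
print(fused.shape)  # torch.Size([2, 125, 256])
```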
In this work, we therefore first analyze the noise robustness of wav2vec 2.0 experimentally.
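One straightforward probe of noise robustness (an illustrative protocol, not necessarily the one used in this work) is to corrupt test utterances with additive Gaussian noise at controlled SNRs and decode them with an off-the-shelf wav2vec 2.0 checkpoint; the checkpoint name and SNR grid below are assumptions.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()


def add_noise(wave: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add white Gaussian noise to a mono waveform at a target SNR (dB)."""
    noise = torch.randn_like(wave)
    speech_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise


def transcribe(wave: torch.Tensor) -> str:
    """Greedy CTC decoding of a 16 kHz waveform."""
    inputs = processor(wave.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]


# Decode the same utterance at several SNRs; a real study would
# aggregate WER over a full test set rather than a single example.
wave = torch.randn(16_000)  # placeholder for one second of 16 kHz speech
for snr in [20, 10, 5, 0]:
    print(snr, "dB:", transcribe(add_noise(wave, snr)))
```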
In this paper, we propose a weakly supervised multilingual representation learning framework called cross-lingual self-training (XLST).
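The sentence above only names the framework; a generic cross-lingual self-training loop, under the assumption that a teacher trained on a high-resource language provides frame-level targets that a multilingual student learns to match on unlabeled speech, might look like the following sketch (the encoder choice and cosine objective are illustrative, not the paper's confirmed recipe).

```python
import torch
import torch.nn.functional as F

# Hypothetical frame-level encoders; any speech encoder would do here.
teacher = torch.nn.GRU(80, 256, batch_first=True)  # assumed pre-trained on a high-resource language
student = torch.nn.GRU(80, 256, batch_first=True)  # trained on unlabeled multilingual speech
optim = torch.optim.Adam(student.parameters(), lr=1e-4)


def xlst_step(feats: torch.Tensor) -> torch.Tensor:
    """One self-training step on a batch of unlabeled features (B, T, 80)."""
    with torch.no_grad():
        targets, _ = teacher(feats)  # frozen teacher supplies frame-level targets
    preds, _ = student(feats)
    # Maximize frame-wise cosine similarity between student and teacher.
    loss = 1 - F.cosine_similarity(preds, targets, dim=-1).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss


loss = xlst_step(torch.randn(4, 200, 80))  # a batch of 4 utterances, 200 frames each
```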