Speech Recognition

XLSR is a multilingual speech recognition model built on wav2vec 2.0. It is pretrained by solving a contrastive task over masked latent speech representations while jointly learning a quantization of the latents that is shared across languages: a single quantization module over the feature encoder outputs produces multilingual quantized speech units, whose embeddings serve as targets for a Transformer trained by contrastive learning. Because the discrete tokens are shared, the model builds bridges across languages. After pretraining, the model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.

Source: Unsupervised Cross-lingual Representation Learning for Speech Recognition
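The pretraining objective described above can be illustrated with a toy NumPy sketch. This is not the XLSR implementation: the shared codebook here uses hard nearest-neighbor assignment in place of the model's differentiable Gumbel-softmax quantizer, the vectors are random stand-ins for real encoder latents, and all names (`quantize`, `contrastive_loss`, `kappa`) are hypothetical. It shows only the core idea: every language's latents are quantized with the same codebook, and the Transformer's context output must identify the true quantized target among distractors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared codebook: V discrete entries of dimension d, used for every
# language (standing in for the shared quantization module).
V, d = 8, 4
codebook = rng.normal(size=(V, d))

def quantize(latent):
    """Map a continuous latent to its nearest codebook entry.

    Hard assignment here; the real model uses a differentiable
    Gumbel-softmax so the codebook can be learned jointly.
    """
    idx = int(np.argmin(np.linalg.norm(codebook - latent, axis=1)))
    return idx, codebook[idx]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(context, target, distractors, kappa=0.1):
    """InfoNCE-style loss: pick the true quantized target (index 0)
    out of a set of distractor codebook entries."""
    sims = np.array([cosine(context, target)]
                    + [cosine(context, q) for q in distractors])
    logits = sims / kappa
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))

# A masked latent (from any language) is quantized with the shared codebook,
# and the context vector is scored against the target plus distractors.
latent = rng.normal(size=d)
idx, target = quantize(latent)
distractors = [codebook[i] for i in range(V) if i != idx]
loss = contrastive_loss(target, target, distractors)  # well-aligned context
```

A context vector that matches its quantized target yields a low loss; training pushes the Transformer's outputs toward the shared discrete units, which is what lets representations transfer across languages.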




Task                          Papers  Share
Speech Recognition            5       50.00%
Automatic Speech Recognition  2       20.00%
Language Modelling            1       10.00%
Cross-Lingual Transfer        1       10.00%
Quantization                  1       10.00%

