XLSR is a multilingual speech recognition model built on wav2vec 2.0. It is pretrained by solving a contrastive task over masked latent speech representations while jointly learning a quantization of those latents that is shared across languages: a shared quantization module over the feature encoder outputs produces multilingual quantized speech units, whose embeddings serve as targets for a Transformer trained by contrastive learning. Because the discrete tokens are shared, they act as bridges between languages. After pretraining, the model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
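To make the objective concrete, below is a minimal PyTorch sketch of an InfoNCE-style contrastive loss in the spirit of wav2vec 2.0/XLSR. The function name, the toy tensors, and the negative-sampling scheme are illustrative assumptions, not the authors' implementation; the paper additionally excludes the true target from the negatives, uses Gumbel-softmax quantization, and adds a codebook diversity loss.

```python
import torch
import torch.nn.functional as F

def xlsr_contrastive_loss(context, quantized, mask, num_negatives=10, temperature=0.1):
    """Simplified contrastive loss: identify the true quantized latent
    for each masked frame among distractors from the same utterance.

    context:   (T, D) Transformer outputs
    quantized: (T, D) quantized latent targets from the shared codebook
    mask:      (T,) boolean, True where the latent was masked
    """
    c = context[mask]    # (M, D) contextualized outputs at masked steps
    q = quantized[mask]  # (M, D) true quantized targets
    M = c.size(0)

    # Sample negatives uniformly from other masked frames (the paper also
    # excludes the positive itself; omitted here for brevity).
    neg_idx = torch.randint(0, M, (M, num_negatives))
    negatives = q[neg_idx]                                      # (M, K, D)
    candidates = torch.cat([q.unsqueeze(1), negatives], dim=1)  # (M, 1+K, D)

    # Cosine similarity between each context vector and its candidates.
    logits = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temperature

    # The true target is always at index 0.
    targets = torch.zeros(M, dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for real features.
T, D = 50, 256
context, quantized = torch.randn(T, D), torch.randn(T, D)
mask = torch.rand(T) > 0.5
print(xlsr_contrastive_loss(context, quantized, mask))
```

Because the quantized targets come from a single codebook used for all languages, minimizing this loss encourages acoustically similar sounds from different languages to map to the same discrete units.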
Source: Unsupervised Cross-lingual Representation Learning for Speech Recognition
| Task | Papers | Share |
|---|---|---|
| Speech Recognition | 15 | 26.79% |
| Automatic Speech Recognition (ASR) | 7 | 12.50% |
| Language Modeling | 4 | 7.14% |
| Language Modelling | 4 | 7.14% |
| Translation | 3 | 5.36% |
| Language Identification | 2 | 3.57% |
| Spoken language identification | 2 | 3.57% |
| Cross-Lingual Transfer | 2 | 3.57% |
| Speech Representation Learning | 2 | 3.57% |