Are Language-Agnostic Sentence Representations Actually Language-Agnostic?

RANLP 2021 · Yu Chen, Tania Avgustinova

With the emergence of pre-trained multilingual models, multilingual embeddings have been widely applied in various natural language processing tasks. Language-agnostic models provide a versatile way to convert linguistic units from different languages into a shared vector representation space. Prior work on multilingual sentence embeddings has reported low error rates in cross-lingual similarity search tasks. In this paper, we apply pre-trained embedding models to the cross-lingual similarity search task in diverse scenarios and observe large discrepancies from the results reported in the original paper. Our findings on cross-lingual similarity search with newly constructed multilingual datasets show not only a correlation with observable language similarities but also a strong influence from factors such as translation paths, which limits the interpretation of the language-agnostic property of the LASER model.
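
As a concrete illustration, the cross-lingual similarity search task evaluated here can be run as a short script: embed source sentences and their translations into the shared space, then check whether each source sentence's nearest neighbour on the target side is its own translation. The sketch below is a minimal setup assuming the third-party `laserembeddings` package (a Python port of the LASER toolkit) with its model files downloaded; the sample sentences are illustrative, not drawn from the paper's datasets.

```python
# Minimal sketch of cross-lingual similarity search with LASER embeddings.
# Assumes the third-party `laserembeddings` package (pip install laserembeddings)
# and its downloaded model files; the sentences below are illustrative only.
import numpy as np
from laserembeddings import Laser

laser = Laser()

# Parallel sentences: tgt[i] is the translation of src[i].
src = ["The cat sits on the mat.", "It is raining today.", "I like green tea."]
tgt = ["Die Katze sitzt auf der Matte.", "Es regnet heute.", "Ich mag grünen Tee."]

# Map both sides into the shared embedding space (one 1024-dim vector per sentence).
src_vecs = laser.embed_sentences(src, lang="en")
tgt_vecs = laser.embed_sentences(tgt, lang="de")

# Cosine similarity between every source/target pair.
src_norm = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
sim = src_norm @ tgt_norm.T

# A search error occurs when the nearest target sentence is not
# the gold translation (the diagonal entry of the similarity matrix).
pred = sim.argmax(axis=1)
error_rate = float(np.mean(pred != np.arange(len(src))))
print(f"similarity-search error rate: {error_rate:.2%}")
```

Running such a script over parallel sets built with different translation paths and language pairs is, in essence, how the discrepancies described in the abstract can surface: the error rate is not a property of the model alone but also of how the evaluation data was constructed.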
