Understanding Translationese in Multi-view Embedding Spaces

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data {---} words, parts of speech, semantic tags and synsets {---} to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not {}just{''} due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.

PDF Abstract

No code implementations yet. Submit your code now

Datasets

Add Datasets introduced or used in this paper

Results from the Paper Add Remove

Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.