1 code implementation • LREC 2022 • Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, Tanja Samardzic
We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing.
no code implementations • VarDial (COLING) 2020 • Iuliia Nigmatulina, Tannon Kew, Tanja Samardzic
A formal comparison shows that the system trained on the normalised transcriptions achieves better results in word error rate (WER) (29.39%) but underperforms at the character level, suggesting dialectal transcriptions offer a viable solution for downstream applications where dialectal differences are important.
Automatic Speech Recognition (ASR)
1 code implementation • 6 Mar 2024 • Tanja Samardzic, Ximena Gutierrez, Christian Bentz, Steven Moran, Olga Pelloni
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.
1 code implementation • EACL 2021 • Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques, Tanja Samardzic
We apply our methodology to analyze the model's decisions on three typologically different languages and find that a) our pattern extraction method applied to cross-attention weights uncovers variation in form of inflection morphemes, b) pattern extraction from self-attention shows triggers for such variation, c) both types of patterns are closely aligned with grammar inflection classes and class assignment criteria, for all three languages.
1 code implementation • EACL 2021 • Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardzic
The distributions of orthographic word types are very different across languages due to typological characteristics, different writing traditions and potentially other factors.