1 code implementation • LREC 2022 • Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, Tanja Samardzic
We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing.
1 code implementation • 6 Mar 2024 • Tanja Samardzic, Ximena Gutierrez, Christian Bentz, Steven Moran, Olga Pelloni
Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.
2 code implementations • 22 Aug 2022 • Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Christian Bentz, Ramon Ferrer-i-Cancho
Zipf's law of abbreviation, namely the tendency of more frequent words to be shorter, has been viewed as a manifestation of compression, i.e. the minimization of the length of forms -- a universal principle of natural communication.
1 code implementation • EACL 2021 • Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardzic
The distributions of orthographic word types are very different across languages due to typological characteristics, different writing traditions and potentially other factors.
no code implementations • COLING 2020 • Andrew Caines, Christian Bentz, Kate Knill, Marek Rei, Paula Buttery
We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics.
no code implementations • 4 Jun 2019 • Ramon Ferrer-i-Cancho, Christian Bentz, Caio Seguin
Here we consider the problem of optimal coding -- under an arbitrary coding scheme -- and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter.
no code implementations • WS 2018 • Aleksandrs Berdicevskis, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama, Christian Bentz
We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks.
no code implementations • WS 2016 • Christian Bentz, Aleksandrs Berdicevskis
The morphological complexity of languages differs widely and changes over time.
no code implementations • WS 2016 • Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, Tanja Samardžić
Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing.
no code implementations • 22 Jun 2016 • Christian Bentz, Dimitrios Alikaniotis
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics.
no code implementations • LREC 2016 • Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl, Paula Buttery
We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL).