Search Results for author: Christian Bentz

Found 12 papers, 2 papers with code

TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP

no code implementations • LREC 2022 • Steven Moran, Christian Bentz, Ximena Gutierrez-Vasques, Olga Pelloni, Tanja Samardzic

We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing.

Multilingual NLP

A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets

no code implementations • 6 Mar 2024 • Tanja Samardzic, Ximena Gutierrez, Christian Bentz, Steven Moran, Olga Pelloni

Typologically diverse benchmarks are increasingly created to track the progress achieved in multilingual NLP.

Multilingual NLP

The optimality of word lengths. Theoretical foundations and an empirical study

2 code implementations • 22 Aug 2022 • Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Christian Bentz, Ramon Ferrer-i-Cancho

Zipf's law of abbreviation, namely the tendency of more frequent words to be shorter, has been viewed as a manifestation of compression, i.e., the minimization of the length of forms -- a universal principle of natural communication.

From characters to words: the turning point of BPE merges

1 code implementation • EACL 2021 • Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardzic

The distributions of orthographic word types are very different across languages due to typological characteristics, different writing traditions and potentially other factors.

Grammatical error detection in transcriptions of spoken English

no code implementations • COLING 2020 • Andrew Caines, Christian Bentz, Kate Knill, Marek Rei, Paula Buttery

We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics.

Grammatical Error Detection

Optimal coding and the origins of Zipfian laws

no code implementations • 4 Jun 2019 • Ramon Ferrer-i-Cancho, Christian Bentz, Caio Seguin

Here we consider the problem of optimal coding -- under an arbitrary coding scheme -- and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter.

Using Universal Dependencies in cross-linguistic complexity research

no code implementations • WS 2018 • Aleksandrs Berdicevskis, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama, Christian Bentz

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks.

Learning pressures reduce morphological complexity: Linking corpus, computational and experimental evidence

no code implementations • WS 2016 • Christian Bentz, Aleksandrs Berdicevskis

The morphological complexity of languages differs widely and changes over time.

A Comparison Between Morphological Complexity Measures: Typological Data vs. Language Corpora

no code implementations • WS 2016 • Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, Tanja Samardžić

Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing.

Machine Translation • Word Alignment

The word entropy of natural languages

no code implementations • 22 Jun 2016 • Christian Bentz, Dimitrios Alikaniotis

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics.

Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora

no code implementations • LREC 2016 • Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl, Paula Buttery

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL).

Sentence valid

Towards a computational model of grammaticalization and lexical diversity

no code implementations • WS 2014 • Christian Bentz, Paula Buttery
