Search Results for author: Christian Bentz

Found 12 papers, 4 papers with code

The optimality of word lengths. Theoretical foundations and an empirical study

2 code implementations 22 Aug 2022 Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Christian Bentz, Ramon Ferrer-i-Cancho

Zipf's law of abbreviation, namely the tendency of more frequent words to be shorter, has been viewed as a manifestation of compression, i.e. the minimization of the length of forms -- a universal principle of natural communication.

From characters to words: the turning point of BPE merges

1 code implementation EACL 2021 Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, Tanja Samardzic

The distributions of orthographic word types are very different across languages due to typological characteristics, different writing traditions and potentially other factors.

Diversity

Grammatical error detection in transcriptions of spoken English

no code implementations COLING 2020 Andrew Caines, Christian Bentz, Kate Knill, Marek Rei, Paula Buttery

We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics.

Grammatical Error Detection

Optimal coding and the origins of Zipfian laws

no code implementations 4 Jun 2019 Ramon Ferrer-i-Cancho, Christian Bentz, Caio Seguin

Here we consider the problem of optimal coding -- under an arbitrary coding scheme -- and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter.

The word entropy of natural languages

no code implementations 22 Jun 2016 Christian Bentz, Dimitrios Alikaniotis

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics.

Semantic Similarity, Semantic Textual Similarity

Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora

no code implementations LREC 2016 Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl, Paula Buttery

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL).

Sentence valid
