Search Results for author: Marco Cognetta

Found 8 papers, 0 papers with code

Tokenization as Finite-State Transduction

no code implementations • 21 Oct 2024 • Marco Cognetta, Naoaki Okazaki

An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.

Language Modeling

Distributional Properties of Subword Regularization

no code implementations • 21 Aug 2024 • Marco Cognetta, Vilém Zouhar, Naoaki Okazaki

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training.

Machine Translation

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.

Machine Translation, Translation
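
The trimming step described in this entry can be illustrated with a minimal sketch. It is not the paper's implementation: the `merges` table, the `counts` table, the threshold value, and both function names are hypothetical, and the sketch assumes every merged subword records the two pieces it was built from.

```python
from typing import Dict, List, Tuple

def trim_token(token: str,
               merges: Dict[str, Tuple[str, str]],
               counts: Dict[str, int],
               threshold: int) -> List[str]:
    """Split a subword into its component subwords while it is rarer than threshold.

    Assumption: `merges` maps each merged subword to the (left, right) pair it
    was built from. Tokens absent from `merges` are treated as atomic symbols
    and are never split.
    """
    if token not in merges or counts.get(token, 0) >= threshold:
        return [token]
    left, right = merges[token]
    # A component may itself be rare, so trimming is applied recursively.
    return (trim_token(left, merges, counts, threshold)
            + trim_token(right, merges, counts, threshold))

def trim_sequence(tokens: List[str],
                  merges: Dict[str, Tuple[str, str]],
                  counts: Dict[str, int],
                  threshold: int) -> List[str]:
    """Apply threshold trimming to every token of an already tokenized sequence."""
    trimmed: List[str] = []
    for token in tokens:
        trimmed.extend(trim_token(token, merges, counts, threshold))
    return trimmed

# Hypothetical toy data: "lower" is rare, its components are frequent.
merges = {"lo": ("l", "o"), "low": ("lo", "w"), "er": ("e", "r"), "lower": ("low", "er")}
counts = {"lo": 50, "low": 40, "er": 100, "lower": 2}
result = trim_sequence(["lower", "low"], merges, counts, threshold=10)
```

With these toy counts, only `lower` falls below the threshold, so it is replaced by `low` and `er` while the frequent `low` is left intact.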

Two Counterexamples to Tokenization and the Noiseless Channel

no code implementations • 22 Feb 2024 • Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki

In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen.

Machine Translation
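
As a rough illustration of the quantity named in this entry, the sketch below computes the Rényi efficiency of a unigram token distribution, taken here to be the Rényi entropy of order alpha divided by the log of the vocabulary size. The function name, the `counts` input format, and the default `alpha` are illustrative assumptions, not values taken from the listing.

```python
import math
from typing import Dict

def renyi_efficiency(counts: Dict[str, int], alpha: float = 2.5) -> float:
    """Rényi entropy of the unigram distribution, normalized by log |V|.

    Assumptions: `counts` maps every vocabulary item to its corpus frequency,
    |V| is the number of keys in `counts`, and `alpha` is a free parameter
    (the default is only an illustrative choice).
    """
    vocab_size = len(counts)
    total = sum(counts.values())
    if vocab_size < 2 or total <= 0:
        raise ValueError("need at least two vocabulary items and a positive total count")
    probs = [c / total for c in counts.values() if c > 0]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 recovers Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)

# Hypothetical toy distributions: a uniform one and a heavily skewed one.
uniform = renyi_efficiency({"a": 1, "b": 1, "c": 1, "d": 1})
skewed = renyi_efficiency({"a": 97, "b": 1, "c": 1, "d": 1})
```

A uniform distribution reaches the maximum efficiency of 1, while a distribution dominated by a single token scores much lower.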

SoftRegex: Generating Regex from Natural Language Descriptions using Softened Regex Equivalence

no code implementations • IJCNLP 2019 • Jun-U Park, Sang-Ki Ko, Marco Cognetta, Yo-Sub Han

We continue the study of generating semantically correct regular expressions from natural language descriptions (NL).

On the Compression of Lexicon Transducers

no code implementations • WS 2019 • Marco Cognetta, Cyril Allauzen, Michael Riley

Indeed, a delicate balance between comprehensiveness, speed, and memory must be struck to conform to device requirements while providing a good user experience. In this paper, we describe a compression scheme for lexicons when represented as finite-state transducers.

Online Infix Probability Computation for Probabilistic Finite Automata

no code implementations • ACL 2019 • Marco Cognetta, Yo-Sub Han, Soon Chan Kwon

Probabilistic finite automata (PFAs) are common statistical language models in natural language and speech processing.

Language Modeling

Incremental Computation of Infix Probabilities for Probabilistic Finite Automata

no code implementations • EMNLP 2018 • Marco Cognetta, Yo-Sub Han, Soon Chan Kwon

The problem of computing infix probabilities of strings when the pattern distribution is given by a probabilistic context-free grammar or by a probabilistic finite automaton has already been solved, but computing these infix probabilities incrementally remained an open problem.
