no code implementations • 21 Oct 2024 • Marco Cognetta, Naoaki Okazaki
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
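As a rough illustration of the constrained-decoding idea mentioned above (a toy sketch with an assumed DFA and vocabulary, not the paper's algorithm), at each step one can mask out vocabulary items whose characters would drive a pattern automaton into a dead state:

```python
# Toy DFA for the pattern (ab)+ : states 0 and 1, with the accept set below.
DFA = {(0, "a"): 1, (1, "b"): 0}
ACCEPT = {0}
START = 0

def step(state, token):
    """Run the DFA over a (multi-character) token; None means rejection."""
    for ch in token:
        state = DFA.get((state, ch))
        if state is None:
            return None
    return state

def allowed(vocab, state):
    """Tokens that keep the automaton alive (the mask applied to logits)."""
    return [t for t in vocab if step(state, t) is not None]

vocab = ["a", "b", "ab", "ba", "c"]
state, out = START, []
for _ in range(3):  # greedy stand-in for sampling from masked model logits
    choices = allowed(vocab, state)
    tok = max(choices, key=len)  # a real system would score with the LM here
    state, out = step(state, tok), out + [tok]
print("".join(out), "accepted:", state in ACCEPT)
```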
no code implementations • 21 Aug 2024 • Marco Cognetta, Vilém Zouhar, Naoaki Okazaki
Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training.
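For readers unfamiliar with subword regularization, the sketch below shows the standard SentencePiece sampling interface (the model file path "m.model" is an assumed placeholder; the calls are stock SentencePiece, not code from this paper):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # trained unigram model

text = "subword regularization reduces tokenization dependency"
# Deterministic (Viterbi) tokenization:
print(sp.encode(text, out_type=str))
# Sampled tokenizations: each epoch the model sees a different segmentation,
# exposing it to more unique contexts during training.
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True,
                    alpha=0.1, nbest_size=-1))
```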
no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.
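A hypothetical sketch of what such trimming can look like (the function names and threshold are illustrative assumptions, not the paper's exact procedure): subwords rarer than a cutoff are recursively split back into the pair they were merged from.

```python
from collections import Counter

def trim(tokens, freq: Counter, parents: dict, threshold: int):
    """Recursively split tokens whose corpus frequency falls below threshold."""
    out = []
    for tok in tokens:
        if freq[tok] < threshold and tok in parents:
            out.extend(trim(list(parents[tok]), freq, parents, threshold))
        else:
            out.append(tok)
    return out

# BPE merge history: each merged subword maps to its two component subwords.
parents = {"low": ("lo", "w"), "lower": ("low", "er")}
freq = Counter({"lower": 2, "low": 50, "lo": 80, "w": 90, "er": 70})
print(trim(["lower"], freq, parents, threshold=10))  # -> ['low', 'er']
```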
no code implementations • 22 Feb 2024 • Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki
In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen.
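A worked sketch of the quantity in question (assuming the standard definition: Rényi entropy normalized by the maximum entropy log |V|; see Zouhar et al. (2023a) for the exact formulation and recommended α):

```python
import math
from collections import Counter

def renyi_efficiency(tokens, alpha=2.5):
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:  # Shannon entropy as the limiting case
        h = -sum(p * math.log(p) for p in probs)
    else:
        h = math.log(sum(p ** alpha for p in probs)) / (1 - alpha)
    return h / math.log(len(probs))  # normalize by max entropy log|V|

print(renyi_efficiency("the cat sat on the mat the end".split()))
```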
no code implementations • IJCNLP 2019 • Jun-U Park, Sang-Ki Ko, Marco Cognetta, Yo-Sub Han
We continue the study of generating semantically correct regular expressions from natural language descriptions (NL).
no code implementations • WS 2019 • Marco Cognetta, Cyril Allauzen, Michael Riley
Indeed, a delicate balance between comprehensiveness, speed, and memory must be struck to conform to device requirements while providing a good user experience. In this paper, we describe a compression scheme for lexicons when represented as finite-state transducers.
no code implementations • ACL 2019 • Marco Cognetta, Yo-Sub Han, Soon Chan Kwon
Probabilistic finite automata (PFAs) are common statistical language models in natural language and speech processing.
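As background, the probability a PFA assigns to a string is the product of its transition matrices sandwiched between initial and final weight vectors. This is the standard textbook construction, not this paper's algorithm:

```python
import numpy as np

# Two states, alphabet {a, b}; per-state outgoing + acceptance mass sums to 1.
M = {
    "a": np.array([[0.4, 0.1], [0.0, 0.3]]),
    "b": np.array([[0.1, 0.2], [0.3, 0.1]]),
}
initial = np.array([1.0, 0.0])   # start in state 0
final = np.array([0.2, 0.3])     # per-state acceptance probabilities

def string_probability(w):
    v = initial
    for sym in w:           # left-to-right product of transition matrices
        v = v @ M[sym]
    return float(v @ final)

print(string_probability("ab"))  # -> 0.041
```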
no code implementations • EMNLP 2018 • Marco Cognetta, Yo-Sub Han, Soon Chan Kwon
The problem of computing infix probabilities of strings when the pattern distribution is given by a probabilistic context-free grammar or by a probabilistic finite automaton is already solved, yet computing infix probabilities incrementally remained an open problem.
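For context, the known non-incremental baseline can be sketched as follows (a standard automata-theoretic construction under assumed toy parameters, not the incremental algorithm contributed by the paper): intersect the PFA with a DFA for Σ*vΣ* and solve a linear system for the total probability mass of accepted strings.

```python
import numpy as np

# PFA over {a, b}: per-state outgoing + acceptance mass sums to 1.
M = {"a": np.array([[0.4, 0.1], [0.0, 0.3]]),
     "b": np.array([[0.1, 0.2], [0.3, 0.1]])}
iota = np.array([1.0, 0.0])
rho = np.array([0.2, 0.3])

# DFA for Sigma* ab Sigma* (tracks progress toward seeing the infix "ab").
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 2, (2, "b"): 2}
accepting = {2}
nP, nD = 2, 3

def product_matrix(sym):
    """Transition matrix of the PFA x DFA product for one symbol."""
    T = np.zeros((nP * nD, nP * nD))
    for d in range(nD):
        d2 = delta[(d, sym)]
        T[d * nP:(d + 1) * nP, d2 * nP:(d2 + 1) * nP] = M[sym]
    return T

A = sum(product_matrix(s) for s in "ab")
iota_p = np.zeros(nP * nD); iota_p[0:nP] = iota          # DFA starts in 0
rho_p = np.zeros(nP * nD)
for d in accepting:
    rho_p[d * nP:(d + 1) * nP] = rho
# Infix probability: total mass of all strings the product automaton accepts.
print(iota_p @ np.linalg.solve(np.eye(nP * nD) - A, rho_p))
```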