no code implementations • 11 Apr 2024 • Nathan Godey, Éric de la Clergerie, Benoît Sagot
In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution.
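The mismatch described here is a rank argument: the logits of a softmax language-model head are a product of hidden states and output embeddings, so their rank is bounded by the hidden dimension. A minimal sketch (hypothetical toy values, not the paper's code) makes this bound concrete:

```python
import numpy as np

# Toy illustration of the rank bottleneck: logits = H @ E.T can have rank
# at most d (the hidden dimension), regardless of vocabulary size, so a
# small d cannot match a high-rank target contextual distribution.
rng = np.random.default_rng(0)
d, vocab, n_ctx = 8, 100, 50            # small hidden dim, larger vocabulary
H = rng.normal(size=(n_ctx, d))         # hidden states for n_ctx contexts
E = rng.normal(size=(vocab, d))         # output embedding matrix
logits = H @ E.T                        # (n_ctx, vocab) logit matrix
print(np.linalg.matrix_rank(logits))    # at most d = 8, far below vocab
```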
no code implementations • 29 Feb 2024 • Nathan Godey, Éric de la Clergerie, Benoît Sagot
Language models have long been shown to embed geographical information in their hidden representations.
no code implementations • 22 Jan 2024 • Nathan Godey, Éric de la Clergerie, Benoît Sagot
The representation degeneration problem is widely observed in Transformer-based self-supervised learning methods.
no code implementations • 15 Sep 2023 • Nathan Godey, Éric de la Clergerie, Benoît Sagot
Self-supervised pre-training of language models usually consists of predicting probability distributions over large token vocabularies.
no code implementations • 13 Jun 2023 • Nathan Godey, Éric de la Clergerie, Benoît Sagot
The representation degeneration problem is widely observed in Transformer-based self-supervised learning methods.
no code implementations • 14 Dec 2022 • Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot
The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization.
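The speed side of this trade-off comes from sequence length: a byte-level model processes one token per byte, while a subword tokenizer emits far fewer units. A hypothetical comparison (crude whitespace splitting stands in for a real subword tokenizer, which is an assumption for illustration only):

```python
# Byte-level models see much longer sequences than subword models,
# which is why subword tokenization is faster per input character.
text = "Self-supervised pre-training of language models"
byte_seq = list(text.encode("utf-8"))   # one token per byte
subword_seq = text.split()              # crude whitespace "subwords"
print(len(byte_seq), len(subword_seq))  # byte sequence is far longer
```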