Search Results for author: Éric de la Clergerie

Found 10 papers, 2 papers with code

Building A Corporate Corpus For Threads Constitution

no code implementations RANLP 2021 Lionel Tadonfouet Tadjou, Fabrice Bourge, Tiphaine Marie, Laurent Romary, Éric de la Clergerie

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools.

Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

no code implementations 11 Apr 2024 Nathan Godey, Éric de la Clergerie, Benoît Sagot

In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution.

Language Modelling

On the Scaling Laws of Geographical Representation in Language Models

no code implementations 29 Feb 2024 Nathan Godey, Éric de la Clergerie, Benoît Sagot

Language models have long been shown to embed geographical information in their hidden representations.

Anisotropy Is Inherent to Self-Attention in Transformers

no code implementations 22 Jan 2024 Nathan Godey, Éric de la Clergerie, Benoît Sagot

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers.

Self-Supervised Learning

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

no code implementations 15 Sep 2023 Nathan Godey, Éric de la Clergerie, Benoît Sagot

Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies.

LAMBADA

Is Anisotropy Inherent to Transformers?

no code implementations 13 Jun 2023 Nathan Godey, Éric de la Clergerie, Benoît Sagot

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers.

Self-Supervised Learning

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

no code implementations 14 Dec 2022 Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization.

Language Modelling

Clustering-based Automatic Construction of Legal Entity Knowledge Base from Contracts

no code implementations 18 Nov 2020 Fuqi Song, Éric de la Clergerie

In contract analysis and contract automation, a knowledge base (KB) of legal entities is fundamental for performing tasks such as contract verification, contract generation and contract analytics.

Clustering named-entity-recognition +4

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

1 code implementation LREC 2022 Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English.

Parallel Corpus Mining Sentence +1

Controllable Sentence Simplification

2 code implementations LREC 2020 Louis Martin, Benoît Sagot, Éric de la Clergerie, Antoine Bordes

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical.

Sentence Text Simplification
