Search Results for author: Éric de la Clergerie

Found 10 papers, 2 papers with code

Building A Corporate Corpus For Threads Constitution

no code implementations • RANLP 2021 • Lionel Tadonfouet Tadjou, Fabrice Bourge, Tiphaine Marie, Laurent Romary, Éric de la Clergerie

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools.

Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

no code implementations • 11 Apr 2024 • Nathan Godey, Éric de la Clergerie, Benoît Sagot

In this paper, we find that such saturation can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution.

Language Modelling

On the Scaling Laws of Geographical Representation in Language Models

no code implementations • 29 Feb 2024 • Nathan Godey, Éric de la Clergerie, Benoît Sagot

Language models have long been shown to embed geographical information in their hidden representations.

Anisotropy Is Inherent to Self-Attention in Transformers

no code implementations • 22 Jan 2024 • Nathan Godey, Éric de la Clergerie, Benoît Sagot

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers.

Self-Supervised Learning

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

no code implementations • 15 Sep 2023 • Nathan Godey, Éric de la Clergerie, Benoît Sagot

Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies.

LAMBADA

Is Anisotropy Inherent to Transformers?

no code implementations • 13 Jun 2023 • Nathan Godey, Éric de la Clergerie, Benoît Sagot

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers.

Self-Supervised Learning

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

no code implementations • 14 Dec 2022 • Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization.

Language Modelling

Clustering-based Automatic Construction of Legal Entity Knowledge Base from Contracts

no code implementations • 18 Nov 2020 • Fuqi Song, Éric de la Clergerie

In contract analysis and contract automation, a knowledge base (KB) of legal entities is fundamental for performing tasks such as contract verification, contract generation and contract analytics.

Clustering named-entity-recognition +4

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

1 code implementation • LREC 2022 • Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English.

Ranked #2 on Text Simplification on ASSET

Parallel Corpus Mining Sentence +1

Controllable Sentence Simplification

2 code implementations • LREC 2020 • Louis Martin, Benoît Sagot, Éric de la Clergerie, Antoine Bordes

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical.

Ranked #3 on Text Simplification on ASSET

Sentence Text Simplification
