Evaluating Tokenizers Impact on OOVs Representation with Transformers Models

1 code implementation LREC 2022 Alexandra Benamar, Cyril Grouin, Meryl Bothua, Anne Vilnat

Our experiments show that: (1) it is easier to improve the representation of new words (A and B) than that of words that already exist in the vocabulary of the Transformer models (C); (2) to improve the representation of OOVs, the most effective method relies on adding external morpho-syntactic context rather than on improving the semantic understanding of the words directly (fine-tuning); and (3) the impact of minor misspellings in words cannot be foreseen, because similar misspellings have different impacts on their representation.

Anatomy Domain Adaptation

MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents

no code implementations LEGAL (LREC) 2022 Victoria Arranz, Khalid Choukri, Montse Cuadros, Aitor García Pablos, Lucie Gianola, Cyril Grouin, Manuel Herranz, Patrick Paroubek, Pierre Zweigenbaum

This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques.

De-identification named-entity-recognition +2

Impact du français inclusif sur les outils du TAL (Impact of French Inclusive Language on NLP Tools)

no code implementations JEP/TALN/RECITAL 2022 Cyril Grouin

Inclusive French is a variety of standard French promoted to show an awareness of gender and identity.

Inference Annotation of a Chinese Corpus for Opinion Mining

no code implementations LREC 2020 Liyun Yan, Danni E, Mei Gan, Cyril Grouin, Mathieu Valette

Polarity classification (positive, negative or neutral opinion detection) is well developed in the field of opinion mining.

Classification General Classification +3

Community Perspective on Replicability in Natural Language Processing

no code implementations RANLP 2019 Margot Mieskes, Karën Fort, Aurélie Névéol, Cyril Grouin, Kevin Cohen

With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises how the NLP community views the topic of replicability in general.

Clinical Case Reports for NLP

no code implementations WS 2019 Cyril Grouin, Natalia Grabar, Vincent Claveau, Thierry Hamon

Thus, we manually annotated a set of 717 files into four general categories (age, gender, outcome, and origin) for a total of 2,835 annotations.

Corpus annoté de cas cliniques en français (Annotated corpus with clinical cases in French)

no code implementations JEPTALNRECITAL 2019 Natalia Grabar, Cyril Grouin, Thierry Hamon, Vincent Claveau

To address this challenge, we present in this article the CAS corpus, which contains clinical cases of real or fictitious patients that we have compiled. These clinical cases in French cover several medical specialties and therefore focus on different clinical situations.

Simplification de schémas d'annotation : un aller sans retour ? (Annotation scheme simplification: a one-way trip with no return?)

no code implementations JEPTALNRECITAL 2018 Cyril Grouin

We also study the possibility of recovering the level of detail of the named entity types of the original scheme from the simplified versions.


Traitement automatique de la langue biomédicale au LIMSI (Biomedical language processing at LIMSI)

no code implementations JEPTALNRECITAL 2017 Christopher Norman, Cyril Grouin, Thomas Lavergne, Aurélie Névéol, Pierre Zweigenbaum

We offer demonstrations of three tools developed by LIMSI in natural language processing applied to the biomedical domain: the detection of medical concepts in short texts, the categorization of scientific articles to assist in the writing of systematic reviews, and the anonymization of clinical texts.

A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage

no code implementations WS 2016 Thomas Lavergne, Aurélie Névéol, Aude Robert, Cyril Grouin, Grégoire Rey, Pierre Zweigenbaum

Very few datasets have been released for the evaluation of diagnosis coding with the International Classification of Diseases, and only one so far in a language other than English.

Named Entity Recognition (NER)

Detection of Text Reuse in French Medical Corpora

no code implementations WS 2016 Eva D'hondt, Cyril Grouin, Aurélie Névéol, Efstathios Stamatatos, Pierre Zweigenbaum

Electronic Health Records (EHRs) are increasingly available in modern health care institutions, either through the direct creation of electronic documents in hospitals' health information systems or through the digitization of historical paper records.

De-identification Optical Character Recognition (OCR)

Une catégorisation de fins de lignes non-supervisée (End-of-line classification with no supervision)

no code implementations JEPTALNRECITAL 2016 Pierre Zweigenbaum, Cyril Grouin, Thomas Lavergne

We propose a fully unsupervised method to determine whether an end of line should be seen as a simple space or as a true textual unit boundary, and we test it on a corpus of medical reports.

Text Segmentation of Digitized Clinical Texts

no code implementations LREC 2016 Cyril Grouin

We achieved our best results with a model trained on homogeneous corpora (only files composed of 2 columns) when classifying each token into left or right columns (overall F-measure of 0.968).

Segmentation Text Segmentation

Controlled Propagation of Concept Annotations in Textual Corpora

no code implementations LREC 2016 Cyril Grouin

In this paper, we presented the annotation propagation tool we designed to be used in conjunction with the BRAT rapid annotation tool.

Identification of Drug-Related Medical Conditions in Social Media

no code implementations LREC 2016 François Morlane-Hondère, Cyril Grouin, Pierre Zweigenbaum

When trained on the output of the first classifier, the second classifier's performance is as follows: P = 0.683, R = 0.956, F1 = 0.797.
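
As a quick consistency check on the figures above, the reported F1 can be recomputed from the reported precision and recall. This is a minimal sketch, assuming F1 here is the standard harmonic mean of precision and recall; the function name `f1_score` is ours for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the second classifier.
f1 = f1_score(0.683, 0.956)
print(round(f1, 3))
```

Under that assumption, the recomputed value rounds to the reported 0.797, so the three figures agree with one another.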

Étude des verbes introducteurs de noms de médicaments dans les forums de santé (Study of verbs introducing drug names in health forums)

no code implementations JEPTALNRECITAL 2015 François Morlane-Hondère, Cyril Grouin, Pierre Zweigenbaum

We believe that analyzing these variants could make it possible to model the errors made by forum users when they write drug names, and consequently to improve information retrieval systems.

Identification de facteurs de risque pour des patients diabétiques à partir de comptes-rendus cliniques par des approches hybrides (Identification of risk factors for diabetic patients from clinical reports using hybrid approaches)

no code implementations JEPTALNRECITAL 2015 Cyril Grouin, Véronique Moriceau, Sophie Rosset, Pierre Zweigenbaum

In this article, we present the methods we developed to analyze hospital reports written in English.

Morpho-Syntactic Study of Errors from Speech Recognition System

no code implementations LREC 2014 Maria Goryainova, Cyril Grouin, Sophie Rosset, Ioana Vasilescu

The study provides an original standpoint of the speech transcription errors by focusing on the morpho-syntactic features of the erroneous chunks and of the surrounding left and right context.

Named Entity Recognition (NER) POS +3

Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports

no code implementations LREC 2014 Maria Evangelia Chatzimina, Cyril Grouin, Pierre Zweigenbaum

We design and test two syntax-based methods to produce word classes: one applies the Brown clustering algorithm to syntactic dependencies, the other collects latent categories created by a PCFG-LA parser.

Chunking Clustering +2

Annotation of specialized corpora using a comprehensive entity and relation scheme

no code implementations LREC 2014 Louise Deléger, Anne-Laure Ligozat, Cyril Grouin, Pierre Zweigenbaum, Aurélie Névéol

We present the annotation scheme as well as the results of a pilot annotation study covering 35 clinical documents in a variety of subfields and genres.


Extended Named Entities Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign

no code implementations LREC 2012 Olivier Galibert, Sophie Rosset, Cyril Grouin, Pierre Zweigenbaum, Ludovic Quintard

Within the framework of the Quaero project, we proposed a new definition of named entities, based upon an extension of the coverage of named entities as well as the structure of those named entities.

Named Entity Recognition (NER) Optical Character Recognition (OCR)

