Search Results for author: Adrien Barbaresi

Found 13 papers, 4 papers with code

Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

1 code implementation ACL 2021 Adrien Barbaresi

The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.

Que recèlent les données textuelles issues du web ? (What do text data from the Web have to hide?)

no code implementations JEPTALNRECITAL 2020 Adrien Barbaresi, Gaël Lejeune

The opportunistic collection and use of text data taken from the web raise a series of ethical, methodological, and epistemological problems that deserve the attention of the scientific community.

Out-of-the-Box and into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools

no code implementations LREC 2020 Adrien Barbaresi, Gaël Lejeune

This article examines extraction methods designed to retain the main text content of web pages and discusses how the extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction?

Discriminating between Similar Languages using Weighted Subword Features

1 code implementation WS 2017 Adrien Barbaresi

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task.

Language Identification, Text Categorization

An Unsupervised Morphological Criterion for Discriminating Similar Languages

no code implementations WS 2016 Adrien Barbaresi

In this study conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor focusing on the token and subtoken level.

Language Identification, Text Categorization
