HALvest is a textual dataset comprising 17 billion tokens in 56 languages and 13 domains.

Although HALvest is mostly in English and French, the 670,861 gathered papers are written in 56 languages across 16 domains in the unfiltered version, amounting to approximately 17 billion tokens. HALvest’s text can also serve as a valuable resource for low-resource languages: it includes documents in Basque, Catalan, and Persian, to mention a few.
