WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles that have been annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format¹². This dataset serves as a valuable resource for training and evaluating named entity recognition models across various languages.
66 PAPERS • 4 BENCHMARKS
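As a rough illustration of the IOB2 format mentioned in the WikiANN entry: each token carries a tag, where `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks a token outside any entity. The sketch below groups tagged tokens into entity spans. The function name `iob2_to_spans` and the sample sentence are hypothetical and are not taken from the dataset or its tooling; joining tokens with spaces is a simplifying assumption that does not suit languages written without whitespace.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, text) pairs.

    Hypothetical helper for illustration only. An I- tag that does not
    continue an open entity of the same type is treated like "O".
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def close_entity() -> None:
        # Emit the entity collected so far, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always starts a new entity, closing any open one.
            close_entity()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends the open entity.
            close_entity()
            current_type = None
            current_tokens = []

    close_entity()
    return spans


# Made-up example sentence using the three WikiANN entity types.
example_tokens = ["Marie", "Curie", "worked", "at", "the",
                  "University", "of", "Paris", "in", "France"]
example_tags = ["B-PER", "I-PER", "O", "O", "O",
                "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
example_spans = iob2_to_spans(example_tokens, example_tags)
```

For the sample sentence, `example_spans` contains one PER, one ORG, and one LOC entity, in that order.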
OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a large multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The version of the dataset used to train multilingual models such as BART contains 138 GB of text.
64 PAPERS • NO BENCHMARKS YET
The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing the digits 0 to 9. Each pair includes a low-quality and a high-quality version of the image, at a resolution of 128x128 pixels.
1 PAPER • NO BENCHMARKS YET
HALvest is a textual dataset comprising 17 billion tokens in 56 languages and 13 domains.
Ancient books script identification of Chinese ethnic minorities with deep convolutional neural networks via multi-branch and spatial pyramid pooling
0 PAPERS • NO BENCHMARKS YET