WikiAnn is a dataset for cross-lingual name tagging and linking based on Wikipedia articles in 295 languages.
54 PAPERS • 7 BENCHMARKS
Introduces three datasets covering hateful expressions, commonly discussed topics, and opinions, intended for hate speech detection, document classification, and sentiment analysis, respectively.
6 PAPERS • NO BENCHMARKS YET
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
3 PAPERS • NO BENCHMARKS YET
We provide a Mikolov-style word-analogy evaluation set built specifically for Bangla, containing 16,678 samples, as well as a translated and curated version of the Mikolov dataset, containing 10,594 samples, for cross-lingual research.
1 PAPER • NO BENCHMARKS YET
0 PAPER • NO BENCHMARKS YET