4 dataset results for Text Classification AND Russian

Taiga Corpus (An open-source corpus for machine learning.)

Taiga is a corpus, where text sources and their meta-information are collected according to popular ML tasks.

5 PAPERS • NO BENCHMARKS YET

MuMiN

MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.

4 PAPERS • 3 BENCHMARKS

RESD (Russian Emotional Speech Dialogs with annotated text)

Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced pre-distributed emotions in the dialogue for ~3 minutes each. <br> Each sample of dataset contains name of part from the original dataset studio source, speech file (16000 or 44100Hz) of human voice, 1 of 7 labeled emotions and the speech-to-texted part of voice speech. <br>

0 PAPER • NO BENCHMARKS YET

RuADReCT

RuADReCT (The Russian Adverse Drug Reaction Corpus of Tweets)

Created as part of the Social Media Mining for Health Applications (#SMM4H '20) shared tasks, this dataset consists of 9515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The dataset was a joint effort with the UPenn HLP Center and the Chemoinformatics and Molecular Modeling Research Laboratory at Kazan Federal University.

0 PAPER • NO BENCHMARKS YET

Datasets

4 dataset results for Text Classification AND Russian