StoryDB: Broad Multi-language Narrative Dataset

This paper presents StoryDB, a broad multi-language dataset of narratives. StoryDB is a corpus of stories in 42 languages. Every language is represented by at least 500 stories, and some by more than 20,000. Every story is indexed across languages and labeled with tags such as genre or topic. The corpus shows rich topical and linguistic variation and can serve as a resource for studying the role of narrative in natural language processing across many languages, including low-resource ones. We also demonstrate how the dataset can be used to benchmark three modern multilingual models: mDistilBERT, mBERT, and XLM-RoBERTa.
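
The abstract describes stories that are indexed across languages and carry tags. As a rough illustration only, the sketch below shows one way such records could be grouped and counted. The `Story` class, its field names, and the helper functions are hypothetical and are not taken from the StoryDB release; the toy records are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Story:
    # All field names are assumptions made for this sketch.
    story_id: str   # hypothetical key shared by all language versions of a story
    language: str   # language code, e.g. "en"
    text: str
    tags: frozenset = field(default_factory=frozenset)  # e.g. genre or topic labels


def index_by_story(stories):
    """Map story_id -> {language -> Story} so versions can be looked up across languages."""
    index = defaultdict(dict)
    for story in stories:
        index[story.story_id][story.language] = story
    return dict(index)


def stories_per_language(stories):
    """Count how many stories are available in each language."""
    counts = defaultdict(int)
    for story in stories:
        counts[story.language] += 1
    return dict(counts)


def languages_meeting_minimum(stories, minimum=500):
    """Return the sorted language codes that have at least `minimum` stories."""
    counts = stories_per_language(stories)
    return sorted(lang for lang, n in counts.items() if n >= minimum)


# Toy records, invented purely to exercise the helpers.
SAMPLE = [
    Story("s1", "en", "Once upon a time ...", frozenset({"fantasy"})),
    Story("s1", "de", "Es war einmal ...", frozenset({"fantasy"})),
    Story("s2", "en", "A detective arrives ...", frozenset({"crime"})),
]

if __name__ == "__main__":
    print(index_by_story(SAMPLE)["s1"].keys())
    print(stories_per_language(SAMPLE))
    print(languages_meeting_minimum(SAMPLE, minimum=2))
```

The default `minimum=500` mirrors the per-language floor stated in the abstract; a per-language check like this is one simple way to sanity-check a local copy of a corpus against that claim.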

EMNLP (Eval4NLP) 2021

Datasets


Introduced in the Paper:

StoryDB
