SentimentArcs’ reference corpus for novels consists of 25 narratives selected to form a diverse set of well-recognized novels that can serve as a benchmark for future studies. The composition of the corpus was limited by copyright law as well as by historical imbalances. Most works were obtained from the US and Australian Project Gutenberg sites. The corpus is expected to grow in size and diversity over time.

Several dimensions of diversity were considered for inclusion, including popularity, period, genre, topic, style and author diversity. The first version of our corpus includes only English-language texts, although Proust and Homer are included in translation. SentimentArcs has processed a larger set of novels, including some in other languages. The initial reference corpus is in English because performance across all ensemble models was uneven in less-resourced languages.

In sum, the corpus includes (1) the two most popular novels on Project Gutenberg (Project Gutenberg, 2021b), (2) eight of the fifteen most assigned novels at top US universities (EAB, 2021), and (3) three works that have sold over 20 million copies (Books, 2021). There are eight works by women, two by African-Americans and five by two LGBTQ authors. Britain leads with 15 authors, followed by 6 Americans and one each from France, Russia, North Africa and Ancient Greece.





