A blackout poetry dataset constructed from publicly available short stories and long poems. The dataset comes in two variants, one with 8K examples and one with 16K examples. Each example contains a passage, a poem generated from that passage, and the indices of the passage words that were selected to form the poem. Each poem also has a perplexity score that indicates its language quality.
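To make the relationship between passage, indices, and poem concrete, here is a minimal sketch of how a poem could be recovered from one example. The field names (`passage`, `indices`), the whitespace tokenization, and the 0-based indexing are assumptions for illustration; check the actual files for the real column names and indexing convention. The example passage is made up and not taken from the dataset.

```python
from __future__ import annotations


def reconstruct_poem(passage: str, indices: list[int]) -> str:
    """Join the passage words at the given positions into a poem.

    Assumes whitespace tokenization and 0-based indices (both hypothetical).
    """
    words = passage.split()
    return " ".join(words[i] for i in indices)


def blackout(passage: str, indices: list[int], mask: str = "#") -> str:
    """Render the passage with every unselected word masked out."""
    keep = set(indices)
    return " ".join(
        word if i in keep else mask * len(word)
        for i, word in enumerate(passage.split())
    )


# Toy example, not a real record from the dataset.
example = {
    "passage": "the quiet river carries old light toward the sea",
    "indices": [1, 4, 5, 8],
}

poem = reconstruct_poem(example["passage"], example["indices"])
masked = blackout(example["passage"], example["indices"])
print(poem)
print(masked)
```

If the stored indices turn out to be 1-based, or the passage was tokenized differently, the reconstructed poem will not match the stored one, which makes this a quick sanity check when loading the data.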

The dataset was constructed synthetically, so it contains many poor-quality poems and frequent grammatical errors. Even so, it is a useful starting point for applying machine learning to blackout poetry generation.
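Because poem quality varies, one option is to use the perplexity scores to drop the weakest poems before training. The sketch below assumes each example is a dictionary with a `perplexity` field and that lower perplexity means more fluent text, which is the usual convention but is not confirmed by this description. The threshold and the scores shown are arbitrary toy values, not figures from the dataset.

```python
def filter_by_perplexity(examples, max_perplexity):
    """Keep only examples whose perplexity is at or below the threshold.

    Assumes a hypothetical 'perplexity' key and that lower is better.
    """
    return [ex for ex in examples if ex["perplexity"] <= max_perplexity]


# Toy records with made-up scores, for illustration only.
examples = [
    {"poem": "quiet old light sea", "perplexity": 85.2},
    {"poem": "the of carries toward", "perplexity": 912.7},
    {"poem": "river carries light", "perplexity": 140.0},
]

kept = filter_by_perplexity(examples, max_perplexity=200.0)
print([ex["poem"] for ex in kept])
```

A sensible threshold depends on the score distribution in the variant you use, so it is worth inspecting the scores before picking a cutoff.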


  • CC0: Public Domain