SubSumE Dataset

This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on how the dataset was created. Also check out SuDocu, our related work on example-based document summarization.

Dataset Files

Download the dataset from here.

The dataset contains:

  • Simplified text from the Wikipedia pages of 48 US states. All sentences from these documents are also collected in a single file, processed_state_sentences.csv, where each sentence is assigned a unique sentence id that the summary json files refer to.
  • Intent-based summaries created by human annotators.

Each datapoint file in the directory user_summary_jsons holds a json with summaries of the Wikipedia pages of eight states. It has the following keys:

  • intent: Summarization intent given to the human annotator for generating the summaries
  • summaries: List of summary jsons, one for each of the eight states assigned to the annotator. Each json in the list contains the following keys:
    • state_name: Name of the state
    • sentence_ids: Global ids of the sentences in the summary, as assigned in processed_state_sentences.csv
    • sentences: List of sentences present in the summary
    • use_keywords: Keywords used by the annotator to search the document when creating summaries
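
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the official dataset tooling: the helper names (load_datapoint, sentence_ids_by_state, iter_datapoints) are hypothetical, and it assumes that each datapoint file is a single json object with the keys listed above and that the files carry a .json extension.

```python
import json
from pathlib import Path


def load_datapoint(path):
    """Read one datapoint file and return (intent, summaries).

    Assumes the file holds one json object with the keys
    "intent" and "summaries", as described in this README.
    """
    with open(path, encoding="utf-8") as f:
        datapoint = json.load(f)
    return datapoint["intent"], datapoint["summaries"]


def sentence_ids_by_state(summaries):
    """Map each state_name to the sentence_ids of its summary."""
    return {s["state_name"]: s["sentence_ids"] for s in summaries}


def iter_datapoints(directory="user_summary_jsons"):
    """Yield (path, intent, summaries) for every *.json file found.

    The .json extension is an assumption; adjust the glob pattern
    if the files are named differently.
    """
    for path in sorted(Path(directory).glob("*.json")):
        intent, summaries = load_datapoint(path)
        yield path, intent, summaries
```

For example, looping over iter_datapoints() and calling sentence_ids_by_state(summaries) gives, for each annotator file, the ids that can be looked up in processed_state_sentences.csv.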


