BillSum

Introduced by Kornilova et al. in BillSum: A Corpus for Automatic Summarization of US Legislation

BillSum is the first dataset for summarization of US Congressional and California state bills.

The BillSum dataset consists of three parts: US training bills, US test bills and California test bills. The US bills were collected from the Govinfo service provided by the United States Government Publishing Office (GPO). The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills. For California, bills from the 2015-2016 session were scraped directly from the legislature’s website; the summaries were written by their Legislative Counsel.

The BillSum corpus focuses on mid-length legislation from 5,000 to 20,000 character in length. The authors chose to measure the text length in characters, instead of words or sentences, because the texts have complex structure that makes it difficult to consistently measure words. The range was chosen because on one side, short bills introduce minor changes and do not require summaries. While the CRS produces summaries for them, they often contain most of the text of the bill. On the other side, very long legislation is often composed of several large sections.

Source: BillSum: A Corpus for Automatic Summarization of US Legislation

Homepage