The dataset was constructed in two stages: first finding suitable publications, then collecting keyphrases from manual annotators. The Google SOAP API was used to find documents using variants of the query “keywords general terms filetype:pdf”. Over 250 of the resulting PDF documents were downloaded for further processing. The documents were then manually restricted to scientific conference papers between 4 and 12 pages long. The PDF documents were converted to plain text using the PDF995 software suite, which handled two-column text better than the other programs tried. At the end of this process, 211 documents that had converted to plain text without problems were selected. The authors then recruited student volunteers from their department to assign keyphrases manually. Each volunteer was given three PDF files, with the author-assigned keyphrases hidden, and asked to assign keyphrases to them.
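
The selection criteria described above can be summarised in a short sketch. This is only an illustration, not the authors' tooling: the restriction to conference papers was done manually, and the `Candidate` record, its field names, and both helper functions are hypothetical. The batching helper also assumes that volunteers received disjoint sets of three documents, which the description does not state.

```python
from dataclasses import dataclass

# Page-length range stated in the dataset description.
MIN_PAGES, MAX_PAGES = 4, 12


@dataclass
class Candidate:
    """Hypothetical record for one downloaded PDF (field names are assumptions)."""
    name: str
    pages: int
    is_conference_paper: bool  # decided manually in the original process
    converted_ok: bool         # plain-text conversion succeeded without problems


def select_documents(candidates):
    """Keep conference papers of 4-12 pages whose text conversion succeeded."""
    return [
        c for c in candidates
        if c.is_conference_paper
        and MIN_PAGES <= c.pages <= MAX_PAGES
        and c.converted_ok
    ]


def assign_batches(documents, batch_size=3):
    """Split documents into consecutive batches of three, one per volunteer.

    Assumption: batches are disjoint. The final batch may be shorter.
    """
    return [
        documents[i:i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
```

Applied to a list of downloaded candidates, `select_documents` would drop anything outside the page range or with a failed conversion, and `assign_batches` would group the survivors into sets of three for annotation.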


