The BDCamões Collection of Portuguese Literary Documents: a Research Resource for Digital Humanities and Language Technology
This paper presents the BDCamões Collection of Portuguese Literary Documents, a new corpus of literary texts written in Portuguese. Its inaugural version includes close to 4 million words from over 200 complete documents by 83 authors in 14 genres, covering a time span from the 16th to the 21st century and adhering to different orthographic conventions. Many of the texts in the corpus have also been automatically parsed with state-of-the-art language processing tools, forming the BDCamões Treebank subcorpus. This set of characteristics makes BDCamões an invaluable resource for research in language technology (e.g. authorship detection, genre classification) and in language science and digital humanities (e.g. comparative literature, diachronic linguistics).