Inspired by Wang et al. 2021, we decided to utilize the top-voted and well-documented Kaggle notebooks to construct the notebookCDGdataset
We collected the top 10% highly-voted notebooks from the top 20 popular competitions on Kaggle (e.g. Titanic). We checked the data policy of each of the 20 competitions, none of them has copyright issues. We also contacted the Kaggle administrators to make sure our data collection complies with the platform’s policy.
In total, we collected 3,944 notebooks as raw data. After data preprocessing, the final dataset contains 2,476 notebooks out of the 3,944 notebooks from the raw data. It has 28,625 code–documentation pairs. The overall code-to-markdown ratio is 2.2195