The CodeSearchNet Corpus is a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. The CodeSearchNet Corpus includes:
* Six million methods overall
* Two million of which have associated documentation (docstrings, JavaDoc, and more)
* Metadata that indicates the original location (repository or line number, for example) where the data was found
154 PAPERS • 12 BENCHMARKS
PyTorrent contains 218,814 Python package libraries from the PyPI and Anaconda environments. These sources were chosen because earlier studies have shown that much of the code is redundant, while Python packages from these environments are of better quality and well documented. PyTorrent enables users (such as data scientists and students) to build off-the-shelf machine learning models directly without spending months of effort on large infrastructure.
3 PAPERS • NO BENCHMARKS YET
The Java dataset introduced in DeepCom (Deep Code Comment Generation), commonly used to evaluate automated code summarization.
2 PAPERS • 1 BENCHMARK
Inspired by Wang et al. (2021), we decided to use top-voted, well-documented Kaggle notebooks to construct the notebookCDG dataset.
2 PAPERS • NO BENCHMARKS YET
The Java dataset introduced in Hybrid-DeepCom (Deep code comment generation with hybrid lexical and syntactical information), commonly used to evaluate automated code summarization. It is a later version of the DeepCom-Java dataset.
1 PAPER • 1 BENCHMARK