XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.
15 PAPERS • NO BENCHMARKS YET
Dataset contains about 48K contracts which are open source on Etherscan.
1 PAPER • NO BENCHMARKS YET
PyTorrent contains 218,814 Python package libraries from PyPI and Anaconda environment. This is because earlier studies have shown that much of the code is redundant and Python packages from these environments are better in quality and are well-documented. PyTorrent enables users (such as data scientists, students, etc.) to build off the shelf machine learning models directly without spending months of effort on large infrastructure.
4 PAPERS • NO BENCHMARKS YET
CoDesc is a large dataset of 4.2m Java source code and parallel data of their description from code search, and code summarization studies.
4 PAPERS • 2 BENCHMARKS
This repository holds two datasets: one with both the original binaries and the code sections extracted from them (“full dataset”), and one with only the code sections (“only code sections”). The code sections were extracted by carving out sections of the binary that were marked as executable. The binaries were scraped from Debian repositories.
The CMU CoNaLa, the Code/Natural Language Challenge dataset is a joint project from the Carnegie Mellon University NeuLab and Strudel labs. Its purpose is for testing the generation of code snippets from natural language. The data comes from StackOverflow questions. There are 2379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones except that they contain a probability if the pair is valid.
66 PAPERS • 1 BENCHMARK