CuBERT

CuBERT, or Code Understanding BERT, is a BERT-based model for code understanding. To pre-train it, the authors curate a massive corpus of Python programs collected from GitHub. Because GitHub projects are known to contain a large amount of duplicated code, and to avoid biasing the model toward such duplicates, the authors deduplicate the corpus using the method of Allamanis (2018). The resulting corpus has 7.4 million files with a total of 9.3 billion tokens (16 million unique).

Source: Learning and Evaluating Contextual Embedding of Source Code
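Deduplication at this scale is usually approximate rather than exact, since files that differ only in whitespace, comments, or a few renamed tokens should still count as duplicates. The sketch below illustrates one common approach in that spirit: it compares files by the Jaccard similarity of their token sets and keeps one representative per near-duplicate group. The tokenizer, the 0.8 threshold, and all function names are illustrative assumptions, not the exact procedure of Allamanis (2018).

```python
import re

def token_set(source: str) -> set[str]:
    """Very crude lexer: extract identifier- and number-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", source))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(files: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep one representative per group of near-duplicate files.

    The 0.8 threshold is an illustrative value, not the one used in the paper.
    """
    tokens = {path: token_set(src) for path, src in files.items()}
    kept: list[str] = []
    for path in files:
        # Keep a file only if it is not a near-duplicate of anything kept so far.
        if all(jaccard(tokens[path], tokens[k]) < threshold for k in kept):
            kept.append(path)
    return kept

if __name__ == "__main__":
    corpus = {
        "a.py": "def add(x, y):\n    return x + y\n",
        "b.py": "def add(x, y):\n    return x + y  # add\n",  # near-duplicate of a.py
        "c.py": "print('hello world')\n",
    }
    print(deduplicate(corpus))  # ['a.py', 'c.py']
```

At the scale of millions of files, an actual pipeline would replace the quadratic pairwise comparison with locality-sensitive fingerprinting (e.g., MinHash), but the similarity criterion is the same idea.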

Components


Component    Type
BERT         Language Models

Categories

Language Models