CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https: // .
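The identifier-aware objective described above (deciding which code tokens are identifiers, then recovering them when masked) can be illustrated with a small sketch. This is not the authors' implementation: it uses Python's standard `tokenize` and `keyword` modules as a stand-in for whatever tokenizer and identifier extraction the paper uses. The sentinel names (`<MASK0>`, `<MASK1>`, ...), the helper names `tag_identifiers` and `mask_identifiers`, and the choice to reuse one sentinel for every occurrence of the same identifier are all assumptions made for illustration.

```python
import io
import keyword
import tokenize

# Token types that carry layout only and are skipped in this sketch.
_LAYOUT = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT)


def tag_identifiers(source):
    """Return (token_text, is_identifier) pairs for a Python snippet.

    Assumption: an identifier is any NAME token that is not a hard keyword.
    Soft keywords and builtins such as `print` are therefore tagged too.
    """
    tagged = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue
        is_identifier = (tok.type == tokenize.NAME
                         and not keyword.iskeyword(tok.string))
        tagged.append((tok.string, is_identifier))
    return tagged


def mask_identifiers(tagged):
    """Mask identifiers; return (masked_tokens, target_tokens).

    Hypothetical scheme: each distinct identifier gets one sentinel that is
    reused for all of its occurrences, and the target lists every sentinel
    followed by the name it hides.
    """
    sentinels = {}
    masked = []
    for text, is_identifier in tagged:
        if is_identifier:
            if text not in sentinels:
                sentinels[text] = f"<MASK{len(sentinels)}>"
            masked.append(sentinels[text])
        else:
            masked.append(text)
    target = []
    for name, sentinel in sentinels.items():
        target.extend([sentinel, name])
    return masked, target


snippet = "def add(a, b):\n    return a + b\n"
tagged = tag_identifiers(snippet)
masked, target = mask_identifiers(tagged)
print(" ".join(masked))   # masked input sequence
print(" ".join(target))   # sequence a decoder would be asked to produce
```

In this toy setup the tagging step corresponds to distinguishing identifiers from other tokens, and the masked/target pair corresponds to recovering them; a real pipeline would operate on subword tokens and on several programming languages.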

EMNLP 2021
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Clone Detection | CodeXGLUE - BigCloneBench | CodeT5 | F1 | 97.2 | #1 |
| Code Summarization | CodeXGLUE - CodeSearchNet | CodeT5 | Ruby | 15.73 | #1 |
| | | | JavaScript | 16.24 | #3 |
| | | | Go | 19.76 | #1 |
| | | | Python | 20.36 | #3 |
| | | | Java | 20.46 | #4 |
| | | | PHP | 26.53 | #2 |
| Code Translation | CodeXGLUE - CodeTrans | CodeT5 | BLEU (Java→C#) | 84.03 | #1 |
| | | | Accuracy (Java→C#) | 65.90 | #1 |
| | | | BLEU (C#→Java) | 79.87 | #1 |
| | | | Accuracy (C#→Java) | 66.90 | #1 |
| Text-to-Code Generation | CodeXGLUE - CONCODE | CodeT5 | EM | 22.70 | #1 |
| | | | BLEU | 41.48 | #1 |
| | | | CodeBLEU | 44.10 | #1 |
| Defect Detection | CodeXGLUE - Devign | CodeT5 | Accuracy | 65.78 | #1 |
| Code Generation | CONCODE | CodeT5 | Exact Match | 22.70 | #2 |
| | | | BLEU | 41.48 | #2 |
| | | | CodeBLEU | 44.10 | #1 |