Code Search
58 papers with code • 7 benchmarks • 14 datasets
The goal of Code Search is to retrieve code fragments from a large code corpus that most closely match a developer’s intent, which is expressed in natural language.
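A minimal sketch of the retrieval setup, using a TF-IDF bag-of-words baseline over a toy corpus (the snippets and query below are invented for illustration); the neural approaches listed further down replace the vectorizer with learned query and code encoders.

```python
# Minimal code-search sketch: rank code snippets against a natural-language
# query with a TF-IDF baseline. The corpus and query are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "def read_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def bubble_sort(items):\n    for i in range(len(items)):\n        for j in range(len(items) - 1 - i):\n            if items[j] > items[j + 1]:\n                items[j], items[j + 1] = items[j + 1], items[j]",
    "def http_get(url):\n    import urllib.request\n    return urllib.request.urlopen(url).read()",
]
query = "load a json file from disk"

vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
code_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Rank snippets by cosine similarity to the query.
scores = cosine_similarity(query_vector, code_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}\n{corpus[idx]}\n")
```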
Most implemented papers
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from the CodeSearchNet Corpus.
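A sketch of loading the corpus for experimentation; the dataset id `code_search_net` and the column names below assume the Hugging Face Hub mirror and may differ if the hosting or schema changes.

```python
# Sketch: load the Python split of the CodeSearchNet Corpus from the
# Hugging Face Hub. Dataset id and column names assume the "code_search_net"
# mirror; pass trust_remote_code=True if the loader asks for it.
from datasets import load_dataset

dataset = load_dataset("code_search_net", "python", split="train")

example = dataset[0]
print(example["func_name"])
print(example["func_documentation_string"])  # docstring, used as a query proxy
print(example["func_code_string"])           # the code body to retrieve
```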
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Benchmark datasets have a significant impact on accelerating research in programming language tasks.
When Deep Learning Met Code Search
Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective than more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.
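The supervised setups compared here pair a natural-language query (typically the docstring) with its code. A minimal sketch of such a "simple network" is a two-tower bag-of-words model trained so that matching pairs score higher than mismatched ones; the architecture, sizes, and margin are illustrative, not the paper's exact configuration.

```python
# Sketch of a simple two-tower (query encoder / code encoder) model trained
# on docstring-code pairs; hyperparameters and architecture are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsEncoder(nn.Module):
    """Averages token embeddings and projects them into a shared space."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, dim, mode="mean", padding_idx=0)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); index 0 is reserved for padding.
        return F.normalize(self.proj(self.embedding(token_ids)), dim=-1)

query_encoder = BagOfWordsEncoder(vocab_size=10_000)
code_encoder = BagOfWordsEncoder(vocab_size=10_000)
optimizer = torch.optim.Adam(
    list(query_encoder.parameters()) + list(code_encoder.parameters()), lr=1e-3
)

# One training step on a toy batch of aligned (docstring, code) pairs.
query_ids = torch.randint(1, 10_000, (8, 16))  # 8 docstrings, 16 tokens each
code_ids = torch.randint(1, 10_000, (8, 64))   # the 8 matching code snippets
q = query_encoder(query_ids)
c = code_encoder(code_ids)

pos = (q * c).sum(dim=-1)                         # similarity of matching pairs
neg = (q * c.roll(shifts=1, dims=0)).sum(dim=-1)  # similarity to mismatched code
loss = F.relu(0.05 + neg - pos).mean()            # margin ranking loss
loss.backward()
optimizer.step()
```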
CoNCRA: A Convolutional Neural Network Code Retrieval Approach
We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA).
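A compact sketch of a convolutional encoder over code-token embeddings, the kind of building block CNN-based retrieval models such as CoNCRA rely on; the vocabulary size, dimensions, and kernel size are illustrative rather than the paper's settings.

```python
# Sketch of a convolutional encoder that maps a sequence of code tokens to a
# fixed-size vector via 1D convolution and max-pooling; sizes are illustrative.
import torch
import torch.nn as nn

class ConvCodeEncoder(nn.Module):
    def __init__(self, vocab_size: int = 10_000, dim: int = 128, kernel_size: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)             # (batch, seq_len, dim)
        x = self.conv(x.transpose(1, 2))          # (batch, dim, seq_len)
        return torch.relu(x).max(dim=-1).values   # max-pool over positions

encoder = ConvCodeEncoder()
vectors = encoder(torch.randint(1, 10_000, (4, 50)))  # 4 snippets -> (4, 128)
```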
DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks.
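The deobfuscation objective replaces identifier names with placeholders and trains the model to recover the original names. Below is a rough sketch of building one such training pair for Python code with the standard `ast` module; it is a simplification of the paper's tooling (it only handles function names, arguments, and local variables).

```python
# Rough sketch of a DOBF-style training pair: replace function and variable
# names with FUNC_i / VAR_i placeholders and keep the mapping as the target.
import ast

class Obfuscator(ast.NodeTransformer):
    """Replaces function and variable names with FUNC_i / VAR_i placeholders."""
    def __init__(self):
        self.mapping = {}

    def _placeholder(self, name: str, prefix: str) -> str:
        if name not in self.mapping:
            index = sum(v.startswith(prefix) for v in self.mapping.values())
            self.mapping[name] = f"{prefix}_{index}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name, "FUNC")
        return self.generic_visit(node)

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id, "VAR")
        return node

source = "def area(width, height):\n    result = width * height\n    return result"
obfuscator = Obfuscator()
obfuscated = ast.unparse(obfuscator.visit(ast.parse(source)))  # Python 3.9+

print(obfuscated)          # def FUNC_0(VAR_0, VAR_1): ...
print(obfuscator.mapping)  # ground-truth names the model is trained to recover
```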
Memorization and Generalization in Neural Code Intelligence Models
The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models.
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
Furthermore, we propose to utilize multi-modal contents to learn representations of code fragments with contrastive learning, and then align representations among programming languages using a cross-modal generation task.
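The contrastive objective pulls a code fragment's representation toward its paired natural-language description and pushes it away from the other examples in the batch. Below is a minimal sketch of a symmetric in-batch InfoNCE-style loss over already-encoded embeddings; it is not UniXcoder's exact formulation, and the batch size, dimension, and temperature are illustrative.

```python
# Minimal sketch of in-batch contrastive (InfoNCE-style) alignment between
# code embeddings and natural-language embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(code_emb, text_emb, temperature=0.05):
    """Symmetric in-batch InfoNCE: the i-th code fragment pairs with the i-th text."""
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy batch: 8 already-encoded code fragments and their paired comments.
code_emb = torch.randn(8, 256, requires_grad=True)
text_emb = torch.randn(8, 256, requires_grad=True)
loss = contrastive_loss(code_emb, text_emb)
loss.backward()
```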
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
To address these limitations, we propose CodeT5+, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
SANTA proposes two pretraining methods to make language models structure-aware and learn effective representations for structured data: 1) Structured Data Alignment, which utilizes the natural alignment relations between structured and unstructured data for structure-aware pretraining; and 2) Masked Entity Prediction, which masks entities in the structured data and trains the model to fill them in.
MELT: Mining Effective Lightweight Transformations from Pull Requests
By leveraging code examples mined from the library source and automatically generated code examples based on the pull requests, we infer transformation rules in Comby, a language for structural code search and replace.
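Comby templates use `:[hole]` placeholders that match syntactic fragments such as argument lists. The snippet below sketches applying one hypothetical rewrite rule from Python via the comby command-line tool; the rule itself is invented for illustration, not one mined by MELT.

```python
# Sketch: apply a comby structural search-and-replace rule to all .java files
# in the current directory. The rule is a made-up example; ":[...]" holes
# match syntactic fragments such as argument lists.
import subprocess

match_template = "assertEquals(:[expected], :[actual])"
rewrite_template = "assertThat(:[actual]).isEqualTo(:[expected])"

subprocess.run(
    ["comby", match_template, rewrite_template, ".java", "-i"],  # -i edits files in place
    check=True,
)
```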