Analysing Features Learned Using Unsupervised Models on Program Embeddings

1 Jan 2021  ·  Martina Saletta, Claudio Ferretti

In this paper, we propose a novel approach for analysing and evaluating how a deep neural network autonomously learns different program-related features from different input representations. We trained a simple autoencoder with five hidden layers on a dataset of Java programs, and we tested the ability of each of its neurons to detect different program features, using only unlabeled data during training. To this end, we designed two binary classification problems with different scopes: the first is based on the program's cyclomatic complexity, while the second is defined from the identifiers chosen by the programmers, making it more related to the program's functionality (and thus, to some extent, to its semantics) than to its structure. Using different program vector representations as input, we performed experiments on both problems, showing that some individual neurons can be effectively used as classifiers for programs on different binary tasks. We also discuss how the program representation chosen as input affects classification performance, arguing that new, customized program embeddings could be designed, guided by the proposed benchmarking approach, to obtain models able to solve different tasks.
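The core evaluation idea described above can be sketched in a few lines: train an autoencoder on unlabeled embeddings, then test whether any single hidden neuron, thresholded on its activation, separates two classes of programs. The sketch below is a minimal illustration on synthetic data, not the paper's actual pipeline: the "embeddings" are random vectors with an artificial binary property injected along one dimension, the network has one hidden layer rather than the five used in the paper, and all dimensions and hyperparameters are assumptions chosen only to keep the example small and runnable.

```python
import numpy as np

# Synthetic stand-in for program embeddings: random vectors where a
# hypothetical binary property (e.g. low vs. high cyclomatic complexity)
# leaves a trace along one coordinate. Purely illustrative data.
rng = np.random.default_rng(0)
n, d = 400, 32
labels = rng.integers(0, 2, size=n)      # hypothetical binary property
X = rng.normal(size=(n, d))
X[:, 0] += 4.0 * labels                  # the property shifts dimension 0

# A one-hidden-layer autoencoder trained by plain gradient descent on the
# mean squared reconstruction error (the paper uses a deeper network;
# one layer keeps the sketch short). Training uses no labels.
h, lr = 8, 0.05
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)             # hidden activations
    err = (H @ W2 + b2) - X              # reconstruction error
    gW2 = H.T @ err / n; gb2 = err.mean(0)
    dH = (err @ W2.T) * (1 - H ** 2)     # backprop through tanh
    gW1 = X.T @ dH / n; gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# Evaluate each hidden neuron as a binary classifier: threshold its
# activation at the median and take the better of the two orientations.
H = np.tanh(X @ W1 + b1)
best_acc = 0.0
for j in range(h):
    pred = (H[:, j] > np.median(H[:, j])).astype(int)
    acc = max((pred == labels).mean(), ((1 - pred) == labels).mean())
    best_acc = max(best_acc, acc)

print(f"best single-neuron accuracy: {best_acc:.2f}")
```

If one neuron's best accuracy is well above chance (0.5), that neuron has, without ever seeing a label, learned to encode the property defining the classification task, which is the effect the paper measures on real program embeddings.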
