Neural program embeddings have shown much promise recently for a variety of
program analysis tasks, including program synthesis, program repair, fault
localization, etc. However, most existing program embeddings are based on
syntactic features of programs, such as raw token sequences or abstract syntax
trees. Unlike images and text, a program has an unambiguous semantic meaning
that can be difficult to capture by only considering its syntax (i.e.,
syntactically similar programs can exhibit vastly different run-time
behavior), which makes syntax-based program embeddings fundamentally limited.
This paper proposes a novel semantic program embedding that is learned from
program execution traces. Our key insight is that program states expressed as
sequential tuples of live variable values not only capture program semantics
more precisely, but also offer a more natural fit for Recurrent Neural Networks
to model. We evaluate different syntactic and semantic program embeddings on
predicting the types of errors that students make in their submissions to an
introductory programming class and two exercises on the CodeHunt education
platform. Evaluation results show that our new semantic program embedding
significantly outperforms the syntactic program embeddings based on token
sequences and abstract syntax trees. In addition, we augment a search-based
program repair system with the predictions obtained from our semantic
embedding, and show that search efficiency is also significantly improved.
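
As a rough illustration of the idea of program states as sequential tuples of variable values, the following Python sketch records such a trace for two syntactically similar functions. It is not the paper's implementation: the helper `record_states` and the example functions `sum_inclusive` and `sum_exclusive` are hypothetical names introduced here, all local variables are recorded rather than only live ones, and unassigned variables are shown as `None`.

```python
import sys


def record_states(func, *args):
    """Run func(*args); return (result, list of variable-value tuples).

    Hypothetical helper: a simplification that snapshots all local
    variables of `func`, not only the live ones.
    """
    states = []
    code = func.__code__
    names = code.co_varnames  # fixed variable order for every tuple

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the traced function.
        if frame.f_code is not code:
            return None
        if event in ("line", "return"):
            snapshot = tuple(frame.f_locals.get(n) for n in names)
            # Keep a snapshot only when the program state changed.
            if not states or states[-1] != snapshot:
                states.append(snapshot)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, states


def sum_inclusive(n):
    total = 0
    i = 1
    while i <= n:
        total += i
        i += 1
    return total


def sum_exclusive(n):
    # Differs from sum_inclusive by a single token: `<` instead of `<=`.
    total = 0
    i = 1
    while i < n:
        total += i
        i += 1
    return total


if __name__ == "__main__":
    for f in (sum_inclusive, sum_exclusive):
        result, states = record_states(f, 3)
        print(f.__name__, result, states)
```

The two functions differ by one token, so their token sequences and syntax trees are nearly identical, yet their state sequences (tuples ordered as `n`, `total`, `i`) end in different states. A sequence of this kind is the sort of input a recurrent model could consume one tuple at a time.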