EVIL-Decoders

Introduced by Liguori et al. in EVIL: Exploiting Software via Natural Language

EVIL-Decoders is an assembly dataset built on top of Shellcode_IA32, a dataset for automatically generating assembly from natural language descriptions. Shellcode_IA32 consists of 3,200 assembly instructions with English comments, collected from IA-32 shellcodes written for the Netwide Assembler (NASM) on Linux.

To make the data more representative of the code that we aim to generate (i.e., complete exploits, including the decoders delivered in the shellcode), we enriched the dataset with further samples of assembly code drawn from exploits that we collected from public databases. Unlike the previous dataset, the new one includes assembly code from real decoders used in actual exploits. The final dataset contains 3,715 unique pairs of assembly code snippets and English intents.

To better support developers in the automatic generation of assembly programs, we looked beyond a one-to-one mapping between natural language intents and their corresponding code. The dataset therefore includes 783 entries (~21% of the dataset) that are multi-line snippets, i.e., single intents that generate multiple lines of assembly code, separated by the newline character (\n). Each multi-line snippet contains between 2 and 5 assembly instructions.
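As a minimal sketch of how the multi-line snippets could be handled, the Python below splits a snippet on the \n separator and computes the share of multi-line entries in a list of pairs. It assumes the separator is stored as the literal two-character sequence backslash + n inside a single-line snippet, which the description does not confirm. The helper names and the two sample pairs are hypothetical illustrations, not entries taken from the dataset.

```python
from typing import List, Tuple

# Assumption: instructions in a multi-line snippet are joined by the literal
# two-character sequence backslash + n, not by a real line break.
SEPARATOR = "\\n"


def split_snippet(snippet: str) -> List[str]:
    """Split one snippet into its individual assembly instructions."""
    return [part.strip() for part in snippet.split(SEPARATOR) if part.strip()]


def multi_line_share(pairs: List[Tuple[str, str]]) -> float:
    """Return the fraction of (intent, snippet) pairs with more than one instruction."""
    if not pairs:
        return 0.0
    multi = sum(1 for _, snippet in pairs if len(split_snippet(snippet)) > 1)
    return multi / len(pairs)


# Hypothetical (intent, snippet) pairs, for illustration only.
pairs = [
    ("clear the eax register", "xor eax, eax"),
    ("push eax onto the stack and point ebx to the top of the stack",
     "push eax \\n mov ebx, esp"),
]

if __name__ == "__main__":
    for intent, snippet in pairs:
        print(intent, "->", split_snippet(snippet))
    print("multi-line share:", multi_line_share(pairs))
```

If the actual files store real line breaks instead, changing SEPARATOR to "\n" would be the only adjustment needed in this sketch.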
