Shellcode_IA32

Introduced by Liguori et al. in Shellcode_IA32: A Dataset for Automatic Shellcode Generation

Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources is the largest collection of shellcodes in assembly available to date.

This dataset consists of 3,200 examples of instructions in assembly language for IA-32 (the 32-bit version of the x86 Intel Architecture) from publicly available security exploits. We collected assembly programs used to generate shellcode from exploit-db and from shell-storm. We enriched the dataset by adding examples of assembly programs for the IA-32 architecture from popular tutorials and books. This allowed us to understand how different authors and assembly experts comment and, thus, how to deal with the ambiguity of natural language in this specific context. Our dataset consists of 10% of instructions collected from books and guidelines, and the rest from real shellcodes.

Our focus is on Linux, the most common OS for security-critical network services. Accordingly, we added assembly instructions written with Netwide Assembler (NASM) for Linux.

Each line of Shellcode_IA32 dataset represents a snippet - intent pair. The snippet is a line or a combination of multiple lines of assembly code, built by following the NASM syntax. The intent is a comment in the English language.

Further statistics on the dataset and a set of preliminary experiments performed with a neural machine translation (NMT) model are described in the following paper: Shellcode_IA32: A Dataset for Automatic Shellcode Generation.

Homepage