This repository holds two datasets: one with both the original binaries and the code sections extracted from them (“full dataset”), and one with only the code sections (“only code sections”). The code sections were extracted by carving out sections of the binary that were marked as executable. The binaries were scraped from Debian repositories.
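The carving step described above can be illustrated with a short Python sketch. It assumes the binaries are ELF files and treats a section as code when its header carries the SHF_EXECINSTR flag; the function name carve_code_sections is hypothetical, and this is not necessarily how the dataset authors implemented the extraction. Extended section numbering (more than 65 279 sections) is not handled.

```python
import struct

SHF_EXECINSTR = 0x4  # section flag: contains executable machine instructions
SHT_NOBITS = 8       # section type: occupies no space in the file


def carve_code_sections(data):
    """Return the concatenated bytes of all executable sections of an ELF image.

    Illustrative sketch only: assumes a well-formed ELF file held in memory.
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = data[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big

    # Locate the section header table (offsets differ between ELF32 and ELF64).
    if is_64:
        (shoff,) = struct.unpack_from(endian + "Q", data, 0x28)
        shentsize, shnum = struct.unpack_from(endian + "HH", data, 0x3A)
        entry_format = endian + "IIQQQQ"
    else:
        (shoff,) = struct.unpack_from(endian + "I", data, 0x20)
        shentsize, shnum = struct.unpack_from(endian + "HH", data, 0x2E)
        entry_format = endian + "IIIIII"

    code = bytearray()
    for index in range(shnum):
        base = shoff + index * shentsize
        # Fields read: sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size.
        _, sh_type, flags, _, offset, size = struct.unpack_from(entry_format, data, base)
        if flags & SHF_EXECINSTR and sh_type != SHT_NOBITS:
            code += data[offset:offset + size]
    return bytes(code)
```

Calling carve_code_sections(open(path, "rb").read()) on a binary would then yield bytes comparable to the contents of the corresponding “.code” file, provided the assumptions above hold.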
Two CSV files are also available, one computed from the full binaries and one from only the code sections. They contain the 293 features extracted from about 3000 binaries per architecture. These features can be used to train classifiers.
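As a rough illustration of training a classifier on such features, the following Python sketch loads a feature CSV and fits a nearest-centroid classifier using only the standard library. It assumes the CSV has a header row and a label column named "architecture"; that column name, the function names, and the choice of classifier are assumptions made for illustration, not details confirmed by the dataset.

```python
import csv
import math
from collections import defaultdict


def _is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def load_features(csv_path, label_column="architecture"):
    """Read numeric feature vectors and labels from a CSV file with a header row.

    label_column is a hypothetical name; adjust it to the actual CSV header.
    Columns that are not numeric in the first row are ignored.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        return [], []
    numeric = [c for c in rows[0] if c != label_column and _is_number(rows[0][c])]
    features = [[float(row[c]) for c in numeric] for row in rows]
    labels = [row[label_column] for row in rows]
    return features, labels


def train_centroids(features, labels):
    """Compute the mean feature vector (centroid) of each label."""
    sums, counts = {}, defaultdict(int)
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))
```

A typical use would be features, labels = load_features("features.csv") (a hypothetical file name), followed by train_centroids(features, labels) and predict(...) on held-out rows. A nearest-centroid model is chosen here only because it needs no external libraries.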
The dataset consists of thousands of binaries for the following 23 architectures: alpha, amd64, arm64, armel, armhf, hppa, i386, ia64, m68k, mips, mips64el, mipsel, powerpc, powerpcspe, powerpc64, powerpc64el, riscv, s390, s390x, sh4, sparc, sparc64 and x32.
In total there are 98 500 binary files: about 27 gigabytes (uncompressed) of full binaries and about 15 gigabytes (uncompressed) of code sections extracted from those binaries.
Both datasets hold the binaries in directories named after the architecture. Each file inside a directory is named by the MD5 hash of the original binary file, and a file whose name is that hash followed by “.code” contains only the concatenation of all code sections of the original binary file. Each architecture directory also holds a JSON file named after the architecture, e.g. amd64 holds amd64.json. The structure of the JSON file is as follows (described in a JSON Schema-like notation)
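Independently of the JSON metadata, the directory layout described above (architecture directories, files named by MD5 hash, and “.code” companions) can be walked with a short Python sketch. The function names are hypothetical, and the sketch assumes that original binaries carry no file extension; the per-architecture JSON file is skipped rather than parsed.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_samples(dataset_root):
    """Yield (architecture, hash_name, binary_path, code_path) tuples.

    binary_path is None when only the code sections are present, and
    code_path is None when no ".code" file exists for that hash.
    """
    root = Path(dataset_root)
    for arch_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        names = set()
        for entry in arch_dir.iterdir():
            # Skip subdirectories and the per-architecture JSON metadata file.
            if not entry.is_file() or entry.suffix == ".json":
                continue
            if entry.name.endswith(".code"):
                names.add(entry.name[:-len(".code")])
            else:
                names.add(entry.name)
        for name in sorted(names):
            binary = arch_dir / name
            code = arch_dir / (name + ".code")
            yield (arch_dir.name, name,
                   binary if binary.is_file() else None,
                   code if code.is_file() else None)


def verify_names(dataset_root):
    """Return paths of binaries whose file name differs from their MD5 digest."""
    return [binary for _, name, binary, _ in iter_samples(dataset_root)
            if binary is not None and md5_of_file(binary) != name]
```

On the full dataset, verify_names(root) should return an empty list if every file name matches its content hash; on the code-sections-only dataset there are no original binaries to check.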
This work builds on John Clemens, 2015, “Automatic classification of object code using machine learning”, and Pietro De Nicolao et al., 2018, “ELISA: ELiciting ISA of Raw Binaries for Fine-Grained Code and Data Separation”.
This dataset is released as part of the following papers:
Sami Kairajärvi, Andrei Costin, and Timo Hämäläinen. 2020. ISAdetect: Usable automated detection of ISA (CPU architecture and endianness) for executable binary files and object code. In Tenth ACM Conference on Data and Application Security and Privacy (CODASPY’20), March 16–18, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3374664.3375742
Kairajärvi, Sami, Andrei Costin, and Timo Hämäläinen. "Towards usable automated detection of CPU architecture and endianness for arbitrary binary files and object code sequences." arXiv preprint arXiv:1908.05459 (2019).
Kairajärvi, Sami. "Automatic identification of architecture and endianness using binary file contents." (2019).
The code associated with this dataset can be found at https://github.com/kairis/isadetect
Changelog:
version 6 - 29.3.2020
Add Weka models
version 5 - 17.1.2020
Clean up dataset
version 4 - 13.1.2020