BANANA: a Benchmark for the Assessment of Neural Architectures for Nucleic Acids

29 Sep 2021 · Luca Salvatore Lorello, Andrea Galassi, Paolo Torroni

Machine learning has always played an important role in bioinformatics, and recent applications of deep learning have made it possible to solve a new spectrum of biologically relevant tasks. However, there is still a gap between the "mainstream" AI and bioinformatics communities. This is partially due to the format of bioinformatics data, which are typically difficult to process and adapt to machine learning tasks without deep domain knowledge. Moreover, the lack of standardized evaluation methods makes it difficult to rigorously compare different models and assess their true performance. To help bridge this gap, and inspired by work such as SuperGLUE and TAPE, we present BANANA, a benchmark consisting of six supervised classification tasks designed to assess language model performance in the DNA and RNA domains. The tasks are defined over three genomics languages and one transcriptomics language (human DNA, bacterial 16S gene, nematoda ITS2 gene, human mRNA) and measure a model's ability to perform whole-sequence classification in a variety of setups. Each task was built from readily available data and is presented in a ready-to-use format, with defined labels, splits, and evaluation metrics. We use BANANA to test state-of-the-art NLP architectures, such as Transformer-based models, observing that, in general, self-supervised pretraining without external corpora is beneficial in every task.
