# BANANA: a Benchmark for the Assessment of Neural Architectures for Nucleic Acids

29 Sep 2021

Machine learning has always played an important role in bioinformatics, and recent applications of deep learning have made it possible to solve a new spectrum of biologically relevant tasks. However, there is still a gap between the "mainstream" AI and bioinformatics communities. This is partially due to the format of bioinformatics data, which is typically difficult to process and adapt to machine learning tasks without deep domain knowledge. Moreover, the lack of standardized evaluation methods makes it difficult to rigorously compare different models and assess their true performance. To help bridge this gap, and inspired by work such as SuperGLUE and TAPE, we present BANANA, a benchmark consisting of six supervised classification tasks designed to assess language model performance in the DNA and RNA domains. The tasks are defined over three genomics languages and one transcriptomics language (human DNA, the bacterial 16S gene, the Nematoda ITS2 gene, and human mRNA) and measure a model's ability to perform whole-sequence classification in a variety of setups. Each task was built from readily available data and is presented in a ready-to-use format, with defined labels, splits, and evaluation metrics. We use BANANA to test state-of-the-art NLP architectures, such as Transformer-based models, observing that, in general, self-supervised pretraining without external corpora is beneficial in every task.
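To make the whole-sequence classification setup concrete, here is a minimal sketch, not taken from the paper: a nucleotide sequence is tokenized into overlapping k-mers (the usual "words" fed to Transformer-style genomic language models) and mapped to a single label for the entire sequence. The function names and the toy GC-content label are illustrative assumptions, standing in for a real BANANA task.

```python
# Illustrative sketch only: k-mer tokenization plus a toy
# whole-sequence label. The GC-content rule is a hypothetical
# stand-in for an actual benchmark task's labels.

def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split a DNA/RNA sequence into overlapping k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gc_label(seq: str, threshold: float = 0.5) -> int:
    """Toy binary label for the whole sequence:
    1 if GC content exceeds the threshold, else 0."""
    seq = seq.upper()
    gc = sum(base in "GC" for base in seq)
    return int(gc / len(seq) > threshold)

# A model under this setup would consume the token list and
# predict a single class for the sequence.
tokens = kmer_tokenize("ATGCGC")   # ['ATG', 'TGC', 'GCG', 'CGC']
label = gc_label("ATGCGC")         # GC fraction 4/6 > 0.5 -> 1
```

In a benchmark like BANANA, the labels, train/validation/test splits, and evaluation metric are fixed per task, so any model that maps a token sequence to a class prediction can be compared directly.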


## Code

No code implementations yet.
