DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

26 Jun 2023 · Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, Han Liu

Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundation models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely relied on k-mer tokenization, which represents the genome as overlapping, fixed-length substrings of A, T, C, and G, due to its simplicity. However, we argue that the computational and sample inefficiencies introduced by k-mer tokenization are primary obstacles to developing large genome foundation models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segments in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adopts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification benchmark that aggregates $36$ distinct datasets across $9$ tasks, with input lengths ranging from $70$ to $10000$. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves performance comparable to the state-of-the-art model with $21\times$ fewer parameters and approximately $92\times$ less GPU time in pre-training.
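To make the tokenization argument concrete, below is a minimal, self-contained sketch (not the authors' code; all function names and the toy corpus are illustrative) contrasting overlapping k-mer tokenization with BPE, which iteratively merges the most frequent adjacent token pair in the corpus:

```python
# Illustrative sketch of k-mer tokenization vs. Byte Pair Encoding (BPE) on DNA.
# Not the DNABERT-2 training code; names and corpus are made up for this example.
from collections import Counter

def kmer_tokenize(seq, k=3):
    # Overlapping k-mers: adjacent tokens share k-1 characters, so a sequence
    # of length n yields n-k+1 highly redundant tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def learn_bpe(sequences, num_merges):
    """Learn BPE merge rules from a corpus of DNA strings."""
    # Start with each sequence as a list of single-nucleotide tokens.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        corpus = [_merge(toks, best) for toks in corpus]
    return merges

def _merge(toks, pair):
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            out.append(toks[i] + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def bpe_tokenize(seq, merges):
    """Apply learned merges, in order, to a new sequence."""
    toks = list(seq)
    for pair in merges:
        toks = _merge(toks, pair)
    return toks

corpus = ["ATGCGATATGC", "TATAATGCGC", "ATGATGATG"]
merges = learn_bpe(corpus, num_merges=5)
print(kmer_tokenize("ATGCGTATAATG", k=3))   # overlapping, fixed-length tokens
print(bpe_tokenize("ATGCGTATAATG", merges)) # non-overlapping, variable-length tokens
```

Note how the k-mer output inflates the sequence and leaks most of each token's content into its neighbors, whereas BPE compresses frequent segments into single, non-overlapping tokens, which is the efficiency property the abstract refers to.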


Datasets


Introduced in the Paper:

GUE

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Core Promoter Detection | GUE | DNABERT-2-117M | MCC | 70.52 | #1 |
| Transcription Factor Binding Site Prediction (Mouse) | GUE | DNABERT-2-117M | MCC | 67.99 | #1 |
| Transcription Factor Binding Site Prediction (Human) | GUE | DNABERT-2-117M | MCC | 70.10 | #1 |
| Covid Variant Prediction | GUE | DNABERT-2-117M | Avg F1 | 71.02 | #1 |
| Epigenetic Marks Prediction | GUE | DNABERT-2-117M | MCC | 55.98 | #1 |
| Splice Site Prediction | GUE | DNABERT-2-117M | MCC | 84.99 | #1 |
| Promoter Detection | GUE | DNABERT-2-117M | MCC | 84.21 | #1 |
