Multi-modal Self-supervised Pre-training for Large-scale Genome Data

NeurIPS Workshop AI4Scien 2021 · Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Yanyan Lan, Zhiqiang Shen, Eric Xing

Open genomic regions, being accessible to regulatory proteins, can act as the on/off switch or amplifier/attenuator of gene expression, and thus reflect the defining characteristics of cell types. Many previous models make predictions from the sequence to the regulatory region, but the interaction between regulatory regions and genes can be complex and can differ between cell types. Moreover, current models usually perform well only on the cell types in the training set and do not generalize to data-scarce scenarios. In this work, we propose a simple yet effective approach for pre-training on genome data in a multi-modal and self-supervised manner, which we call GeneBERT. Specifically, we simultaneously take the 1d sequence of genome data and a 2d matrix of (transcription factors × regions) as the input, and we propose three pre-training tasks to improve the robustness and generalizability of our model. We pre-train our model on the ATAC-seq dataset with 17 million gene sequences. We evaluate GeneBERT on various downstream tasks, including promoter prediction, transcription factor binding site prediction, disease risk estimation, and RNA splicing. Extensive experiments demonstrate the effectiveness of multi-modal and self-supervised pre-training for large-scale genome data.
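
To make the two input modalities concrete, the following is a minimal sketch of how a 1d sequence and a 2d (transcription factors × regions) matrix might be packaged as one example. It is not the authors' implementation: the integer encoding of nucleotides, the binary matrix values, the helper names (`encode_sequence`, `build_tf_region_matrix`, `make_example`), and the transcription factor names in the usage note are all assumptions chosen for illustration. The abstract does not specify how either input is encoded.

```python
from typing import Dict, List, Sequence, Tuple

NUCLEOTIDES = "ACGT"
UNKNOWN_ID = 4  # assumed id for any base outside A/C/G/T (e.g. N)


def encode_sequence(seq: str) -> List[int]:
    """Map each nucleotide to an integer id (assumed encoding, not from the paper)."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    return [index.get(base, UNKNOWN_ID) for base in seq.upper()]


def build_tf_region_matrix(
    tfs: Sequence[str],
    n_regions: int,
    hits: Sequence[Tuple[str, int]],
) -> List[List[int]]:
    """Build a (transcription factors x regions) matrix.

    Rows follow the order of `tfs`, columns are region indices.
    A cell is 1 where a (tf, region) pair appears in `hits`, else 0.
    Binary values are an assumption; real data may hold scores instead.
    """
    row_of = {tf: r for r, tf in enumerate(tfs)}
    matrix = [[0] * n_regions for _ in tfs]
    for tf, region in hits:
        matrix[row_of[tf]][region] = 1
    return matrix


def make_example(
    seq: str,
    tfs: Sequence[str],
    n_regions: int,
    hits: Sequence[Tuple[str, int]],
) -> Dict[str, object]:
    """Bundle the 1d and 2d modalities into a single multi-modal example."""
    return {
        "sequence_ids": encode_sequence(seq),
        "tf_region": build_tf_region_matrix(tfs, n_regions, hits),
    }
```

For instance, `make_example("ACGTN", ["CTCF", "GATA1"], 3, [("CTCF", 0), ("GATA1", 2)])` yields a five-element id list alongside a 2 × 3 matrix. A model in this setting would consume both fields together, which is the sense in which the input is multi-modal.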
