TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (LARGE)	Wasserstein Distance (WD)	88.3 ± .5	# 17
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (LARGE)	# Correct Groups	33 ± 2	# 20
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (LARGE)	Fowlkes Mallows Score (FMS)	26.5 ± .2	# 20
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (LARGE)	Adjusted Rand Index (ARI)	8.2 ± .3	# 20
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (LARGE)	Adjusted Mutual Information (AMI)	10.3 ± .3	# 19
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (LARGE)	# Solved Walls	0 ± 0	# 10
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (BASE)	Wasserstein Distance (WD)	89.5 ± .4	# 18
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (BASE)	# Correct Groups	22 ± 2	# 22
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (BASE)	Fowlkes Mallows Score (FMS)	25.1 ± .2	# 21
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (BASE)	Adjusted Rand Index (ARI)	6.4 ± .3	# 21
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (BASE)	Adjusted Mutual Information (AMI)	8.1 ± .4	# 21
Only Connect Walls Dataset Task 1 (Grouping)	OCW	BERT (BASE)	# Solved Walls	0 ± 0	# 10
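
Two of the metrics in the table above can be illustrated with a minimal sketch. It assumes a wall of 16 clues in four groups of four, and that the Fowlkes Mallows Score is reported on a 0 to 100 scale; neither assumption is stated on this page. The function names and label encoding are hypothetical, and this is not the benchmark's own evaluation code.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Count clue pairs by whether they share a group in truth and in prediction."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the ground truth
    return tp, fp, fn


def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall, in [0, 1]."""
    tp, fp, fn = pair_counts(true_labels, pred_labels)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


def correct_groups(true_labels, pred_labels):
    """Number of predicted groups that match a true group exactly."""
    def groups(labels):
        return {
            frozenset(i for i, label in enumerate(labels) if label == g)
            for g in set(labels)
        }
    return len(groups(true_labels) & groups(pred_labels))


# Hypothetical wall: 16 clues, four groups of four (assumed layout).
truth = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
# A prediction that swaps one clue between the first two groups.
swapped = [0, 0, 0, 1, 1, 1, 1, 0] + [2] * 4 + [3] * 4

fms_perfect = fowlkes_mallows(truth, truth)
fms_swapped = fowlkes_mallows(truth, swapped)
groups_swapped = correct_groups(truth, swapped)
```

Under these assumptions, a wall would count as solved only when all four predicted groups match, which is consistent with both models scoring 0 on # Solved Walls while still getting some groups right.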

Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information

25 Nov 2019 · Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, Byunghan Lee, Sungroh Yoon

To bridge the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies have adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained on a substantial amount of unlabeled data, and the learned representations were transferred to various downstream tasks. Most pre-training methods rely solely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same-family prediction. PLUS can be used to pre-train various model architectures. In this work, we use PLUS to pre-train a bidirectional recurrent neural network and refer to the resulting model as PLUS-RNN. Our experimental results demonstrate that PLUS-RNN outperforms other models of similar size pre-trained solely with language modeling on six out of seven widely used protein biology tasks. Furthermore, we present the results of our qualitative interpretation analyses to illustrate the strengths of PLUS-RNN. PLUS provides a novel way to exploit evolutionary relationships among unlabeled proteins and is broadly applicable across a variety of protein biology tasks. We expect that the gap between the numbers of unlabeled and labeled proteins will continue to grow exponentially, and that the proposed pre-training method will play a larger role.
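
The two pre-training tasks named in the abstract can be sketched as data preparation steps. This is a minimal illustration, not the authors' implementation: the abstract does not give the masking rate, mask token, or pair sampling strategy, so the 15% rate, the `<mask>` token, the exhaustive pairing, and all function names and family identifiers below are assumptions.

```python
import random
from itertools import combinations

MASK = "<mask>"  # hypothetical mask token


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Masked language modeling input: hide some residues, keep them as targets.

    Returns the masked token list and a dict mapping each masked
    position to the original residue the model should recover.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    tokens = list(seq)
    targets = {}
    for i, residue in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = residue
            tokens[i] = MASK
    return tokens, targets


def same_family_pairs(families):
    """Same-family prediction examples: label 1 if two proteins share a family.

    `families` maps a protein id to a family id (both hypothetical here).
    """
    ids = sorted(families)
    return [
        (a, b, int(families[a] == families[b]))
        for a, b in combinations(ids, 2)
    ]


# Toy data: a made-up sequence and made-up family assignments.
tokens, targets = mask_sequence("MKTAYIAKQR")
pairs = same_family_pairs({"p1": "famA", "p2": "famA", "p3": "famB"})
```

A model trained on both tasks would receive the masked tokens with their recovery targets alongside the labeled sequence pairs; how the two losses are combined is not described in the abstract and is left out of this sketch.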

Code

mswzeus/PLUS official

Tasks

Language Modelling

Masked Language Modeling

Only Connect Walls Dataset Task 1 (Grouping)

Datasets

OCW

Results from the Paper

Ranked #17 on Only Connect Walls Dataset Task 1 (Grouping) on OCW (using extra training data)

