DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts

bioRxiv 2022 · Yaron Geffen, Yanay Ofran, Ron Unger

Recently, deep learning models initially developed in the field of Natural Language Processing (NLP) were applied successfully to analyze protein sequences. A major drawback of these models is their size, in terms of both the number of parameters that need to be fitted and the computational resources they require. "Distilled" models, built on the concept of student and teacher networks, have recently been widely used in NLP. Here, we adapted this concept to the problem of protein sequence analysis by developing DistilProtBert, a distilled version of the successful ProtBert model. Implementing this approach, we reduced the size of the network and the running time by 50%, and the computational resources needed for pretraining by 98%, relative to the ProtBert model. Using two published tasks, we showed that the performance of the distilled model approaches that of the full model. We next tested the ability of DistilProtBert to distinguish between real and random protein sequences. The task is highly challenging if the composition is maintained on the level of singlet, doublet and triplet amino acids. Indeed, traditional machine learning algorithms have difficulties with this task. Here, we show that DistilProtBert performs very well on singlet, doublet, and even triplet-shuffled versions of the human proteome, with AUC of 0.92, 0.91, and 0.87, respectively. Finally, we suggest that by examining the small number of false-positive classifications (i.e., shuffled sequences classified as proteins by DistilProtBert) we may be able to identify de novo potential natural-like proteins based on random shuffling of amino acid sequences.
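To make the shuffling task in the abstract concrete, here is a minimal Python sketch of the simplest case, a singlet shuffle, which permutes residues so that single amino acid composition is preserved. It is an illustration only and not the authors' procedure: the function names `singlet_shuffle` and `kmer_counts` and the example sequence are hypothetical. Doublet- and triplet-preserving shuffles need a more constrained algorithm and are not shown.

```python
import random
from collections import Counter


def singlet_shuffle(seq: str, seed=None) -> str:
    """Return a random permutation of the residues in `seq`.

    Singlet (single amino acid) composition is preserved exactly;
    doublet and triplet composition generally is not.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers (k=1 singlets, k=2 doublets, k=3 triplets)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    real = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical example sequence
    shuffled = singlet_shuffle(real, seed=0)
    # Singlet composition is identical by construction.
    print(kmer_counts(real, 1) == kmer_counts(shuffled, 1))
    # Doublet composition usually differs after a singlet shuffle.
    print(kmer_counts(real, 2) == kmer_counts(shuffled, 2))
```

Comparing `kmer_counts` at k=2 or k=3 shows why the higher-order variants are harder: a shuffle that also matches doublet or triplet counts leaves far fewer statistical cues separating shuffled sequences from real ones.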

Code

yarongef/DistilProtBert official

Tasks

Dimensionality Reduction

Knowledge Distillation

Language Modelling

Protein Language Model

Protein Secondary Structure Prediction

Results from the Paper

Ranked #4 on Protein Secondary Structure Prediction on TS115

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Protein Secondary Structure Prediction	CASP12	DistilProtBert	Q3	0.72	# 4
Protein Secondary Structure Prediction	CB513	DistilProtBert	Q3	0.79	# 6
Protein Secondary Structure Prediction	TS115	DistilProtBert	Q3	0.81	# 4
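
The Q3 metric in the table is three-state secondary structure accuracy: the fraction of residues whose predicted state (helix, strand, or coil) matches the reference. The sketch below is a minimal illustration under that definition. The function name `q3_accuracy`, the H/E/C label encoding, and the example strings are assumptions; the benchmark's exact evaluation protocol (for example, how unresolved residues are handled) is not stated on this page.

```python
def q3_accuracy(true_ss: str, pred_ss: str) -> float:
    """Fraction of positions where the predicted 3-state label matches.

    Both arguments are per-residue label strings, here assumed to use
    'H' (helix), 'E' (strand) and 'C' (coil).
    """
    if len(true_ss) != len(pred_ss):
        raise ValueError("label strings must have the same length")
    if not true_ss:
        raise ValueError("label strings must not be empty")
    matches = sum(t == p for t, p in zip(true_ss, pred_ss))
    return matches / len(true_ss)


if __name__ == "__main__":
    true_labels = "HHHHEEEECC"  # hypothetical reference labels
    pred_labels = "HHHCEEEECC"  # hypothetical prediction, one mismatch
    print(q3_accuracy(true_labels, pred_labels))  # 0.9
```

Read this way, a Q3 of 0.81 on TS115 means roughly 81% of residues were assigned the correct state.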

Methods

Absolute Position Encodings • Adam • Dense Connections • DistilBERT • Dropout • Knowledge Distillation • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • RAdam • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
