Single Layers of Attention Suffice to Predict Protein Contacts

The established approach to protein contact prediction frames the task as graph selection, extracting contacts by estimating the parameters of a Potts model. A second approach has recently emerged that leverages large pretrained Transformers, producing contacts by combining attention maps from various heads. In this work, we provide evidence that these approaches are not as different as they initially seem, establishing a theoretical connection between attention and Potts models. To do so, we introduce a simplified attention model called factored attention. On the one hand, factored attention is a direct simplification of the multihead scaled dot-product attention used in the Transformer. On the other hand, it defines a valid pairwise Markov random field and includes Potts models as a sparse special case. Examining factored attention allows us to explore the relative merits of each model class when learning contacts from an aligned protein family. We then empirically assess factored attention by training it on a wide range of alignments of individual protein families, and we further compare to a large pretrained Transformer trained on a corpus of unaligned protein sequences. We find that a single layer of attention is comparable to state-of-the-art Potts models at contact prediction. Taken together, these results motivate training Transformers on large protein datasets.

ICLR Workshop 2021
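To make the stated connection concrete, here is a rough sketch in our own notation (an assumption based on the abstract, not the authors' exact formulation). A Potts model over an aligned sequence x of length L scores configurations with one free coupling matrix per position pair,

$$E(x) = \sum_{i} h_i(x_i) + \sum_{i \neq j} J_{ij}(x_i, x_j),$$

whereas a factored-attention layer would parameterize the couplings through attention maps that depend only on position and value matrices that depend only on amino-acid identity,

$$J_{ij}(a, b) \approx \sum_{h=1}^{H} \operatorname{softmax}\!\left(\frac{Q^{(h)} K^{(h)\top}}{\sqrt{d}}\right)_{ij} W_V^{(h)}(a, b).$$

With one head per position pair and hard attention, this reduces to an unconstrained Potts parameterization, which is consistent with the abstract's statement that Potts models appear as a sparse special case.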
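Along the same lines, the following is a minimal, hypothetical sketch (not the authors' code) of how a single factored-attention layer could turn learned parameters into a predicted contact map. The shapes, the Frobenius-norm scoring of coupling blocks, and the average product correction (APC) step are assumptions borrowed from standard Potts-style contact extraction.

```python
# Hypothetical sketch: factored-attention couplings -> predicted contact map (not the authors' code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def factored_attention_couplings(Q, K, V):
    """Q, K: (H, L, d) positional queries/keys; V: (H, A, A) symmetric value matrices.
    Returns couplings W of shape (L, L, A, A) with
    W[i, j] = sum_h softmax(Q K^T / sqrt(d))[h, i, j] * V[h]."""
    d = Q.shape[-1]
    attn = softmax(np.einsum("hid,hjd->hij", Q, K) / np.sqrt(d))  # (H, L, L) attention maps
    return np.einsum("hij,hab->ijab", attn, V)                    # (L, L, A, A) couplings

def contact_map(W):
    """Score each coupling block by its Frobenius norm, symmetrize,
    and apply the average product correction (APC) used in Potts-style pipelines."""
    S = np.linalg.norm(W, axis=(2, 3))       # (L, L) coupling strengths
    S = 0.5 * (S + S.T)
    np.fill_diagonal(S, 0.0)
    return S - np.outer(S.sum(1), S.sum(0)) / S.sum()

# Toy usage with random parameters; in practice Q, K, V would be fit to a family alignment.
rng = np.random.default_rng(0)
L, A, H, d = 64, 21, 32, 16                  # alignment length, alphabet size, heads, head dim
Q, K = rng.normal(size=(H, L, d)), rng.normal(size=(H, L, d))
V = rng.normal(size=(H, A, A))
V = 0.5 * (V + V.transpose(0, 2, 1))         # symmetric value (coupling) matrices
C = contact_map(factored_attention_couplings(Q, K, V))
print(C.shape)                               # (64, 64) grid of predicted contact scores
```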
