Do Large Scale Molecular Language Representations Capture Important Structural Information?
Predicting the chemical properties of a molecule is of great importance in many applications, including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at a much lower computational cost than, for example, Density Functional Theory (DFT) calculations. Various supervised representation learning methods, including those that use features extracted by graph neural nets, have emerged for such tasks. However, the vast chemical space and the limited availability of labels make supervised learning challenging, calling for a general-purpose molecular representation. Recently, transformer-based language models pre-trained on large unlabeled corpora have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer. This model employs a linear attention mechanism coupled with highly parallelized training on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms supervised and unsupervised graph neural net baselines on several regression and classification tasks from 10 benchmark datasets, while performing competitively on others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer indeed learns a molecule's local and global structural aspects. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to predict diverse molecular properties, including quantum-chemical properties.
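To make the phrase "linear attention mechanism" concrete, the following is a minimal sketch of generic kernelized linear attention in plain Python. It is not MoLFormer's implementation: the abstract does not specify the attention variant, so the feature map `elu(x) + 1` and the function names `feature_map`, `linear_attention`, and `quadratic_reference` are assumptions chosen for illustration. The sketch shows why the cost grows linearly with sequence length: the keys and values are summarized once, and each query then reads from that summary instead of comparing against every key.

```python
import math


def feature_map(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    # For x > 0 this is x + 1; for x <= 0 it is exp(x).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    """Kernelized attention in O(n) over sequence length n.

    Q and K are lists of n vectors of size d_k; V is a list of n vectors
    of size d_v. Illustrative only, not the paper's code.
    """
    d_k, d_v = len(K[0]), len(V[0])
    # S[a][b] = sum_j phi(k_j)[a] * v_j[b]; z[a] = sum_j phi(k_j)[a]
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        pk = feature_map(k)
        for a in range(d_k):
            z[a] += pk[a]
            for b in range(d_v):
                S[a][b] += pk[a] * v[b]
    out = []
    for q in Q:
        pq = feature_map(q)
        denom = sum(pq[a] * z[a] for a in range(d_k))
        out.append(
            [sum(pq[a] * S[a][b] for a in range(d_k)) / denom for b in range(d_v)]
        )
    return out


def quadratic_reference(Q, K, V):
    # Same kernel evaluated pairwise in O(n^2), used only to check the result.
    out = []
    for q in Q:
        pq = feature_map(q)
        w = [sum(a * b for a, b in zip(pq, feature_map(k))) for k in K]
        total = sum(w)
        out.append(
            [sum(wj * v[b] for wj, v in zip(w, V)) / total for b in range(len(V[0]))]
        )
    return out
```

Both functions compute the same weighted average of the value vectors; only the order of the sums differs. In a real model, Q, K, and V would come from learned projections of the token embeddings of a SMILES string, which this sketch leaves out.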