Translation between Molecules and Natural Language

25 Apr 2022 · Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji

We present MolT5, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.
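As a rough illustration of self-supervised pretraining on single-modal data, the sketch below applies a T5-style span-corruption step to one text sequence and one molecule string. It assumes MolT5 follows the usual T5 objective, which the abstract does not state; the function name `corrupt_spans`, the fixed span positions, and the character-level split of the SMILES string are all hypothetical choices made for the example (a real setup would sample spans randomly and use a learned tokenizer).

```python
from typing import List, Sequence, Tuple


def corrupt_spans(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel token.

    Returns (corrupted_input, target). The target lists each sentinel
    followed by the tokens it replaced, then a closing sentinel.
    Spans are assumed to be non-overlapping.
    """
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep the unmasked prefix
        corrupted.append(sentinel)              # mask the span in the input
        target.append(sentinel)                 # the target names the sentinel...
        target.extend(tokens[start:end])        # ...then the hidden tokens
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return corrupted, target


# Unlabeled molecule string (SMILES for acetic acid), split per character
# purely for illustration.
smiles_tokens = list("CC(=O)O")
smiles_in, smiles_out = corrupt_spans(smiles_tokens, [(2, 5)])

# Unlabeled natural language text, split on whitespace for illustration.
text_tokens = "acetic acid is a simple carboxylic acid".split()
text_in, text_out = corrupt_spans(text_tokens, [(0, 2), (5, 6)])

print(" ".join(smiles_in), "->", " ".join(smiles_out))
print(" ".join(text_in), "->", " ".join(text_out))
```

Neither example needs a paired caption or molecule, which is the property the abstract relies on to sidestep the scarcity of labeled molecule-text pairs.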



| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Text-based de novo Molecule Generation | ChEBI-20 | MolT5-Large | Text2Mol | 55.4 | #11 |
| | | | BLEU | 85.4 | #4 |
| | | | Exact Match | 30.2 | #5 |
| | | | Levenshtein | 16.07 | #14 |
| | | | MACCS FTS | 83.4 | #14 |
| | | | RDK FTS | 74.6 | #9 |
| | | | Morgan FTS | 68.4 | #11 |
| | | | Frechet ChemNet Distance (FCD) | 1.20 | #13 |
| | | | Validity | 90.5 | #8 |
| | | | Parameter Count | 770,000,000 | #17 |
| Molecule Captioning | ChEBI-20 | MolT5-Small | BLEU-2 | 51.9 | #19 |
| | | | BLEU-4 | 43.6 | #19 |
| | | | ROUGE-1 | 62.0 | #15 |
| | | | ROUGE-2 | 46.9 | #15 |
| | | | ROUGE-L | 56.3 | #14 |
| | | | METEOR | 55.1 | #19 |
| | | | Text2Mol | 54.0 | #12 |
| Molecule Captioning | ChEBI-20 | MolT5-Base | BLEU-2 | 54.0 | #17 |
| | | | BLEU-4 | 45.7 | #16 |
| | | | ROUGE-1 | 63.4 | #11 |
| | | | ROUGE-2 | 48.5 | #12 |
| | | | ROUGE-L | 57.8 | #12 |
| | | | METEOR | 56.9 | #16 |
| | | | Text2Mol | 54.7 | #11 |
| Text-based de novo Molecule Generation | ChEBI-20 | MolT5-Large-HV | Text2Mol | 59.0 | #2 |
| | | | BLEU | 81.0 | #10 |
| | | | Exact Match | 31.4 | #4 |
| | | | Levenshtein | 16.758 | #13 |
| | | | MACCS FTS | 87.2 | #7 |
| | | | RDK FTS | 78.6 | #5 |
| | | | Morgan FTS | 72.2 | #6 |
| | | | Frechet ChemNet Distance (FCD) | 0.44 | #8 |
| | | | Validity | 99.6 | #3 |
| | | | Parameter Count | 770,000,000 | #17 |
| Text-based de novo Molecule Generation | ChEBI-20 | MolT5-Small | Text2Mol | 48.2 | #13 |
| | | | BLEU | 75.5 | #15 |
| | | | Exact Match | 7.9 | #17 |
| | | | Levenshtein | 25.988 | #3 |
| | | | MACCS FTS | 70.3 | #18 |
| | | | RDK FTS | 56.8 | #18 |
| | | | Morgan FTS | 51.7 | #18 |
| | | | Frechet ChemNet Distance (FCD) | 2.49 | #15 |
| | | | Validity | 72.1 | #18 |
| | | | Parameter Count | 60,000,000 | #5 |
| Text-based de novo Molecule Generation | ChEBI-20 | MolT5-Base | Text2Mol | 49.6 | #12 |
| | | | BLEU | 76.9 | #13 |
| | | | Exact Match | 8.1 | #16 |
| | | | Levenshtein | 24.458 | #5 |
| | | | MACCS FTS | 72.1 | #17 |
| | | | RDK FTS | 58.8 | #16 |
| | | | Morgan FTS | 52.9 | #16 |
| | | | Frechet ChemNet Distance (FCD) | 2.18 | #14 |
| | | | Validity | 77.2 | #17 |
| | | | Parameter Count | 220,000,000 | #10 |
| Molecule Captioning | ChEBI-20 | MolT5-Large | BLEU-2 | 59.4 | #8 |
| | | | BLEU-4 | 50.8 | #8 |
| | | | ROUGE-1 | 65.4 | #7 |
| | | | ROUGE-2 | 51.0 | #8 |
| | | | ROUGE-L | 59.4 | #7 |
| | | | METEOR | 61.4 | #8 |
| | | | Text2Mol | 58.2 | #4 |
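
To make the string-level generation metrics listed above more concrete, here is a minimal pure-Python sketch of exact match, mean Levenshtein distance, and Tanimoto similarity (the "FTS" rows are fingerprint Tanimoto similarities). It is not the paper's evaluation code: the function names and toy SMILES pairs are hypothetical, the strings are compared as written rather than canonicalized, and real fingerprints (MACCS, RDK, Morgan) would come from a cheminformatics toolkit rather than the hand-written bit sets used here.

```python
from typing import Sequence, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def exact_match_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Percentage of generated strings identical to their reference."""
    hits = sum(r == h for r, h in zip(refs, hyps))
    return 100.0 * hits / len(refs)


def mean_levenshtein(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Average edit distance over all pairs (lower is better)."""
    return sum(levenshtein(r, h) for r, h in zip(refs, hyps)) / len(refs)


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # convention chosen for this sketch: two empty sets match
    return len(bits_a & bits_b) / len(union)


# Toy reference/generated SMILES pairs, invented for illustration.
references = ["CCO", "CC(=O)O", "c1ccccc1"]
generated = ["CCO", "CC(=O)N", "c1ccccc1C"]

print(f"Exact match: {exact_match_rate(references, generated):.1f}%")
print(f"Mean Levenshtein: {mean_levenshtein(references, generated):.3f}")
print(f"Tanimoto: {tanimoto({1, 2, 3}, {2, 3, 4}):.2f}")
```

In this sketch exact match and Tanimoto similarity are higher-is-better, while Levenshtein distance is lower-is-better, which is worth keeping in mind when comparing models across the rows.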
