Adversarial Modality Alignment Network for Cross-Modal Molecule Retrieval

The cross-modal molecule retrieval (Text2Mol) task aims to bridge the semantic gap between molecules and natural-language descriptions. Existing solutions to this non-trivial problem rely on a graph convolutional network (GCN) and cross-modal attention with contrastive learning, yielding reasonable results. However, several issues remain: 1) the cross-modal attention mechanism benefits only the text representations and provides no helpful information for the molecule representations; 2) the GCN-based molecule encoder ignores edge features and the varying importance of a molecule's substructures; 3) the retrieval learning loss function is rather simplistic. This paper further investigates the Text2Mol problem and proposes a novel method based on an Adversarial Modality Alignment Network (AMAN) to fully learn both description and molecule information. Our method uses SciBERT as the text encoder and a graph transformer network as the molecule encoder to generate multimodal representations. An adversarial network then aligns these modalities interactively, while a triplet loss performs retrieval learning and further strengthens the modality alignment. Experiments on the ChEBI-20 dataset show the effectiveness of AMAN compared with baselines.
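The triplet retrieval objective mentioned in the abstract can be sketched as follows. This is a minimal illustration of a standard triplet margin loss over paired text/molecule embeddings; the vectors, dimensionality, and margin value are hypothetical and not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the matching (anchor, positive)
    pair together and push the mismatched (anchor, negative) pair apart
    until their distance gap exceeds the margin."""
    d_pos = np.linalg.norm(anchor - positive)   # text-to-matching-molecule distance
    d_neg = np.linalg.norm(anchor - negative)   # text-to-other-molecule distance
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings: a text description (anchor), its matching
# molecule (positive), and a non-matching molecule (negative).
text      = np.array([1.0, 0.0])
mol_match = np.array([0.9, 0.1])
mol_other = np.array([-1.0, 0.5])

loss = triplet_loss(text, mol_match, mol_other)
# Here d_pos ≈ 0.14 and d_neg ≈ 2.06, so the margin constraint is already
# satisfied and the loss is 0; swapping positive and negative yields a
# positive loss that the encoders would be trained to reduce.
```

In the paper's setting the anchor would be a SciBERT description embedding and the positive/negative would be graph-transformer molecule embeddings, trained jointly with the adversarial alignment objective.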


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Cross-Modal Retrieval | ChEBI-20 | AMAN | Mean Rank | 16.01 | #1 |
| Cross-Modal Retrieval | ChEBI-20 | AMAN | Test MRR | 64.7 | #1 |
| Cross-Modal Retrieval | ChEBI-20 | AMAN | Hits@1 | 49.4 | #1 |
| Cross-Modal Retrieval | ChEBI-20 | AMAN | Hits@10 | 92.1 | #1 |
