Multi-Modal Methods

UNIMO is a multi-modal pre-training architecture that adapts to both single-modal and multi-modal understanding and generation tasks. It learns visual and textual representations simultaneously from a large-scale corpus of image collections, text corpora, and image-text pairs. Cross-modal contrastive learning (CMCL) over the image-text pairs aligns the visual and textual representations and unifies them into the same semantic space.
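To make the idea of contrastive alignment concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss over a batch of paired image and text embeddings. It is an illustration under stated assumptions, not UNIMO's exact CMCL objective: the function names, the temperature value, and the use of in-batch negatives are all assumptions made for this example, and the embeddings are toy vectors rather than encoder outputs.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cross_entropy_on_diagonal(logits):
    # Mean of -log softmax(row_i)[i]: each row's matching item sits on the diagonal.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    # Hypothetical helper: pair i of image_emb and text_emb is the positive,
    # every other pairing in the batch is treated as a negative.
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)

# Toy data: three "images" and three "texts" in a shared 3-dimensional space.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
texts_shuffled = texts_aligned[1:] + texts_aligned[:1]  # break the pairing

aligned = contrastive_loss(images, texts_aligned)
shuffled = contrastive_loss(images, texts_shuffled)
print(f"aligned pairs:  {aligned:.4f}")
print(f"shuffled pairs: {shuffled:.4f}")
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairing is broken, which is the pressure that pulls the two modalities into one semantic space during training.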

Source: UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning


