UNIMO

Introduced by Li et al. in UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

UNIMO is a multi-modal pre-training architecture that adapts to both single-modal and multi-modal understanding and generation tasks. It learns visual and textual representations simultaneously from a large-scale corpus of image collections, text corpora, and image-text pairs. Cross-modal contrastive learning (CMCL), applied to the image-text pairs, aligns the visual and textual representations and unifies them into the same semantic space.
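
As a rough illustration of how a contrastive objective can pull paired image and text representations into one space, the sketch below implements a generic symmetric InfoNCE-style loss in NumPy. It is not the CMCL objective from the paper, which the description above does not spell out. The function names, the temperature value, and the use of in-batch negatives are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Generic symmetric contrastive loss (illustrative, not the paper's CMCL).

    Row i of image_emb and row i of text_emb are assumed to be a matched
    image-text pair; every other row in the batch serves as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should rank its own text highest in its row.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each text should rank its own image highest in its column.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check with random stand-in embeddings (hypothetical data, not real features).
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = contrastive_loss(emb, emb)          # every pair matches exactly
shuffled = contrastive_loss(emb, emb[::-1])   # pairs deliberately mismatched
```

In this toy setup the aligned batch yields a lower loss than the mismatched one, which is the signal a contrastive objective uses to draw matched pairs together and push unmatched pairs apart.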

Source: UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Tasks

Task	Papers	Share
Image Captioning	2	20.00%
Image Generation	1	10.00%
Question Answering	1	10.00%
Text-to-Image Generation	1	10.00%
Visual Question Answering (VQA)	1	10.00%
Multimodal Sentiment Analysis	1	10.00%
Sentiment Analysis	1	10.00%
Video Understanding	1	10.00%
Cross-Modal Retrieval	1	10.00%

Components

Component	Type
CMCL	Self-Supervised Learning

Categories

Multi-Modal Methods

Vision and Language Pre-Trained Models