MMChat is a large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images).
MMChat contains image-grounded dialogues collected from real conversations on social media.
We manually annotate 100K dialogues from MMChat for dialogue quality and for whether each dialogue is related to its given image.
We provide the rule-filtered raw dialogues that are used to create MMChat (Rule Filtered Raw MMChat). It contains 4.257M dialogue sessions and 4.874M images.
We also provide a version of MMChat that is filtered based on LCCC (LCCC Filtered MMChat). This version contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).