Multimodal C4 (MMC4) augments the popular text-only C4 corpus with interleaved images. The corpus comprises 103M documents containing 585M images interleaved with 43B English tokens.
Source: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text