Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, images and their captions.
1 PAPER • NO BENCHMARKS YET