The WikiScenes dataset consists of paired images and language descriptions capturing world landmarks and cultural sites, with associated 3D models and camera poses. WikiScenes is derived from the large public catalog of freely licensed, crowdsourced data in the Wikimedia Commons project, which contains a wide variety of images with captions and other metadata.

The dataset provides two forms of textual description for each image: (1) a caption, which describes the image in free-form language, and (2) a WikiCategory hierarchy, derived from the hierarchy of WikiCategories associated with the image (see the examples in the image below). Overall, WikiScenes contains approximately 63K images with textual descriptions.
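
To make the two forms of text concrete, here is a minimal Python sketch of how one image's annotations might be held in memory. It is an illustration only: the `SceneImage` class, its field names, and the sample values are all hypothetical placeholders and do not reflect the dataset's actual file layout or schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneImage:
    """Hypothetical in-memory record for one image (not the dataset's schema)."""
    filename: str             # placeholder image file name
    caption: str              # (1) free-form caption text
    category_path: List[str]  # (2) WikiCategory hierarchy, root first


def describe(record: SceneImage) -> str:
    """Join the category hierarchy and the caption into one readable string."""
    return f"{' > '.join(record.category_path)}: {record.caption}"


# Made-up sample values, used only to show the shape of a record.
example = SceneImage(
    filename="example.jpg",
    caption="View of the west facade at sunset",
    category_path=["Cathedrals", "Example Cathedral", "Exterior", "West facade"],
)

print(describe(example))
```

Storing the hierarchy as an ordered list, root first, keeps both the coarse label (the landmark) and the fine-grained label (the part of the landmark) available from the same field.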


