Visual Madlibs is a dataset of 360,001 focused natural language descriptions for 10,738 images. It was collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions of people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context.
Source: Visual Madlibs: Fill in the blank Image Generation and Question Answering
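
To make the fill-in-the-blank idea concrete, here is a minimal Python sketch of how such a template might be instantiated and then completed. The template wording, the category names, and the helpers `make_prompt` and `complete` are all hypothetical and chosen for illustration; they are not taken from the dataset or its released code.

```python
# Hypothetical templates: the wording and category names are illustrative
# assumptions, not the actual templates used to collect Visual Madlibs.
TEMPLATES = {
    "object_attribute": "The {object} is ____.",
    "person_activity": "The {person} is ____.",
    "scene": "The place is a(n) ____.",
}


def make_prompt(kind, **slots):
    """Fill the known slots of a template, leaving the blank for an annotator."""
    return TEMPLATES[kind].format(**slots)


def complete(prompt, answer):
    """Insert an answer into the first blank to form a full description."""
    return prompt.replace("____", answer, 1)


prompt = make_prompt("object_attribute", object="frisbee")
print(prompt)                          # The frisbee is ____.
print(complete(prompt, "bright red"))  # The frisbee is bright red.
```

In this sketch the slot values (such as the object name) would come from existing image annotations, which is one plausible way templates could be produced automatically; the blank is what a human annotator would then fill in.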