To construct the MICROSOFT RESEARCH MULTIMODAL ALIGNED RECIPE CORPUS, the authors first extract a large number of text and video recipes from the web. The goal is to find joint alignments between multiple text recipes and multiple video recipes for the same dish. The task is challenging because recipes for the same dish differ in the order of their instructions and in the ingredients they use. Moreover, video instructions can be noisy, and text and video instructions describe steps at different levels of specificity.
Source: A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks