One of the key factors in enabling machine learning models to comprehend and solve real-world tasks is leveraging multimodal data.
In this work, we propose to exploit the natural correlation between narrations and the visual presence of objects in video to learn an object detector and retrieval system without any manual labeling.
We show that the time-consuming local annotation required by supervised learning can be reduced by a weakly supervised method that leverages only a subset of locally annotated data.
The high cost of generating expert annotations poses a strong limitation for supervised machine learning methods in medical imaging.