Are Visual-Linguistic Models Commonsense Knowledge Bases?
Despite the recent success of pretrained language models as on-the-fly knowledge sources for various downstream tasks, they have been shown to inadequately represent trivial common facts that vision typically captures. This limits their application to natural language understanding tasks that require commonsense knowledge. We seek to determine the capability of pretrained visual-linguistic models as on-demand knowledge sources. To this end, we systematically compare language-only and visual-linguistic models on a zero-shot commonsense question answering task. We find that visual-linguistic models are highly promising for text-only tasks that involve certain types of commonsense knowledge associated with the visual world. Surprisingly, this knowledge can be activated even when no visual input is given during inference, suggesting effective multimodal fusion during pretraining. However, we also find that there is still substantial room for improvement in cross-modal reasoning abilities and in pretraining strategies for event understanding.