Our understanding of the visual world goes beyond naming objects, encompassing our ability to parse objects into meaningful parts, attributes, and relations.
Graph neural networks (GNNs) and chemical fingerprints are the predominant approaches to representing molecules for property prediction.
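As a concrete illustration of the fingerprint approach, the sketch below computes a Morgan (ECFP-style) bit-vector fingerprint with RDKit. RDKit and the example SMILES string are assumptions for illustration, not part of the original text; any cheminformatics toolkit with Morgan fingerprint support would serve equally well.

```python
# A minimal sketch of molecular fingerprinting, assuming RDKit is installed.
from rdkit import Chem
from rdkit.Chem import AllChem

# Parse an example molecule (ethanol) from its SMILES string.
mol = Chem.MolFromSmiles("CCO")

# Compute a 2048-bit Morgan fingerprint (radius 2, comparable to ECFP4).
# The resulting bit vector can feed any off-the-shelf property predictor.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```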
Visual question answering (VQA) models respond to open-ended natural language questions about images. However, such models have been shown to over-rely on linguistic biases in VQA datasets, answering questions "blindly" without considering visual context.
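One common way to expose this bias is a question-only "blind" baseline: if answers can be predicted from question text alone, the dataset carries linguistic priors. The sketch below is a hedged illustration of that diagnostic, assuming scikit-learn; the toy question/answer pairs are invented for the example and do not come from any real VQA dataset.

```python
# A question-only ("blind") baseline: predict answers from question text
# alone, never looking at the image. The tiny dataset is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

questions = [
    "what color is the banana", "what color is the sky",
    "how many dogs are there", "how many people are there",
    "is the man smiling", "is the cat sleeping",
]
answers = ["yellow", "blue", "2", "3", "yes", "yes"]

# Bag-of-words features from the question text; no visual input is used.
X = CountVectorizer().fit_transform(questions)
clf = LogisticRegression(max_iter=1000).fit(X, answers)

# Any above-chance accuracy here comes purely from linguistic priors.
print("question-only training accuracy:", clf.score(X, answers))
```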
Because related words appear in similar contexts, semantic vector spaces, known as "word embeddings", can be learned from patterns of lexical co-occurrence in natural language.
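The sketch below makes this concrete under stated assumptions: it counts word co-occurrences within a small context window over a toy corpus, then factors the count matrix with a truncated SVD so that words appearing in similar contexts receive nearby vectors. The corpus and window size are illustrative choices, not from the original text; real systems train on large corpora and often reweight counts (e.g., with PPMI) first.

```python
# A minimal sketch of learning word embeddings from co-occurrence statistics.
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[idx[w], idx[corpus[j]]] += 1

# Truncated SVD of the count matrix: rows of U * S give low-dimensional
# word vectors in which distributionally similar words end up close together.
U, S, _ = np.linalg.svd(counts, full_matrices=False)
dim = 3
embeddings = U[:, :dim] * S[:dim]
for w in vocab:
    print(w, np.round(embeddings[idx[w]], 2))
```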