In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.
This work addresses the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR).
We present Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains, with applications in visual question answering.
To endow machine intelligence with this crucial cognitive ability, we propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated with a grammar model, the And-Or Graph (AOG).
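To make the grammar-based generation concrete, below is a minimal, hypothetical sketch of sampling arithmetic problems from a toy And-Or grammar: an And-node expands all of its children in order, while an Or-node picks exactly one. The grammar rules, symbol names, and depth limit are illustrative assumptions, not the actual MNS generator.

```python
import random

# Toy And-Or grammar for arithmetic expressions (illustrative only; not
# the actual MNS generator). An And-node expands all of its children in
# order; an Or-node picks exactly one child at random.
GRAMMAR = {
    "Expr": ("or", ["Number", "BinOp"]),       # Or-node: choose a branch
    "BinOp": ("and", ["Expr", "Op", "Expr"]),  # And-node: expand every part
    "Op": ("or", ["+", "-"]),
    "Number": ("or", [str(n) for n in range(10)]),
}

def sample(symbol: str, depth: int = 0, max_depth: int = 3) -> str:
    """Recursively sample an expression string from the toy And-Or grammar."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol, e.g. "+" or "7"
    kind, children = GRAMMAR[symbol]
    if kind == "or":
        # Force a terminal near the depth limit so problems stay small.
        if depth >= max_depth and symbol == "Expr":
            return sample("Number", depth + 1, max_depth)
        return sample(random.choice(children), depth + 1, max_depth)
    # And-node: concatenate the expansion of every child.
    return " ".join(sample(c, depth + 1, max_depth) for c in children)

if __name__ == "__main__":
    problem = sample("Expr")
    print(problem, "=", eval(problem))  # e.g. "3 + 5 = 8"
```

In the dataset itself, such symbolic expressions would additionally be rendered as images; this sketch only shows the grammar-sampling step.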
In this paper, we present an approach and a benchmark for visual reasoning in robotics applications, in particular small-object grasping and manipulation.
We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language work.
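As a toy illustration of the pixel-level (rather than region-based) alignment this entry contrasts, here is a hypothetical sketch in which every word attends over a grid of pixel embeddings to produce an image-sentence score; the function, tensor shapes, and temperature are assumptions for illustration, not the paper's actual model.

```python
import torch
import torch.nn.functional as F

def word_to_pixel_alignment(pixel_feats, word_feats, temperature=0.1):
    """Hypothetical sketch of pixel-level image-text alignment.

    pixel_feats: (H*W, D) L2-normalized grid/pixel embeddings of one image.
    word_feats:  (T, D)   L2-normalized word embeddings of one sentence.
    Returns a scalar similarity built from word-to-pixel attention,
    rather than from pre-detected region features.
    """
    sim = word_feats @ pixel_feats.t()             # (T, H*W) cosine sims
    attn = F.softmax(sim / temperature, dim=1)     # each word attends to pixels
    attended = attn @ pixel_feats                  # (T, D) pixel context per word
    word_scores = F.cosine_similarity(word_feats, attended, dim=1)  # (T,)
    return word_scores.mean()                      # sentence-level score

if __name__ == "__main__":
    pixels = F.normalize(torch.randn(49, 256), dim=1)  # 7x7 grid, D=256
    words = F.normalize(torch.randn(6, 256), dim=1)    # 6-word sentence
    print(word_to_pixel_alignment(pixels, words).item())
```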
We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation.
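For context on what training dense object descriptors typically involves, here is a minimal sketch of the pixelwise contrastive loss used in that line of work. The tensor shapes and margin value are illustrative assumptions, and pixel correspondences are simply taken as given (in practice they usually come from registered RGB-D views of the scene).

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               non_matches_a, non_matches_b, margin=0.5):
    """Pixelwise contrastive loss for dense descriptors (illustrative sketch).

    desc_a, desc_b: (D, H, W) descriptor images from two views of the scene.
    matches_*:      (N, 2) long tensors of (row, col) corresponding pixels.
    non_matches_*:  (M, 2) long tensors of known non-correspondences.
    """
    def gather(desc, coords):
        # Pull out the descriptor vector at each (row, col) coordinate.
        return desc[:, coords[:, 0], coords[:, 1]].t()  # (N, D)

    d_match = gather(desc_a, matches_a) - gather(desc_b, matches_b)
    d_non = gather(desc_a, non_matches_a) - gather(desc_b, non_matches_b)

    # Pull matching pixels together; push non-matches at least `margin` apart.
    match_loss = (d_match ** 2).sum(dim=1).mean()
    non_match_loss = F.relu(margin - d_non.norm(dim=1)).pow(2).mean()
    return match_loss + non_match_loss
```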
To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features.
Abstract reasoning refers to the ability to analyze information, discover rules at an intangible level, and solve problems in innovative ways.