Leveraging the massive knowledge and learning schemes of large language models (LLMs), recent machine learning models have achieved notable success in building generalist agents capable of general-purpose task solving across diverse domains, including natural language processing, computer vision, and robotics.
Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy.
To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning.
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning.
The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies.
Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling.
Extensive experiments show that by incorporating an algebraic treatment, the ALANS learner outperforms various pure connectionist models in domains requiring systematic generalization.
Causal induction, i.e., identifying unobservable mechanisms that lead to the observable relations among variables, has played a pivotal role in modern scientific discovery, especially in scenarios with only sparse and limited data.
To fill in this gap, we propose a neuro-symbolic Probabilistic Abduction and Execution (PrAE) learner; central to the PrAE learner is the process of probabilistic abduction and execution on a probabilistic scene representation, akin to the mental manipulation of objects.
We further show that the learned algebraic representation can be decoded via isomorphism and used to generate an answer.
Understanding and interpreting human actions is a long-standing challenge and a critical indicator of perception in artificial intelligence.
"Thinking in pictures," i.e., spatial-temporal reasoning, effortless and instantaneous for humans, is believed to be a significant ability to perform logical induction and a crucial factor in the intellectual history of technology development.
In this work, we propose a new dataset, built in the context of Raven's Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning in a hierarchical representation.
For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels.
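The parse-graph output described above can be pictured as a simple data structure: an adjacency matrix over scene entities plus per-node labels. Below is a minimal sketch of that structure only, not of GPNN's inference procedure; the class, field names, and example values are all illustrative assumptions.

```python
# Sketch of a parse-graph container: node labels plus a soft adjacency
# matrix of human-object interaction (HOI) edge weights.
# Illustrative only -- this is NOT GPNN's actual implementation.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParseGraph:
    labels: List[str]             # node labels (e.g., action or object classes)
    adjacency: List[List[float]]  # soft HOI edge weights in [0, 1]

    def edges(self, threshold: float = 0.5) -> List[Tuple[int, int]]:
        """Return (i, j) node pairs whose edge weight exceeds the threshold."""
        n = len(self.labels)
        return [(i, j) for i in range(n) for j in range(n)
                if i != j and self.adjacency[i][j] > threshold]


# Hypothetical scene: one human node and two object nodes.
g = ParseGraph(
    labels=["person:holding", "cup", "table"],
    adjacency=[[0.0, 0.9, 0.1],
               [0.9, 0.0, 0.0],
               [0.1, 0.0, 0.0]],
)
print(g.edges())  # [(0, 1), (1, 0)]
```

Thresholding the soft adjacency matrix recovers a discrete HOI graph structure, while the label list carries the per-node predictions.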
Future prediction on sequence data (e.g., videos or audio) requires algorithms to capture the non-Markovian and compositional properties of high-level semantics.