no code implementations • 16 Dec 2024 • Ruijie Lu, Yixin Chen, Junfeng Ni, Baoxiong Jia, Yu Liu, Diwen Wan, Gang Zeng, Siyuan Huang
Repurposing pre-trained diffusion models has proven effective for novel view synthesis (NVS).
no code implementations • 4 Sep 2024 • Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang
Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling.
no code implementations • 25 Aug 2024 • Rundong Luo, Haoran Geng, Congyue Deng, Puhao Li, Zan Wang, Baoxiong Jia, Leonidas Guibas, Siyuan Huang
We also demonstrate applications in 3D printing, robot manipulation, and sequential part generation, highlighting our strength on realistic tasks that demand high physical plausibility.
no code implementations • 13 Aug 2024 • Yu Liu, Baoxiong Jia, Yixin Chen, Siyuan Huang
The ability to distill object-centric abstractions from intricate visual scenes underpins human-level generalization.
no code implementations • 7 Aug 2024 • Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence.
no code implementations • 19 May 2024 • Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene.
no code implementations • 16 Apr 2024 • Peiyuan Zhi, Zhiyuan Zhang, Muzhi Han, Zeyu Zhang, Zhitian Li, Ziyuan Jiao, Baoxiong Jia, Siyuan Huang
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback.
no code implementations • CVPR 2024 • Yandan Yang, Baoxiong Jia, Peiyuan Zhi, Siyuan Huang
With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation.
1 code implementation • 26 Mar 2024 • Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang
Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges.
no code implementations • 17 Jan 2024 • Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data.
no code implementations • CVPR 2024 • Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang
Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges.
1 code implementation • 18 Nov 2023 • Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
However, several significant challenges remain: (i) most of these models rely on 2D images and exhibit a limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning, and acting.
1 code implementation • ICCV 2023 • Bo Dai, Linge Wang, Baoxiong Jia, Zeyu Zhang, Song-Chun Zhu, Chi Zhang, Yixin Zhu
Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy.
1 code implementation • ICCV 2023 • Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
2 code implementations • CVPR 2023 • Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu
SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning.
1 code implementation • 28 Nov 2022 • Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian Ma, Qing Li, Siyuan Huang
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
2 code implementations • 17 Oct 2022 • Baoxiong Jia, Yu Liu, Siyuan Huang
The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning.
1 code implementation • 8 Oct 2022 • Baoxiong Jia, Ting Lei, Song-Chun Zhu, Siyuan Huang
The challenges of such a capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies.
2 code implementations • 13 Jun 2022 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu
Latent-space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling.
no code implementations • 25 Nov 2021 • Chi Zhang, Sirui Xie, Baoxiong Jia, Ying Nian Wu, Song-Chun Zhu, Yixin Zhu
Extensive experiments show that by incorporating an algebraic treatment, the ALANS learner outperforms various pure connectionist models in domains requiring systematic generalization.
no code implementations • CVPR 2021 • Chi Zhang, Baoxiong Jia, Song-Chun Zhu, Yixin Zhu
To fill in this gap, we propose a neuro-symbolic Probabilistic Abduction and Execution (PrAE) learner; central to the PrAE learner is the process of probabilistic abduction and execution on a probabilistic scene representation, akin to the mental manipulation of objects.
no code implementations • CVPR 2021 • Chi Zhang, Baoxiong Jia, Mark Edmonds, Song-Chun Zhu, Yixin Zhu
Causal induction, i.e., identifying unobservable mechanisms that lead to the observable relations among variables, has played a pivotal role in modern scientific discovery, especially in scenarios with only sparse and limited data.
no code implementations • 1 Jan 2021 • Chi Zhang, Sirui Xie, Baoxiong Jia, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
We further show that the algebraic representation learned can be decoded by isomorphism and used to generate an answer.
1 code implementation • ECCV 2020 • Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, Song-Chun Zhu
Understanding and interpreting human actions is a long-standing challenge and a critical indicator of perception in artificial intelligence.
1 code implementation • NeurIPS 2019 • Chi Zhang, Baoxiong Jia, Feng Gao, Yixin Zhu, Hongjing Lu, Song-Chun Zhu
"Thinking in pictures," [1] i.e., spatial-temporal reasoning, effortless and instantaneous for humans, is believed to be a significant ability to perform logical induction and a crucial factor in the intellectual history of technology development.
no code implementations • CVPR 2019 • Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, Song-Chun Zhu
In this work, we propose a new dataset, built in the context of Raven's Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning in a hierarchical representation.
1 code implementation • ECCV 2018 • Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, Song-Chun Zhu
For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels.
Ranked #32 on Human-Object Interaction Detection on V-COCO
no code implementations • ICML 2018 • Siyuan Qi, Baoxiong Jia, Song-Chun Zhu
Future prediction on sequence data (e.g., videos or audio) requires algorithms to capture the non-Markovian and compositional properties of high-level semantics.