no code implementations • 11 Jun 2025 • Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation.
no code implementations • 5 Jun 2025 • Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li
However, grounding referring expressions beyond objects in 3D scenes remains unexplored.
no code implementations • 15 May 2025 • Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, Qing Li
This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions.
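As a rough illustration of the world-model interface described above, here is a minimal action-conditioned sketch in PyTorch: it encodes a short stack of past frames, injects the action sequence, and decodes a predicted next frame. The module names, layer sizes, and fixed 4-frame context are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisualWorldModel(nn.Module):
    """Predicts the next RGB frame from 4 past frames and 4 past actions.
    A minimal sketch; all layer choices are assumptions, not the paper's model."""
    def __init__(self, action_dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(             # encode stacked past frames
            nn.Conv2d(4 * 3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.action_proj = nn.Linear(4 * action_dim, hidden)
        self.decoder = nn.Sequential(             # decode predicted next frame
            nn.ConvTranspose2d(hidden, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames, actions):
        # frames: (B, 4, 3, H, W); actions: (B, 4, action_dim)
        b, t, c, h, w = frames.shape
        z = self.encoder(frames.reshape(b, t * c, h, w))
        # broadcast the action code over spatial locations
        z = z + self.action_proj(actions.reshape(b, -1))[:, :, None, None]
        return self.decoder(z)                    # (B, 3, H, W)
```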
no code implementations • 30 Apr 2025 • Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent.
1 code implementation • 17 Apr 2025 • Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao wu, Song-Chun Zhu, Qing Li
Building Graphical User Interface (GUI) agents is a promising research direction in which agents simulate human interaction with computers or mobile phones to perform diverse GUI tasks.
no code implementations • 20 Mar 2025 • Muyao Li, ZiHao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks.
no code implementations • 9 Jan 2025 • Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang
Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.).
no code implementations • 31 Dec 2024 • Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI.
no code implementations • 20 Dec 2024 • Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
The advancement of large language models (LLMs) has prompted the development of multi-modal agents, which use an LLM as a controller to call external tools, providing a feasible way to solve practical tasks.
no code implementations • 7 Dec 2024 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI.
1 code implementation • CVPR 2025 • Shaofei Cai, ZiHao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang
Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2.
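A minimal sketch of the mask-conditioned control idea: the policy consumes the RGB observation channel-concatenated with a binary object mask (e.g., from an off-the-shelf tracker) and outputs action logits. Layer sizes and the discrete action head are assumptions for illustration, not ROCKET-1's actual architecture.

```python
import torch
import torch.nn as nn

class MaskConditionedPolicy(nn.Module):
    """Predicts a discrete action from an RGB frame concatenated with a
    binary object mask. Layer sizes are illustrative assumptions."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 8, stride=4), nn.ReLU(),  # RGB + mask channel
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_actions)

    def forward(self, rgb, mask):
        # rgb: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W) in {0, 1}
        x = torch.cat([rgb, mask], dim=1)    # condition by channel concat
        return self.head(self.backbone(x))   # action logits
```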
1 code implementation • 4 Sep 2024 • Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang
Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling.
no code implementations • 7 Aug 2024 • Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li
Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence.
1 code implementation • 7 Jul 2024 • Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing.
Ranked #5 on Image Editing on ImgEdit-Data
no code implementations • 27 Jun 2024 • ZiHao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens.
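The discretization step can be pictured as vector quantization: a trajectory embedding is snapped to its nearest codebook entry, and the resulting token id conditions the policy decoder. The sketch below is a minimal VQ-style illustration under that assumption; the paper's encoder, decoder, and training objective are not reproduced here.

```python
import torch
import torch.nn as nn

class BehaviorTokenizer(nn.Module):
    """Maps a trajectory embedding to its nearest codebook entry, yielding a
    discrete behavior token. A minimal VQ-style sketch, not the paper's model."""
    def __init__(self, num_tokens: int = 512, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, traj_emb):
        # traj_emb: (B, dim) embedding of a trajectory {o_0, a_0, ...}
        dists = torch.cdist(traj_emb, self.codebook.weight)  # (B, num_tokens)
        return dists.argmin(dim=-1)   # token ids that condition the decoder
```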
no code implementations • 27 May 2024 • Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu
To this end, we formulate a learnable energy-based latent space and propose a Noise-intensified Telescoping density-Ratio Estimation (NTRE) scheme for variational learning of an accurate latent space model without costly Markov Chain Monte Carlo.
no code implementations • 19 May 2024 • Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene.
Ranked #10 on 3D Question Answering (3D-QA) on SQA3D
no code implementations • 22 Mar 2024 • Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li
Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians; the projection is based on spatial relationships and needs no additional training.
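One way to picture such a training-free projection: each 3D Gaussian center is projected through the camera and the 2D feature map is sampled at the resulting pixel. This sketch assumes a pinhole camera and omits visibility/occlusion handling, so it illustrates the idea rather than the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def lift_features(centers, feat_map, K, w2c):
    """Assign each 3D Gaussian a 2D semantic feature by projecting its center
    into the image and bilinearly sampling the feature map.
    centers: (N, 3) world-space Gaussian centers
    feat_map: (C, H, W) features from a pre-trained image encoder
    K: (3, 3) intrinsics; w2c: (4, 4) world-to-camera transform."""
    n = centers.shape[0]
    homo = torch.cat([centers, torch.ones(n, 1)], dim=1)       # (N, 4)
    cam = (w2c @ homo.T).T[:, :3]                              # camera coords
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                # pixel coords
    _, h, w = feat_map.shape
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)],
                       dim=-1) * 2 - 1                         # normalize to [-1, 1]
    feats = F.grid_sample(feat_map[None], grid[None, :, None, :],
                          align_corners=True)                  # (1, C, N, 1)
    return feats[0, :, :, 0].T                                 # (N, C)
```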
no code implementations • 18 Mar 2024 • Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.
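The retrieval side of such a memory can be as simple as cosine similarity over stored per-frame features, as in this minimal sketch; the paper's unified memory mechanism is richer, and the names here are illustrative.

```python
import torch

class VideoMemory:
    """A minimal episodic memory: append per-frame features, then retrieve
    the top-k frames most similar to a query embedding."""
    def __init__(self):
        self.keys = []

    def append(self, feat):                      # feat: (D,)
        self.keys.append(feat / feat.norm())

    def retrieve(self, query, k: int = 5):
        keys = torch.stack(self.keys)            # (T, D)
        scores = keys @ (query / query.norm())   # cosine similarity
        return scores.topk(min(k, len(self.keys))).indices  # frame indices
```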
1 code implementation • 8 Mar 2024 • ZiHao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang
We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while substantially mitigating hallucination.
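Schematically, the loop drafts a chain of thoughts and then revises each step against retrieved evidence. In this sketch, `llm(prompt) -> str` and `retrieve(query) -> list[str]` are hypothetical stand-ins for a language model and a retriever, not the paper's API.

```python
def revise_chain_of_thought(task: str, llm, retrieve, max_steps: int = 8):
    """Draft a chain of thoughts, then revise each step with retrieved
    evidence. `llm` and `retrieve` are hypothetical callables."""
    draft = llm(f"Think step by step and draft a plan for: {task}")
    thoughts = draft.split("\n")[:max_steps]
    revised = []
    for thought in thoughts:
        evidence = retrieve(thought)             # ground the step in retrieval
        context = "\n".join(revised + evidence)
        revised.append(llm(
            f"Task: {task}\nVerified steps and evidence:\n{context}\n"
            f"Revise this step to be factually consistent: {thought}"))
    return "\n".join(revised)
```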
no code implementations • CVPR 2024 • Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li
However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge.
1 code implementation • 18 Nov 2023 • Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning and acting.
no code implementations • 10 Nov 2023 • ZiHao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents.
1 code implementation • 16 Oct 2023 • Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang
We also devise a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.
Ranked #3 on Visual Reasoning on Bongard-OpenWorld
no code implementations • 12 Oct 2023 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations.
1 code implementation • NeurIPS 2023 • Peiyu Yu, Yaxuan Zhu, Sirui Xie, Xiaojian Ma, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu
To remedy this sampling issue, in this paper we introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it.
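For context, the baseline sampler being amortized is long-run Langevin MCMC over the latent energy; a minimal sketch follows. The diffusion-based amortization itself is not reproduced here, and the step count and step size are arbitrary.

```python
import torch

def langevin_sample(energy_fn, z0, steps: int = 60, step_size: float = 0.1):
    """Long-run Langevin MCMC in latent space: gradient descent on the energy
    plus Gaussian noise. The paper's contribution is to amortize this chain
    with a learned diffusion model; only the baseline sampler is sketched."""
    z = z0.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy_fn(z).sum(), z)[0]
        z = (z - 0.5 * step_size * grad
             + step_size ** 0.5 * torch.randn_like(z))
        z = z.detach().requires_grad_(True)
    return z.detach()
```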
no code implementations • 18 Sep 2023 • Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao
Large Language Models (LLMs) can perform complex scheduling in a multi-agent system and coordinate these agents to complete sophisticated tasks that require extensive collaboration.
2 code implementations • 14 Sep 2023 • Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
In this paper, we address the above limitation by 1) introducing a vision-language model with multi-modal in-context learning (MMICL), a new approach that allows the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.
Ranked #17 on Visual Reasoning on Winoground
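An interleaved multi-modal in-context prompt can be assembled roughly as below; the `<image_i>` placeholder convention and field layout are illustrative assumptions, not MMICL's exact context scheme.

```python
def build_interleaved_prompt(examples, query):
    """Assemble an in-context prompt interleaving image placeholders and text.
    examples: list of (image, question, answer); query: (image, question)."""
    parts, images = [], []
    for img, q, a in examples:
        images.append(img)
        parts.append(f"Image <image_{len(images) - 1}>: {q} Answer: {a}")
    images.append(query[0])
    parts.append(f"Image <image_{len(images) - 1}>: {query[1]} Answer:")
    return "\n".join(parts), images   # text prompt + aligned image list
```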
1 code implementation • ICCV 2023 • Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.
Ranked #7 on 3D Question Answering (3D-QA) on SQA3D
1 code implementation • 3 Feb 2023 • ZiHao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang
We investigate the challenge of task planning for multi-task embodied agents in open-world environments.
2 code implementations • CVPR 2023 • Shaofei Cai, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents.
1 code implementation • 28 Nov 2022 • Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian Ma, Qing Li, Siyuan Huang
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
1 code implementation • 14 Oct 2022 • Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Ranked #1 on Referring Expression on SQA3D
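To make "situated" concrete, an SQA3D-style instance pairs a question with a textual situation plus the agent's pose in the scene, so the answer depends on the egocentric viewpoint. The field names below are illustrative, not the dataset's exact schema.

```python
# An illustrative situated-QA instance (field names are assumptions):
sample = {
    "scene_id": "scene0000_00",
    "situation": "I am standing in front of the sofa, facing the TV.",
    "position": [1.2, 0.4, 0.0],           # agent location in scene coordinates
    "rotation": [0.0, 0.0, 0.707, 0.707],  # agent orientation quaternion
    "question": "What is on the table to my left?",
    "answer": "a laptop",
}
```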
1 code implementation • 30 Jun 2022 • Liangzhe Han, Xiaojian Ma, Leilei Sun, Bowen Du, Yanjie Fu, Weifeng Lv, Hui Xiong
Traffic demand forecasting by deep neural networks has attracted widespread interest in both academia and industry.
2 code implementations • 13 Jun 2022 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu
Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling.
1 code implementation • CVPR 2022 • Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
A significant gap remains between today's visual pattern recognition models and human-level visual cognition, especially when it comes to few-shot learning and compositional reasoning of novel concepts.
Ranked #1 on Few-Shot Image Classification on Bongard-HOI
1 code implementation • ICLR 2022 • Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization.
Ranked #1 on Zero-Shot Human-Object Interaction Detection on HICO
2 code implementations • NeurIPS 2021 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background.
1 code implementation • 10 Jun 2021 • Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei LI
In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.
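The alternation can be summarized in a short skeleton, where `e_step` and `m_step` are hypothetical callables standing in for the paper's option-sampling and policy-update procedures.

```python
def em_option_imitation(expert_trajs, policy, num_iters, e_step, m_step):
    """EM-style loop: the E-step infers latent options for expert trajectories
    under the current policy; the M-step updates the low- and high-level
    policies to reduce the option-occupancy gap. `e_step` and `m_step` are
    placeholders, not the paper's implementation."""
    for _ in range(num_iters):
        options = [e_step(traj, policy) for traj in expert_trajs]  # E-step
        policy = m_step(policy, expert_trajs, options)             # M-step
    return policy
```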
no code implementations • 22 Feb 2021 • Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
Leveraging these concepts, they could understand the internal structure of this task, without seeing all of the problem instances.
1 code implementation • 11 Mar 2020 • Shuang Li, Jiaxi Jiang, Philipp Ruppel, Hongzhuo Liang, Xiaojian Ma, Norman Hendrich, Fuchun Sun, Jianwei Zhang
In this paper, we present a multimodal mobile teleoperation system that consists of a novel vision-based hand pose regression network (Transteleop) and an IMU-based arm tracking method.
1 code implementation • 29 Feb 2020 • Hongzhuo Liang, Chuangchuang Zhou, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Marcus Stoffel, Jianwei Zhang
Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.
no code implementations • 25 Nov 2019 • Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu, Song-Chun Zhu
Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment.
no code implementations • 16 Nov 2019 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu
In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations.
no code implementations • NeurIPS 2019 • Chao Yang, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu, Junzhou Huang, Chuang Gan
This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations.
1 code implementation • 2 Mar 2019 • Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Jianwei Zhang
PouringNet is trained on our collected real-world pouring dataset with multimodal sensing data, which contains more than 3000 recordings of audio, force feedback, video and trajectory data of the human hand that performs the pouring task.
Robotics • Sound • Audio and Speech Processing
4 code implementations • 17 Sep 2018 • Shuang Li, Xiaojian Ma, Hongzhuo Liang, Michael Görner, Philipp Ruppel, Bin Fang, Fuchun Sun, Jianwei Zhang
In this paper, we present TeachNet, a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands.
Robotics
4 code implementations • 17 Sep 2018 • Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, Jianwei Zhang
In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.
Robotics
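A grasp evaluation model of this kind can be pictured as scoring the points inside a candidate gripper closing region with an order-invariant network. This PointNet-like sketch is an illustrative assumption, not the paper's exact evaluation model.

```python
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    """Scores a grasp candidate from points cropped around the grasp pose.
    A PointNet-like sketch with illustrative layer sizes."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, 128), nn.ReLU())
        self.head = nn.Linear(128, 1)

    def forward(self, pts):
        # pts: (B, N, 3) points inside the candidate gripper closing region
        feat = self.point_mlp(pts).max(dim=1).values  # order-invariant pooling
        return self.head(feat).squeeze(-1)            # grasp quality logit
```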
no code implementations • 18 May 2018 • Mingxuan Jing, Xiaojian Ma, Fuchun Sun, Huaping Liu
Learning and inferring movement is a very challenging problem due to its high dimensionality and dependency on varied environments or tasks.
no code implementations • 12 May 2018 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu
The goal of task transfer in reinforcement learning is to migrate an agent's action policy from the source task to the target task.