no code implementations • 27 May 2024 • Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu
To this end, we formulate a learnable energy-based latent space and propose a Noise-intensified Telescoping density-Ratio Estimation (NTRE) scheme for variational learning of an accurate latent space model without costly Markov Chain Monte Carlo.
no code implementations • 19 May 2024 • Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene.
no code implementations • 22 Mar 2024 • Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li
Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems.
no code implementations • 18 Mar 2024 • Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially in capturing the long-term temporal relations in lengthy videos.
no code implementations • 8 Mar 2024 • ZiHao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang
We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while substantially mitigating hallucination.
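The retrieve-then-revise idea described above can be sketched as a generic loop; this is a toy illustration with made-up helper functions (a keyword retriever and a string-splicing reviser), not the paper's actual prompting or retrieval pipeline:

```python
def retrieve(query, corpus):
    # Toy keyword retriever: return documents sharing any word with the query.
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def revise(draft, evidence):
    # Toy reviser: splice in any retrieved evidence not already in the draft.
    new = [e for e in evidence if e not in draft]
    return (draft + " | " + " | ".join(new)) if new else draft

def iterative_revision(draft, corpus, max_steps=3):
    """Generic retrieve-then-revise loop: each round retrieves evidence
    relevant to the current draft, revises the draft with it, and stops
    once a round changes nothing (a fixed point)."""
    for _ in range(max_steps):
        new_draft = revise(draft, retrieve(draft, corpus))
        if new_draft == draft:
            break
        draft = new_draft
    return draft

corpus = ["mine wood first", "craft a table from wood"]
print(iterative_revision("mine wood", corpus))
```

In an actual system, `retrieve` would query a knowledge base and `revise` would re-prompt the language model with the retrieved evidence; the loop structure is the part the snippet describes.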
no code implementations • 18 Dec 2023 • Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li
However, these methods often overlook the potential for continual learning, typically freezing the tools they use and thus limiting their adaptation to environments that require new knowledge.
1 code implementation • 18 Nov 2023 • Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning, and acting.
no code implementations • 10 Nov 2023 • ZiHao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents.
1 code implementation • 16 Oct 2023 • Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang
We also devise a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.
Ranked #1 on Visual Reasoning on Bongard-OpenWorld
no code implementations • 12 Oct 2023 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations.
no code implementations • 18 Sep 2023 • Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao
Large Language Models (LLMs) have the capacity to perform complex scheduling in a multi-agent system and can coordinate these agents to complete sophisticated tasks that require extensive collaboration.
2 code implementations • 14 Sep 2023 • Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
In this paper, we address the limitation above by 1) introducing the vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows the VLM to handle multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the VLM's in-context learning ability; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.
Ranked #16 on Visual Reasoning on Winoground
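A context scheme for multi-modal in-context learning can be illustrated as interleaving per-image placeholder tokens with text demonstrations before the final query; this is a simplified sketch with made-up token names, not MMICL's actual prompt format:

```python
def build_mm_prompt(examples, query_text):
    """Interleave (image, text) in-context demonstrations and a final
    query into one prompt string. '<img_i>' are placeholder tokens a
    VLM would resolve to the i-th image's features."""
    parts = []
    for i, (_image, text) in enumerate(examples):
        parts.append(f"<img_{i}> {text}")
    parts.append(f"<img_{len(examples)}> {query_text}")
    return "\n".join(parts)

demos = [("cat.jpg", "Q: What animal is this? A: a cat"),
         ("dog.jpg", "Q: What animal is this? A: a dog")]
print(build_mm_prompt(demos, "Q: What animal is this? A:"))
```

The key design point the snippet describes is that multiple images and their texts share a single in-context prompt, rather than the model seeing one image at a time.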
1 code implementation • ICCV 2023 • Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.
1 code implementation • 3 Feb 2023 • ZiHao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang
We investigate the challenge of task planning for multi-task embodied agents in open-world environments.
2 code implementations • CVPR 2023 • Shaofei Cai, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents.
1 code implementation • 28 Nov 2022 • Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian Ma, Qing Li, Siyuan Huang
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
1 code implementation • 14 Oct 2022 • Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang
We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).
Ranked #1 on Referring Expression on SQA3D
1 code implementation • 30 Jun 2022 • Liangzhe Han, Xiaojian Ma, Leilei Sun, Bowen Du, Yanjie Fu, Weifeng Lv, Hui Xiong
Traffic demand forecasting by deep neural networks has attracted widespread interest in both academia and industry.
2 code implementations • 13 Jun 2022 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu
Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling.
1 code implementation • CVPR 2022 • Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
A significant gap remains between today's visual pattern recognition models and human-level visual cognition, especially when it comes to few-shot learning and compositional reasoning about novel concepts.
Ranked #1 on Few-Shot Image Classification on Bongard-HOI
1 code implementation • ICLR 2022 • Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization.
Ranked #1 on Zero-Shot Human-Object Interaction Detection on HICO
2 code implementations • NeurIPS 2021 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background.
1 code implementation • 10 Jun 2021 • Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei LI
In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.
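The E/M alternation described above follows the textbook EM pattern: infer latent variables under the current model, then re-fit the model to those inferences. As a rough structural analogy only (this is a toy 1-D two-component Gaussian mixture with unit variance, not the paper's option-occupancy objective), the loop looks like:

```python
import math

def em_mixture(xs, mu=(-1.0, 1.0), n_iters=50):
    """Textbook EM on a two-component 1-D Gaussian mixture with unit
    variance and equal weights. E-step: soft-assign each point to a
    component (analogous to sampling options given the current policy).
    M-step: re-estimate the component means (analogous to updating the
    low-/high-level policies)."""
    mu0, mu1 = mu
    for _ in range(n_iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - mu0) ** 2)
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            r.append(p1 / (p0 + p1))
        # M-step: responsibility-weighted mean updates.
        w1 = sum(r)
        w0 = len(xs) - w1
        mu0 = sum((1 - ri) * x for ri, x in zip(r, xs)) / w0
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / w1
    return mu0, mu1

print(em_mixture([-2.1, -1.9, -2.0, 1.9, 2.1, 2.0]))
```

The two means converge to the two cluster centers near -2 and +2; in the imitation-learning setting, the latent responsibilities play the role of the expert's options and the mean updates play the role of the policy updates.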
no code implementations • 22 Feb 2021 • Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu
Leveraging these concepts, they could understand the internal structure of this task without seeing all of the problem instances.
1 code implementation • 11 Mar 2020 • Shuang Li, Jiaxi Jiang, Philipp Ruppel, Hongzhuo Liang, Xiaojian Ma, Norman Hendrich, Fuchun Sun, Jianwei Zhang
In this paper, we present a multimodal mobile teleoperation system that consists of a novel vision-based hand pose regression network (Transteleop) and an IMU-based arm tracking method.
1 code implementation • 29 Feb 2020 • Hongzhuo Liang, Chuangchuang Zhou, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Marcus Stoffel, Jianwei Zhang
Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.
no code implementations • 25 Nov 2019 • Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu, Song-Chun Zhu
Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment.
no code implementations • 16 Nov 2019 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu
In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations.
no code implementations • NeurIPS 2019 • Chao Yang, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu, Junzhou Huang, Chuang Gan
This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations.
1 code implementation • 2 Mar 2019 • Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Jianwei Zhang
PouringNet is trained on our collected real-world pouring dataset with multimodal sensing data, which contains more than 3,000 recordings of audio, force feedback, video, and trajectory data of the human hand performing the pouring task.
Robotics · Sound · Audio and Speech Processing
4 code implementations • 17 Sep 2018 • Shuang Li, Xiaojian Ma, Hongzhuo Liang, Michael Görner, Philipp Ruppel, Bin Fang, Fuchun Sun, Jianwei Zhang
In this paper, we present TeachNet, a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands.
Robotics
4 code implementations • 17 Sep 2018 • Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, Jianwei Zhang
In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.
Robotics
no code implementations • 18 May 2018 • Mingxuan Jing, Xiaojian Ma, Fuchun Sun, Huaping Liu
Learning and inferring movement is a very challenging problem due to its high dimensionality and its dependence on varied environments and tasks.
no code implementations • 12 May 2018 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu
The goal of task transfer in reinforcement learning is to migrate an agent's action policy from the source task to the target task.