Search Results for author: Xiaojian Ma

Found 34 papers, 20 papers with code

Latent Energy-Based Odyssey: Black-Box Optimization via Expanded Exploration in the Energy-Based Latent Space

no code implementations27 May 2024 Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu

To this end, we formulate an learnable energy-based latent space, and propose Noise-intensified Telescoping density-Ratio Estimation (NTRE) scheme for variational learning of an accurate latent space model without costly Markov Chain Monte Carlo.

Density Ratio Estimation

Unifying 3D Vision-Language Understanding via Promptable Queries

no code implementations19 May 2024 Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene.

Decoder Information Retrieval +2

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

no code implementations22 Mar 2024 Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li

Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, withwide-ranging applications in embodied agents and augmented reality systems.

Scene Understanding Segmentation +2

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

no code implementations18 Mar 2024 Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.

Video Understanding

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

1 code implementation8 Mar 2024 ZiHao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang

We explore how iterative revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while hugely mitigating hallucination.

Code Generation Hallucination +3

CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

no code implementations CVPR 2024 Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li

However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge.

Continual Learning Question Answering +1

An Embodied Generalist Agent in 3D World

1 code implementation18 Nov 2023 Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in 3D world, e. g., 3D grounding, embodied reasoning and acting.

3D dense captioning Question Answering +3

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

no code implementations10 Nov 2023 ZiHao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents.

Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

1 code implementation16 Oct 2023 Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang

We even conceived a neuro-symbolic reasoning approach that reconciles LLMs & VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.

Few-Shot Learning Logical Reasoning +1

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

no code implementations12 Oct 2023 Shaofei Cai, Bowei Zhang, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations.

Decoder Instruction Following

MindAgent: Emergent Gaming Interaction

no code implementations18 Sep 2023 Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao

Large Language Models (LLMs) have the capacity of performing complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration.

In-Context Learning Scheduling

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

2 code implementations14 Sep 2023 Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

In this paper, we address the limitation above by 1) introducing vision-language Model with Multi-Modal In-Context Learning(MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.

Hallucination In-Context Learning +2

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

1 code implementation ICCV 2023 Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.

Dense Captioning Question Answering +3

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

2 code implementations CVPR 2023 Shaofei Cai, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents.

Representation Learning Zero-shot Generalization

SQA3D: Situated Question Answering in 3D Scenes

1 code implementation14 Oct 2022 Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).

Question Answering Referring Expression +1

Continuous-Time and Multi-Level Graph Representation Learning for Origin-Destination Demand Prediction

1 code implementation30 Jun 2022 Liangzhe Han, Xiaojian Ma, Leilei Sun, Bowen Du, Yanjie Fu, Weifeng Lv, Hui Xiong

Traffic demand forecasting by deep neural networks has attracted widespread interest in both academia and industry society.

Graph Representation Learning

Latent Diffusion Energy-Based Model for Interpretable Text Modeling

2 code implementations13 Jun 2022 Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interests in generative modeling.

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions

1 code implementation CVPR 2022 Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts.

Benchmarking Few-Shot Image Classification +5

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

1 code implementation ICLR 2022 Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i. e., systematic generalization.

Human-Object Interaction Detection Object +5

Unsupervised Foreground Extraction via Deep Region Competition

2 code implementations NeurIPS 2021 Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu

Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background.

Image Segmentation Inductive Bias +1

Adversarial Option-Aware Hierarchical Imitation Learning

1 code implementation10 Jun 2021 Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei LI

In particular, we propose an Expectation-Maximization(EM)-style algorithm: an E-step that samples the options of expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.

Imitation Learning

HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving

no code implementations22 Feb 2021 Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu

Leveraging these concepts, they could understand the internal structure of this task, without seeing all of the problem instances.

A Mobile Robot Hand-Arm Teleoperation System by Vision and IMU

1 code implementation11 Mar 2020 Shuang Li, Jiaxi Jiang, Philipp Ruppel, Hongzhuo Liang, Xiaojian Ma, Norman Hendrich, Fuchun Sun, Jianwei Zhang

In this paper, we present a multimodal mobile teleoperation system that consists of a novel vision-based hand pose regression network (Transteleop) and an IMU-based arm tracking method.

Anatomy Image-to-Image Translation +1

Robust Robotic Pouring using Audition and Haptics

1 code implementation29 Feb 2020 Hongzhuo Liang, Chuangchuang Zhou, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Marcus Stoffel, Jianwei Zhang

Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.

Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

no code implementations25 Nov 2019 Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu, Song-Chun Zhu

Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment.

Reinforcement Learning (RL) Transfer Learning

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

no code implementations16 Nov 2019 Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu

In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations.

reinforcement-learning Reinforcement Learning (RL)

Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

1 code implementation2 Mar 2019 Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Jianwei Zhang

PouringNet is trained on our collected real-world pouring dataset with multimodal sensing data, which contains more than 3000 recordings of audio, force feedback, video and trajectory data of the human hand that performs the pouring task.

Robotics Sound Audio and Speech Processing

Vision-based Teleoperation of Shadow Dexterous Hand using End-to-End Deep Neural Network

4 code implementations17 Sep 2018 Shuang Li, Xiaojian Ma, Hongzhuo Liang, Michael Görner, Philipp Ruppel, Bing Fang, Fuchun Sun, Jianwei Zhang

In this paper, we present TeachNet, a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands.


PointNetGPD: Detecting Grasp Configurations from Point Sets

4 code implementations17 Sep 2018 Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, Jianwei Zhang

In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.


Learning and Inferring Movement with Deep Generative Model

no code implementations18 May 2018 Mingxuan Jing, Xiaojian Ma, Fuchun Sun, Huaping Liu

Learning and inference movement is a very challenging problem due to its high dimensionality and dependency to varied environments or tasks.

Motion Planning

Task Transfer by Preference-Based Cost Learning

no code implementations12 May 2018 Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu

The goal of task transfer in reinforcement learning is migrating the action policy of an agent to the target task from the source task.

Cannot find the paper you are looking for? You can Submit a new open access paper.