Search Results for author: Xiaojian Ma

Found 50 papers, 25 papers with code

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

no code implementations • 11 Jun 2025 • Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang

A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation.

FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

no code implementations • 15 May 2025 • Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, Qing Li

This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions.

Robot Manipulation Semantic Similarity +2
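
The interface such a world model exposes is easy to picture. The sketch below is a minimal, hypothetical PyTorch rendering of it, predicting a coarse next RGB-D frame from past frames and actions; the class name, layer sizes, and GRU dynamics are all assumptions for illustration, not the paper's flow-based architecture.

```python
# Minimal sketch (assumptions throughout): encode each past RGB-D frame,
# roll a recurrent latent forward with the action, decode the next frame.
import torch
import torch.nn as nn

class RGBDWorldModel(nn.Module):
    def __init__(self, action_dim=7, hidden=256):
        super().__init__()
        self.hidden = hidden
        self.encoder = nn.Sequential(              # 4 channels: RGB + depth
            nn.Conv2d(4, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.dynamics = nn.GRUCell(hidden + action_dim, hidden)
        self.decoder = nn.Sequential(              # coarse 16x16 prediction
            nn.Linear(hidden, 4 * 16 * 16), nn.Unflatten(1, (4, 16, 16)),
        )

    def forward(self, frames, actions):
        # frames: (B, T, 4, H, W) past RGB-D; actions: (B, T, action_dim)
        h = frames.new_zeros(frames.size(0), self.hidden)
        for t in range(frames.size(1)):
            z = self.encoder(frames[:, t])
            h = self.dynamics(torch.cat([z, actions[:, t]], dim=-1), h)
        return self.decoder(h)                     # predicted next RGB-D frame

pred = RGBDWorldModel()(torch.rand(2, 5, 4, 64, 64), torch.rand(2, 5, 7))
```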

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

1 code implementation • 17 Apr 2025 • Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li

Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks.


JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

no code implementations • 20 Mar 2025 • Muyao Li, ZiHao Wang, Kaichen He, Xiaojian Ma, Yitao Liang

Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks.

Imitation Learning Minecraft +1

LongViTU: Instruction Tuning for Long-Form Video Understanding

no code implementations • 9 Jan 2025 • Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang

Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.).

EgoSchema Form +2

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

no code implementations • 31 Dec 2024 • Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li

This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI.

Robot Manipulation Scene Understanding +1

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

no code implementations • 20 Dec 2024 • Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

The advancement of large language models (LLMs) has prompted the development of multi-modal agents, which serve as controllers that call external tools, providing a feasible way to solve practical tasks.

Language Modeling
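
The controller pattern mentioned above is simple to sketch. Below is a hypothetical minimal loop, assuming a `model` callable that emits JSON tool calls; the tool names and message format are invented for illustration, not the paper's actual protocol.

```python
# Hypothetical controller loop: the model picks a tool, the agent runs it,
# and the observation is appended to the context until an answer is produced.
import json

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),             # toy tool, demo only
    "image_caption": lambda path: f"(caption for {path})",  # stub tool
}

def run_agent(model, task, max_steps=5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # `model` returns JSON such as {"tool": "calculator", "args": ["2+2"]}
        # or {"answer": "..."} when it is done.
        step = json.loads(model("\n".join(history)))
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](*step["args"])
        history.append(f"Observation from {step['tool']}: {result}")
    return "no answer within step budget"
```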

GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents

no code implementations • 7 Dec 2024 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang

Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI.

Instruction Following

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

1 code implementation • CVPR 2025 • Shaofei Cai, ZiHao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2.

Decision Making Minecraft +3
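
The core input construction described in the snippet, concatenating the visual observation with a segmentation mask channel-wise, is easy to sketch. The toy policy below assumes RGB input plus a single-channel mask (e.g., from an off-the-shelf tracker); layer sizes and the action space are illustrative, not ROCKET-1's actual design.

```python
# Toy mask-conditioned policy: channel-concatenate RGB and the object mask,
# then map to action logits. All dimensions are assumptions.
import torch
import torch.nn as nn

class MaskConditionedPolicy(nn.Module):
    def __init__(self, n_actions=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 4, stride=2), nn.ReLU(),  # RGB + mask channel
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_actions),
        )

    def forward(self, rgb, mask):
        # rgb: (B, 3, H, W); mask: (B, 1, H, W), e.g. from a real-time tracker
        return self.net(torch.cat([rgb, mask], dim=1))     # action logits

logits = MaskConditionedPolicy()(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
```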

Multi-modal Situated Reasoning in 3D Scenes

1 code implementation • 4 Sep 2024 • Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang

Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling.

3D Question Answering (3D-QA)

Task-oriented Sequential Grounding in 3D Scenes

no code implementations • 7 Aug 2024 • Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li

Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence.

3D visual grounding

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

1 code implementation • 7 Jul 2024 • Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang

This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing.

Diversity +1

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

no code implementations • 27 Jun 2024 • ZiHao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang

First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens.

Decoder Imitation Learning +4
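
One standard way to turn continuous trajectory embeddings into the discrete behavior tokens the snippet mentions is vector quantization: snap each embedding to its nearest codebook entry. The sketch below shows only that generic lookup; it makes no claim about OmniJARVIS's actual tokenizer.

```python
# Generic VQ step: each per-step embedding becomes the index of its nearest
# codebook vector, yielding a discrete token sequence for the trajectory.
import torch

def quantize(latents, codebook):
    # latents: (T, D) per-step trajectory embeddings; codebook: (K, D)
    dists = torch.cdist(latents, codebook)   # (T, K) pairwise distances
    tokens = dists.argmin(dim=1)             # nearest-code index per step
    return tokens, codebook[tokens]          # discrete tokens + quantized latents

tokens, z_q = quantize(torch.randn(8, 16), torch.randn(512, 16))
```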

Latent Energy-Based Odyssey: Black-Box Optimization via Expanded Exploration in the Energy-Based Latent Space

no code implementations • 27 May 2024 • Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu

To this end, we formulate a learnable energy-based latent space and propose a Noise-intensified Telescoping density-Ratio Estimation (NTRE) scheme for variational learning of an accurate latent-space model without costly Markov Chain Monte Carlo.

Density Ratio Estimation
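
The snippet names NTRE but gives no details, so the sketch below shows only the textbook idea it builds on: density-ratio estimation reduced to binary classification, where a balanced classifier's logit approximates log p(x)/q(x). Network size and optimizer settings are arbitrary.

```python
# Classic density-ratio-by-classification: train a classifier to separate
# samples of p from samples of q; its logit estimates log p(x) - log q(x).
import torch
import torch.nn as nn

def fit_log_ratio(x_p, x_q, steps=200):
    clf = nn.Sequential(nn.Linear(x_p.size(1), 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
    x = torch.cat([x_p, x_q])
    y = torch.cat([torch.ones(len(x_p), 1), torch.zeros(len(x_q), 1)])
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.binary_cross_entropy_with_logits(clf(x), y).backward()
        opt.step()
    return clf  # clf(x) ~ log p(x)/q(x) when the two sample sets are balanced

log_ratio = fit_log_ratio(torch.randn(256, 2) + 1.0, torch.randn(256, 2))
```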

Unifying 3D Vision-Language Understanding via Promptable Queries

no code implementations • 19 May 2024 • Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene.

3D Question Answering (3D-QA) Decoder +3

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

no code implementations • 22 Mar 2024 • Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li

Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, which is based on spatial relationships and needs no additional training.

Instance Segmentation Object Localization +4
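
A minimal version of the training-free projection described above: push each Gaussian center through the camera and read off the 2D feature at the resulting pixel. This pinhole sketch omits visibility and occlusion handling and is an assumption about the general idea, not the paper's exact procedure.

```python
# Project 3D Gaussian centers into an image and inherit per-pixel 2D features.
import torch

def project_features(centers, K, Rt, feat_map):
    # centers: (N, 3) Gaussian centers; K: (3, 3) intrinsics; Rt: (3, 4)
    # extrinsics; feat_map: (C, H, W) features from a pre-trained 2D encoder.
    homo = torch.cat([centers, torch.ones(len(centers), 1)], dim=1)  # (N, 4)
    uv = K @ (Rt @ homo.T)                    # perspective projection, (3, N)
    uv = (uv[:2] / uv[2]).round().long()      # pixel coordinates
    C, H, W = feat_map.shape
    u = uv[0].clamp(0, W - 1)
    v = uv[1].clamp(0, H - 1)
    return feat_map[:, v, u].T                # (N, C) per-Gaussian semantics
```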

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

no code implementations • 18 Mar 2024 • Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.

EgoSchema Video Understanding
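
The unified memory mechanism itself is not specified in the snippet; a generic stand-in is an embedding store with top-k similarity retrieval, sketched below. Class and method names are invented for illustration.

```python
# Generic segment memory: store normalized embeddings with captions, retrieve
# the k segments most similar to a query embedding.
import numpy as np

class VideoMemory:
    def __init__(self):
        self.embeddings, self.captions = [], []

    def add(self, embedding, caption):
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.captions.append(caption)

    def retrieve(self, query, k=3):
        sims = np.stack(self.embeddings) @ (query / np.linalg.norm(query))
        return [self.captions[i] for i in np.argsort(-sims)[:k]]

mem = VideoMemory()
mem.add(np.random.rand(64), "person opens the fridge")
mem.add(np.random.rand(64), "person pours water")
print(mem.retrieve(np.random.rand(64), k=1))
```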

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

1 code implementation • 8 Mar 2024 • ZiHao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang

We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while greatly mitigating hallucination.

Code Generation Hallucination +4
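
The revise-with-retrieval loop reads naturally as pseudocode. The sketch below assumes generic `llm` and `retrieve` callables and a newline-separated thought draft; the prompts are invented, and RAT's actual prompting scheme may differ.

```python
# Draft a chain of thoughts, then revise each step with retrieved evidence
# before producing the final answer.
def rat(llm, retrieve, task):
    thoughts = llm(f"Draft step-by-step thoughts for: {task}").split("\n")
    revised = []
    for step in thoughts:
        docs = retrieve(step)  # fetch passages relevant to this single step
        revised.append(llm(
            f"Task: {task}\nRevised so far: {' '.join(revised)}\n"
            f"Rewrite the next step using the evidence.\n"
            f"Step: {step}\nEvidence: {docs}"
        ))
    return llm(f"Answer the task from these revised thoughts: {revised}")
```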

CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

no code implementations • CVPR 2024 • Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li

However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge.

Continual Learning Question Answering +1

An Embodied Generalist Agent in 3D World

1 code implementation • 18 Nov 2023 • Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning and acting.

3D dense captioning 3D Question Answering (3D-QA) +4

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

no code implementations • 10 Nov 2023 • ZiHao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents.

Minecraft

Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

1 code implementation • 16 Oct 2023 • Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang

We even conceived a neuro-symbolic reasoning approach that reconciles LLMs & VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.

Few-Shot Learning Form +2

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

no code implementations • 12 Oct 2023 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations.

Decoder Instruction Following +1

Learning Energy-Based Prior Model with Diffusion-Amortized MCMC

1 code implementation • NeurIPS 2023 • Peiyu Yu, Yaxuan Zhu, Sirui Xie, Xiaojian Ma, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu

To remedy this sampling issue, in this paper we introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it.

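For context on what is being amortized: sampling a latent-space EBM is typically done with unadjusted Langevin dynamics, sketched below for any differentiable scalar `energy`. Step size and iteration count are arbitrary choices, and this is the baseline sampler rather than the paper's diffusion-amortized method.

```python
# Vanilla (unadjusted) Langevin dynamics over latents z:
#   z <- z - (s^2 / 2) * dE/dz + s * noise
import torch

def langevin_sample(energy, z0, n_steps=100, step=0.1):
    z = z0.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(z).sum(), z)[0]
        z = z - 0.5 * step ** 2 * grad + step * torch.randn_like(z)
        z = z.detach().requires_grad_(True)
    return z.detach()

# Example: samples from a standard Gaussian, whose energy is 0.5 * ||z||^2.
z = langevin_sample(lambda z: 0.5 * (z ** 2).sum(dim=-1), torch.randn(16, 8))
```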

MindAgent: Emergent Gaming Interaction

no code implementations • 18 Sep 2023 • Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao

Large Language Models (LLMs) have the capacity to perform complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration.

In-Context Learning Minecraft +1

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

2 code implementations • 14 Sep 2023 • Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

In this paper, we address the limitation above by 1) introducing MMICL, a vision-language model with multi-modal in-context learning, a new approach that allows the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.

Hallucination In-Context Learning +4

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

1 code implementation • ICCV 2023 • Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.

3D Question Answering (3D-QA) Dense Captioning +4

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

2 code implementations • CVPR 2023 • Shaofei Cai, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents.

Diversity Minecraft +2

SQA3D: Situated Question Answering in 3D Scenes

1 code implementation • 14 Oct 2022 • Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).

Question Answering Referring Expression +1

Latent Diffusion Energy-Based Model for Interpretable Text Modeling

2 code implementations • 13 Jun 2022 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling.

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions

1 code implementation • CVPR 2022 • Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts.

Benchmarking Few-Shot Image Classification +5

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

1 code implementation • ICLR 2022 • Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization.

Human-Object Interaction Detection Object +5

Unsupervised Foreground Extraction via Deep Region Competition

2 code implementations • NeurIPS 2021 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu

Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background.

Image Segmentation Inductive Bias +2

Adversarial Option-Aware Hierarchical Imitation Learning

1 code implementation • 10 Jun 2021 • Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei Li

In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.

Imitation Learning
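
The EM structure described above can be laid out schematically as below; every callable is a placeholder, and no claim is made about the paper's actual objective or its option-occupancy measurement.

```python
# Schematic EM loop for option-aware imitation: E-step infers expert options
# under the current policies, M-step updates both policy levels jointly.
def option_em(expert_demos, sample_options, update_policies, high, low, iters=50):
    for _ in range(iters):
        # E-step: sample latent options for each expert trajectory,
        # conditioned on the current high- and low-level policies.
        options = [sample_options(traj, high, low) for traj in expert_demos]
        # M-step: update both levels simultaneously to better match the
        # expert under the inferred options.
        high, low = update_policies(expert_demos, options, high, low)
    return high, low
```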

HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving

no code implementations • 22 Feb 2021 • Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu

Leveraging these concepts, they could understand the internal structure of this task, without seeing all of the problem instances.

A Mobile Robot Hand-Arm Teleoperation System by Vision and IMU

1 code implementation • 11 Mar 2020 • Shuang Li, Jiaxi Jiang, Philipp Ruppel, Hongzhuo Liang, Xiaojian Ma, Norman Hendrich, Fuchun Sun, Jianwei Zhang

In this paper, we present a multimodal mobile teleoperation system that consists of a novel vision-based hand pose regression network (Transteleop) and an IMU-based arm tracking method.

Anatomy Image-to-Image Translation +1

Robust Robotic Pouring using Audition and Haptics

1 code implementation • 29 Feb 2020 • Hongzhuo Liang, Chuangchuang Zhou, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Marcus Stoffel, Jianwei Zhang

Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.

Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

no code implementations • 25 Nov 2019 • Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu, Song-Chun Zhu

Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment.

Reinforcement Learning Reinforcement Learning (RL) +1

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

no code implementations • 16 Nov 2019 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu

In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations.

reinforcement-learning Reinforcement Learning +1

Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

1 code implementation • 2 Mar 2019 • Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Jianwei Zhang

PouringNet is trained on our collected real-world pouring dataset with multimodal sensing data, which contains more than 3000 recordings of audio, force feedback, video and trajectory data of the human hand that performs the pouring task.

Robotics Sound Audio and Speech Processing

Vision-based Teleoperation of Shadow Dexterous Hand using End-to-End Deep Neural Network

4 code implementations • 17 Sep 2018 • Shuang Li, Xiaojian Ma, Hongzhuo Liang, Michael Görner, Philipp Ruppel, Bin Fang, Fuchun Sun, Jianwei Zhang

In this paper, we present TeachNet, a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands.

Robotics

PointNetGPD: Detecting Grasp Configurations from Point Sets

4 code implementations • 17 Sep 2018 • Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, Jianwei Zhang

In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.

Robotics
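
Scoring a grasp directly from points suggests a PointNet-style scorer: a shared per-point MLP followed by order-invariant pooling. The toy network below is an assumption about that general recipe, with illustrative sizes, not PointNetGPD's actual architecture.

```python
# Toy grasp scorer over the points inside a candidate gripper closing region.
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
        self.head = nn.Linear(128, 1)

    def forward(self, points):
        # points: (B, N, 3) in the gripper frame
        per_point = self.point_mlp(points)          # (B, N, 128)
        global_feat = per_point.max(dim=1).values   # permutation-invariant pool
        return self.head(global_feat)               # grasp-quality logit

score = GraspScorer()(torch.rand(4, 256, 3))
```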

Learning and Inferring Movement with Deep Generative Model

no code implementations • 18 May 2018 • Mingxuan Jing, Xiaojian Ma, Fuchun Sun, Huaping Liu

Learning and inferring movement is a very challenging problem due to its high dimensionality and dependence on varied environments or tasks.

Motion Planning

Task Transfer by Preference-Based Cost Learning

no code implementations • 12 May 2018 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu

The goal of task transfer in reinforcement learning is to migrate an agent's action policy from the source task to the target task.

Reinforcement Learning
