Search Results for author: Xiaojian Ma

Found 50 papers, 25 papers with code

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

no code implementations • 11 Jun 2025 • Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, Wenxin Tan, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang

A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation.

FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

no code implementations • 15 May 2025 • Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, Qing Li

This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions.

Robot Manipulation Semantic Similarity +2
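
The interface such a world model exposes is easy to picture. The sketch below is a minimal, hypothetical PyTorch rendering of it, predicting a coarse next RGB-D frame from past frames and actions; the class name, layer sizes, and GRU dynamics are all assumptions for illustration, not the paper's flow-based architecture.

```python
# Minimal sketch (assumptions throughout): encode each past RGB-D frame,
# roll a recurrent latent forward with the action, decode the next frame.
import torch
import torch.nn as nn

class RGBDWorldModel(nn.Module):
    def __init__(self, action_dim=7, hidden=256):
        super().__init__()
        self.hidden = hidden
        self.encoder = nn.Sequential(              # 4 channels: RGB + depth
            nn.Conv2d(4, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.dynamics = nn.GRUCell(hidden + action_dim, hidden)
        self.decoder = nn.Sequential(              # coarse 16x16 prediction
            nn.Linear(hidden, 4 * 16 * 16), nn.Unflatten(1, (4, 16, 16)),
        )

    def forward(self, frames, actions):
        # frames: (B, T, 4, H, W) past RGB-D; actions: (B, T, action_dim)
        h = frames.new_zeros(frames.size(0), self.hidden)
        for t in range(frames.size(1)):
            z = self.encoder(frames[:, t])
            h = self.dynamics(torch.cat([z, actions[:, t]], dim=-1), h)
        return self.decoder(h)                     # predicted next RGB-D frame

pred = RGBDWorldModel()(torch.rand(2, 5, 4, 64, 64), torch.rand(2, 5, 7))
```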

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

1 code implementation • 17 Apr 2025 • Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li

Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks.


JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

no code implementations • 20 Mar 2025 • Muyao Li, ZiHao Wang, Kaichen He, Xiaojian Ma, Yitao Liang

Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks.

Imitation Learning Minecraft +1

LongViTU: Instruction Tuning for Long-Form Video Understanding

no code implementations • 9 Jan 2025 • Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang

Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.).

EgoSchema Form +2

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

no code implementations • 31 Dec 2024 • Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li

This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI.

Robot Manipulation Scene Understanding +1

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

no code implementations • 20 Dec 2024 • Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

The advancement of large language models (LLMs) has prompted the development of multi-modal agents, which serve as controllers that call external tools, providing a feasible way to solve practical tasks.

Language Modeling
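
The controller pattern mentioned above is simple to sketch. Below is a hypothetical minimal loop, assuming a `model` callable that emits JSON tool calls; the tool names and message format are invented for illustration, not the paper's actual protocol.

```python
# Hypothetical controller loop: the model picks a tool, the agent runs it,
# and the observation is appended to the context until an answer is produced.
import json

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),             # toy tool, demo only
    "image_caption": lambda path: f"(caption for {path})",  # stub tool
}

def run_agent(model, task, max_steps=5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # `model` returns JSON such as {"tool": "calculator", "args": ["2+2"]}
        # or {"answer": "..."} when it is done.
        step = json.loads(model("\n".join(history)))
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](*step["args"])
        history.append(f"Observation from {step['tool']}: {result}")
    return "no answer within step budget"
```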

GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents

no code implementations • 7 Dec 2024 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang

Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI.

Instruction Following

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

1 code implementation • CVPR 2025 • Shaofei Cai, ZiHao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2.

Decision Making Minecraft +3
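
The core input construction described in the snippet, concatenating the visual observation with a segmentation mask channel-wise, is easy to sketch. The toy policy below assumes RGB input plus a single-channel mask (e.g., from an off-the-shelf tracker); layer sizes and the action space are illustrative, not ROCKET-1's actual design.

```python
# Toy mask-conditioned policy: channel-concatenate RGB and the object mask,
# then map to action logits. All dimensions are assumptions.
import torch
import torch.nn as nn

class MaskConditionedPolicy(nn.Module):
    def __init__(self, n_actions=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 4, stride=2), nn.ReLU(),  # RGB + mask channel
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_actions),
        )

    def forward(self, rgb, mask):
        # rgb: (B, 3, H, W); mask: (B, 1, H, W), e.g. from a real-time tracker
        return self.net(torch.cat([rgb, mask], dim=1))     # action logits

logits = MaskConditionedPolicy()(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
```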

Multi-modal Situated Reasoning in 3D Scenes

1 code implementation • 4 Sep 2024 • Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang

Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling.

3D Question Answering (3D-QA)

Task-oriented Sequential Grounding in 3D Scenes

no code implementations • 7 Aug 2024 • Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, Qing Li

Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence.

3D visual grounding

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

1 code implementation • 7 Jul 2024 • Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang

This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing.

Diversity +1

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

no code implementations • 27 Jun 2024 • ZiHao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang

First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens.

Decoder Imitation Learning +4
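
One standard way to turn continuous trajectory embeddings into the discrete behavior tokens the snippet mentions is vector quantization: snap each embedding to its nearest codebook entry. The sketch below shows only that generic lookup; it makes no claim about OmniJARVIS's actual tokenizer.

```python
# Generic VQ step: each per-step embedding becomes the index of its nearest
# codebook vector, yielding a discrete token sequence for the trajectory.
import torch

def quantize(latents, codebook):
    # latents: (T, D) per-step trajectory embeddings; codebook: (K, D)
    dists = torch.cdist(latents, codebook)   # (T, K) pairwise distances
    tokens = dists.argmin(dim=1)             # nearest-code index per step
    return tokens, codebook[tokens]          # discrete tokens + quantized latents

tokens, z_q = quantize(torch.randn(8, 16), torch.randn(512, 16))
```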

Latent Energy-Based Odyssey: Black-Box Optimization via Expanded Exploration in the Energy-Based Latent Space

no code implementations • 27 May 2024 • Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu

To this end, we formulate a learnable energy-based latent space and propose a Noise-intensified Telescoping density-Ratio Estimation (NTRE) scheme for variational learning of an accurate latent-space model without costly Markov Chain Monte Carlo.

Density Ratio Estimation
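
The snippet names NTRE but gives no details, so the sketch below shows only the textbook idea it builds on: density-ratio estimation reduced to binary classification, where a balanced classifier's logit approximates log p(x)/q(x). Network size and optimizer settings are arbitrary.

```python
# Classic density-ratio-by-classification: train a classifier to separate
# samples of p from samples of q; its logit estimates log p(x) - log q(x).
import torch
import torch.nn as nn

def fit_log_ratio(x_p, x_q, steps=200):
    clf = nn.Sequential(nn.Linear(x_p.size(1), 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
    x = torch.cat([x_p, x_q])
    y = torch.cat([torch.ones(len(x_p), 1), torch.zeros(len(x_q), 1)])
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.binary_cross_entropy_with_logits(clf(x), y).backward()
        opt.step()
    return clf  # clf(x) ~ log p(x)/q(x) when the two sample sets are balanced

log_ratio = fit_log_ratio(torch.randn(256, 2) + 1.0, torch.randn(256, 2))
```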

Unifying 3D Vision-Language Understanding via Promptable Queries

no code implementations • 19 May 2024 • Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene.

3D Question Answering (3D-QA) Decoder +3

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

no code implementations • 22 Mar 2024 • Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, Qing Li

Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, which is based on spatial relationships and needs no additional training.

Instance Segmentation Object Localization +4
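
A minimal version of the training-free projection described above: push each Gaussian center through the camera and read off the 2D feature at the resulting pixel. This pinhole sketch omits visibility and occlusion handling and is an assumption about the general idea, not the paper's exact procedure.

```python
# Project 3D Gaussian centers into an image and inherit per-pixel 2D features.
import torch

def project_features(centers, K, Rt, feat_map):
    # centers: (N, 3) Gaussian centers; K: (3, 3) intrinsics; Rt: (3, 4)
    # extrinsics; feat_map: (C, H, W) features from a pre-trained 2D encoder.
    homo = torch.cat([centers, torch.ones(len(centers), 1)], dim=1)  # (N, 4)
    uv = K @ (Rt @ homo.T)                    # perspective projection, (3, N)
    uv = (uv[:2] / uv[2]).round().long()      # pixel coordinates
    C, H, W = feat_map.shape
    u = uv[0].clamp(0, W - 1)
    v = uv[1].clamp(0, H - 1)
    return feat_map[:, v, u].T                # (N, C) per-Gaussian semantics
```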

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

no code implementations • 18 Mar 2024 • Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos.

EgoSchema Video Understanding
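
The unified memory mechanism itself is not specified in the snippet; a generic stand-in is an embedding store with top-k similarity retrieval, sketched below. Class and method names are invented for illustration.

```python
# Generic segment memory: store normalized embeddings with captions, retrieve
# the k segments most similar to a query embedding.
import numpy as np

class VideoMemory:
    def __init__(self):
        self.embeddings, self.captions = [], []

    def add(self, embedding, caption):
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.captions.append(caption)

    def retrieve(self, query, k=3):
        sims = np.stack(self.embeddings) @ (query / np.linalg.norm(query))
        return [self.captions[i] for i in np.argsort(-sims)[:k]]

mem = VideoMemory()
mem.add(np.random.rand(64), "person opens the fridge")
mem.add(np.random.rand(64), "person pours water")
print(mem.retrieve(np.random.rand(64), k=1))
```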

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

1 code implementation • 8 Mar 2024 • ZiHao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, Yitao Liang

We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while greatly mitigating hallucination.

Code Generation Hallucination +4
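
The revise-with-retrieval loop reads naturally as pseudocode. The sketch below assumes generic `llm` and `retrieve` callables and a newline-separated thought draft; the prompts are invented, and RAT's actual prompting scheme may differ.

```python
# Draft a chain of thoughts, then revise each step with retrieved evidence
# before producing the final answer.
def rat(llm, retrieve, task):
    thoughts = llm(f"Draft step-by-step thoughts for: {task}").split("\n")
    revised = []
    for step in thoughts:
        docs = retrieve(step)  # fetch passages relevant to this single step
        revised.append(llm(
            f"Task: {task}\nRevised so far: {' '.join(revised)}\n"
            f"Rewrite the next step using the evidence.\n"
            f"Step: {step}\nEvidence: {docs}"
        ))
    return llm(f"Answer the task from these revised thoughts: {revised}")
```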

CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

no code implementations • CVPR 2024 • Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li

However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge.

Continual Learning Question Answering +1

An Embodied Generalist Agent in 3D World

1 code implementation • 18 Nov 2023 • Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning and acting.

3D dense captioning 3D Question Answering (3D-QA) +4

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

no code implementations • 10 Nov 2023 • ZiHao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents.

Minecraft

Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

1 code implementation • 16 Oct 2023 • Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang

We even conceived a neuro-symbolic reasoning approach that reconciles LLMs & VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems.

Few-Shot Learning Form +2

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

no code implementations • 12 Oct 2023 • Shaofei Cai, Bowei Zhang, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations.

Decoder Instruction Following +1

Learning Energy-Based Prior Model with Diffusion-Amortized MCMC

1 code implementation • NeurIPS 2023 • Peiyu Yu, Yaxuan Zhu, Sirui Xie, Xiaojian Ma, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu

To remedy this sampling issue, in this paper we introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it.

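For context on what is being amortized: sampling a latent-space EBM is typically done with unadjusted Langevin dynamics, sketched below for any differentiable scalar `energy`. Step size and iteration count are arbitrary choices, and this is the baseline sampler rather than the paper's diffusion-amortized method.

```python
# Vanilla (unadjusted) Langevin dynamics over latents z:
#   z <- z - (s^2 / 2) * dE/dz + s * noise
import torch

def langevin_sample(energy, z0, n_steps=100, step=0.1):
    z = z0.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(z).sum(), z)[0]
        z = z - 0.5 * step ** 2 * grad + step * torch.randn_like(z)
        z = z.detach().requires_grad_(True)
    return z.detach()

# Example: samples from a standard Gaussian, whose energy is 0.5 * ||z||^2.
z = langevin_sample(lambda z: 0.5 * (z ** 2).sum(dim=-1), torch.randn(16, 8))
```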

MindAgent: Emergent Gaming Interaction

no code implementations • 18 Sep 2023 • Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao

Large Language Models (LLMs) have the capacity to perform complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration.

In-Context Learning Minecraft +1

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

2 code implementations • 14 Sep 2023 • Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

In this paper, we address the limitation above by 1) introducing MMICL, a vision-language model with multi-modal in-context learning, a new approach that allows the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts.

Hallucination In-Context Learning +4

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

1 code implementation • ICCV 2023 • Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence.

3D Question Answering (3D-QA) Dense Captioning +4

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

2 code implementations • CVPR 2023 • Shaofei Cai, ZiHao Wang, Xiaojian Ma, Anji Liu, Yitao Liang

We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents.

Diversity Minecraft +2

SQA3D: Situated Question Answering in 3D Scenes

1 code implementation • 14 Oct 2022 • Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang

We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D).

Question Answering Referring Expression +1

Latent Diffusion Energy-Based Model for Interpretable Text Modeling

2 code implementations • 13 Jun 2022 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in generative modeling.

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions

1 code implementation • CVPR 2022 • Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

A significant gap remains between today's visual pattern recognition models and human-level visual cognition especially when it comes to few-shot learning and compositional reasoning of novel concepts.

Benchmarking Few-Shot Image Classification +5

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

1 code implementation • ICLR 2022 • Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization.

Human-Object Interaction Detection Object +5

Unsupervised Foreground Extraction via Deep Region Competition

2 code implementations • NeurIPS 2021 • Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu

Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background.

Image Segmentation Inductive Bias +2

Adversarial Option-Aware Hierarchical Imitation Learning

1 code implementation • 10 Jun 2021 • Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei Li

In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.

Imitation Learning
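
The EM structure described above can be laid out schematically as below; every callable is a placeholder, and no claim is made about the paper's actual objective or its option-occupancy measurement.

```python
# Schematic EM loop for option-aware imitation: E-step infers expert options
# under the current policies, M-step updates both policy levels jointly.
def option_em(expert_demos, sample_options, update_policies, high, low, iters=50):
    for _ in range(iters):
        # E-step: sample latent options for each expert trajectory,
        # conditioned on the current high- and low-level policies.
        options = [sample_options(traj, high, low) for traj in expert_demos]
        # M-step: update both levels simultaneously to better match the
        # expert under the inferred options.
        high, low = update_policies(expert_demos, options, high, low)
    return high, low
```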

HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving

no code implementations • 22 Feb 2021 • Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu

Leveraging these concepts, they could understand the internal structure of this task, without seeing all of the problem instances.

A Mobile Robot Hand-Arm Teleoperation System by Vision and IMU

1 code implementation • 11 Mar 2020 • Shuang Li, Jiaxi Jiang, Philipp Ruppel, Hongzhuo Liang, Xiaojian Ma, Norman Hendrich, Fuchun Sun, Jianwei Zhang

In this paper, we present a multimodal mobile teleoperation system that consists of a novel vision-based hand pose regression network (Transteleop) and an IMU-based arm tracking method.

Anatomy Image-to-Image Translation +1

Robust Robotic Pouring using Audition and Haptics

1 code implementation • 29 Feb 2020 • Hongzhuo Liang, Chuangchuang Zhou, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Marcus Stoffel, Jianwei Zhang

Both network training results and robot experiments demonstrate that MP-Net is robust against noise and changes to the task and environment.

Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

no code implementations • 25 Nov 2019 • Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu, Song-Chun Zhu

Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment.

Reinforcement Learning Reinforcement Learning (RL) +1

Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance

no code implementations • 16 Nov 2019 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu

In this paper, we study Reinforcement Learning from Demonstrations (RLfD) that improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations.

reinforcement-learning Reinforcement Learning +1

Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

1 code implementation • 2 Mar 2019 • Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Jianwei Zhang

PouringNet is trained on our collected real-world pouring dataset with multimodal sensing data, which contains more than 3000 recordings of audio, force feedback, video and trajectory data of the human hand that performs the pouring task.

Robotics Sound Audio and Speech Processing

Vision-based Teleoperation of Shadow Dexterous Hand using End-to-End Deep Neural Network

4 code implementations • 17 Sep 2018 • Shuang Li, Xiaojian Ma, Hongzhuo Liang, Michael Görner, Philipp Ruppel, Bin Fang, Fuchun Sun, Jianwei Zhang

In this paper, we present TeachNet, a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands.

Robotics

PointNetGPD: Detecting Grasp Configurations from Point Sets

4 code implementations • 17 Sep 2018 • Hongzhuo Liang, Xiaojian Ma, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun, Jianwei Zhang

In this paper, we propose an end-to-end grasp evaluation model to address the challenging problem of localizing robot grasp configurations directly from the point cloud.

Robotics
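
Scoring a grasp directly from points suggests a PointNet-style scorer: a shared per-point MLP followed by order-invariant pooling. The toy network below is an assumption about that general recipe, with illustrative sizes, not PointNetGPD's actual architecture.

```python
# Toy grasp scorer over the points inside a candidate gripper closing region.
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
        self.head = nn.Linear(128, 1)

    def forward(self, points):
        # points: (B, N, 3) in the gripper frame
        per_point = self.point_mlp(points)          # (B, N, 128)
        global_feat = per_point.max(dim=1).values   # permutation-invariant pool
        return self.head(global_feat)               # grasp-quality logit

score = GraspScorer()(torch.rand(4, 256, 3))
```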

Learning and Inferring Movement with Deep Generative Model

no code implementations • 18 May 2018 • Mingxuan Jing, Xiaojian Ma, Fuchun Sun, Huaping Liu

Learning and inferring movement is a very challenging problem due to its high dimensionality and dependence on varied environments or tasks.

Motion Planning

Task Transfer by Preference-Based Cost Learning

no code implementations • 12 May 2018 • Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Huaping Liu

The goal of task transfer in reinforcement learning is to migrate an agent's action policy from the source task to the target task.

Reinforcement Learning
