Search Results for author: Xiaolong Wang

Found 226 papers, 85 papers with code

Test-Time Training for Generalization under Distribution Shifts

no code implementations ICML 2020 Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, Moritz Hardt

We introduce a general approach, called test-time training, for improving the performance of predictive models when training and test data come from different distributions.

Image Classification Self-Supervised Learning
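The test-time training recipe in this entry, adapting the model on each test input with a self-supervised objective before predicting, can be sketched in a few lines. Everything below (the linear feature extractor, the reconstruction task, the learning rate) is an illustrative toy, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4
W = rng.normal(size=(k, d)) * 0.1   # shared feature extractor (toy)
V = rng.normal(size=(1, k)) * 0.1   # main supervised head (toy)
U = rng.normal(size=(d, k)) * 0.1   # self-supervised reconstruction head (toy)

def predict(x, W):
    """Main-task prediction through the shared features."""
    return float(V @ (W @ x))

def ttt_step(x, W, lr=0.01):
    """One inner gradient step on the self-supervised loss ||U W x - x||^2,
    taken on the test instance itself before predicting."""
    h = W @ x
    r = U @ h - x                          # reconstruction residual
    grad_W = 2.0 * np.outer(U.T @ r, x)    # d/dW of the squared residual
    return W - lr * grad_W

x_test = rng.normal(size=d)
W_adapted = ttt_step(x_test, W)     # adapt the shared features on the test input
y_hat = predict(x_test, W_adapted)  # then make the main-task prediction
```

One small gradient step on the self-supervised task reduces the reconstruction loss on the test input, which is the signal test-time training exploits under distribution shift.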

LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models

no code implementations 4 Feb 2025 Yuto Kojima, Jiarui Xu, Xueyan Zou, Xiaolong Wang

The rapid advancements in vision-language models (VLMs), such as CLIP, have intensified the need to address distribution shifts between training and testing datasets.

Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation

no code implementations 30 Jan 2025 Yuelei Li, Ge Yan, Annabella Macaluso, Mazeyu Ji, Xueyan Zou, Xiaolong Wang

In aligning high-level and low-level control for robot actions, language embeddings representing the high-level policy are jointly attended with the 3D feature field in the 3D transformer for seamless integration.

Memorization Scene Understanding +1

Diffusion Autoencoders are Scalable Image Tokenizers

no code implementations 30 Jan 2025 Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra

Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers.

Image Generation Image Reconstruction
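The "single learning objective" mentioned above is the standard noise-prediction L2 loss of diffusion training. A minimal sketch, with a placeholder linear "denoiser" standing in for the real network:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_l2_loss(x, denoiser, alpha_bar):
    """Standard diffusion training loss: corrupt x with Gaussian noise at
    level alpha_bar, then penalize the L2 error of the predicted noise."""
    eps = rng.normal(size=x.shape)                        # sampled target noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(x_t, alpha_bar)                    # network's noise estimate
    return float(np.mean((eps_hat - eps) ** 2))

# Placeholder denoiser (illustrative only): just rescales its input.
x = rng.normal(size=16)
loss = diffusion_l2_loss(x, lambda z, a: 0.5 * z, alpha_bar=0.7)
```

The point of the entry is that this one objective, applied to the tokenizer's latents, is enough to train the tokenizer at scale; the sketch only shows the loss itself.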

Parallel Sequence Modeling via Generalized Spatial Propagation Network

no code implementations 21 Jan 2025 Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu

We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures.

16k Computational Efficiency +3

Perspective Transition of Large Language Models for Solving Subjective Tasks

no code implementations 16 Jan 2025 Xiaolong Wang, Yuanchi Zhang, Ziyue Wang, Yuzhuang Xu, Fuwen Luo, Yile Wang, Peng Li, Yang Liu

Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks.

In-Context Learning Question Answering

Consistent Flow Distillation for Text-to-3D Generation

no code implementations 9 Jan 2025 Runjie Yan, Yinbo Chen, Xiaolong Wang

To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient.

3D Generation Diversity +1

EditAR: Unified Conditional Generation with Autoregressive Models

no code implementations 8 Jan 2025 Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang

In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image.

Conditional Image Generation Image Segmentation +1

HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

no code implementations 17 Dec 2024 Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj

How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language?

Prediction Trajectory Prediction +1

ExBody2: Advanced Expressive Humanoid Whole-Body Control

no code implementations 17 Dec 2024 Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, Xiaolong Wang

This paper enables real-world humanoid robots to maintain stability while performing expressive motions like humans do.

Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control

no code implementations 10 Dec 2024 Chenhao Lu, Xuxin Cheng, Jialong Li, Shiqi Yang, Mazeyu Ji, Chengjing Yuan, Ge Yang, Sha Yi, Xiaolong Wang

The locomotion policy is trained conditioned on this upper-body motion representation, ensuring that the system remains robust with both manipulation and locomotion.

motion retargeting Reinforcement Learning (RL)

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

no code implementations 5 Dec 2024 An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Xueyan Zou, Jan Kautz, Erdem Biyik, Hongxu Yin, Sifei Liu, Xiaolong Wang

This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to command but also allows the robot to navigate through more challenging and cluttered scenes.

Navigate Vision and Language Navigation

WildLMa: Long Horizon Loco-Manipulation in the Wild

no code implementations 22 Nov 2024 Ri-Zhao Qiu, Yuchen Song, Xuanbin Peng, Sai Aneesh Suryadevara, Ge Yang, Minghuan Liu, Mazeyu Ji, Chengzhe Jia, Ruihan Yang, Xueyan Zou, Xiaolong Wang

"In-the-wild" mobile manipulation aims to deploy robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place.

Imitation Learning

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

1 code implementation 7 Oct 2024 Ziyue Wang, Chi Chen, Fuwen Luo, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Peng Li, Yang Liu

Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal.

Question Answering Visual Question Answering

GraspSplats: Efficient Manipulation with 3D Feature Splatting

no code implementations 3 Sep 2024 Mazeyu Ji, Ri-Zhao Qiu, Xueyan Zou, Xiaolong Wang

With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings.

Feature Splatting NeRF

DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing

no code implementations 2 Sep 2024 Xiaolong Wang, Zhi-Qi Cheng, Jue Wang, Xiaojiang Peng

To address these challenges, we introduce a new multimodal fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit).

Image Generation Language Modelling +3

FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing

1 code implementation 22 Aug 2024 Jue Wang, Yuxiang Lin, Tianshuo Yuan, Zhi-Qi Cheng, Xiaolong Wang, Jiao GH, Wei Chen, Xiaojiang Peng

Our approach employs a VLLM in comprehending the image content, mask, and user instructions.

ACE: A Cross-Platform Visual-Exoskeletons System for Low-Cost Dexterous Teleoperation

no code implementations 21 Aug 2024 Shiqi Yang, Minghuan Liu, Yuzhe Qin, Runyu Ding, Jialong Li, Xuxin Cheng, Ruihan Yang, Sha Yi, Xiaolong Wang

Compared to previous systems, which often require hardware customization according to different robots, our single system can generalize to humanoid hands, arm-hands, arm-gripper, and quadruped-gripper systems with high-precision teleoperation.

Imitation Learning

Lessons from Learning to Spin "Pens"

no code implementations 26 Jul 2024 Jun Wang, Ying Yuan, Haichuan Che, Haozhi Qi, Yi Ma, Jitendra Malik, Xiaolong Wang

This serves two purposes: 1) pre-training a sensorimotor policy in simulation; 2) conducting open-loop trajectory replay in the real world.

A Simulation Benchmark for Autonomous Racing with Large-Scale Human Data

1 code implementation 23 Jul 2024 Adrian Remonda, Nicklas Hansen, Ayoub Raji, Nicola Musiu, Marko Bertogna, Eduardo Veas, Xiaolong Wang

Despite the availability of international prize-money competitions, scaled vehicles, and simulation environments, research on autonomous racing and the control of sports cars operating close to the limit of handling has been limited by the high costs of vehicle acquisition and management, as well as the limited physics accuracy of open-source simulators.

Autonomous Driving Autonomous Racing +4

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

3 code implementations 5 Jul 2024 Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin

We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN.

16k 8k +2

Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning

no code implementations 3 Jul 2024 Runyu Ding, Yuzhe Qin, Jiyue Zhu, Chengzhe Jia, Shiqi Yang, Ruihan Yang, Xiaojuan Qi, Xiaolong Wang

Our system's ability to handle bimanual manipulations while prioritizing safety and real-time performance makes it a powerful tool for advancing dexterous manipulation and imitation learning.

Imitation Learning

Open-TeleVision: Teleoperation with Immersive Active Visual Feedback

no code implementations 1 Jul 2024 Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, Xiaolong Wang

Teleoperation serves as a powerful method for collecting on-robot data essential for robot learning from demonstrations.

Imitation Learning

Image Neural Field Diffusion Models

no code implementations CVPR 2024 Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi

We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models.

Super-Resolution

Cross-Embodiment Robot Manipulation Skill Transfer using Latent Space Alignment

1 code implementation 4 Jun 2024 Tianyu Wang, Dwait Bhatt, Xiaolong Wang, Nikolay Atanasov

We first introduce encoders and decoders to associate the states and actions of the source robot with a latent space.

Decoder Reinforcement Learning (RL) +1
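The alignment idea above, encoding the source robot's states into a shared latent space and decoding into the target robot's space, can be pictured with a toy example. The linear maps and dimensions below are hypothetical stand-ins for the learned encoders and decoders, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, d_lat = 6, 5, 3                 # illustrative dimensions
enc_src = rng.normal(size=(d_lat, d_src))     # source-robot encoder (stand-in)
dec_tgt = rng.normal(size=(d_tgt, d_lat))     # target-robot decoder (stand-in)

def transfer(x_src):
    """Map a source-robot state through the shared latent space."""
    z = enc_src @ x_src        # source state -> shared latent
    return dec_tgt @ z         # shared latent -> target-robot space

y_tgt = transfer(rng.normal(size=d_src))      # a transferred state
```

Because both embodiments share the latent space, a skill expressed in the source robot's coordinates can be re-expressed in the target robot's coordinates by this encode-then-decode path.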

Hierarchical World Models as Visual Whole-Body Humanoid Controllers

no code implementations 28 May 2024 Nicklas Hansen, Jyothir S V, Vlad Sobal, Yann LeCun, Xiaolong Wang, Hao Su

Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology.

Humanoid Control

Editable Image Elements for Controllable Synthesis

no code implementations 24 Apr 2024 Jiteng Mu, Michaël Gharbi, Richard Zhang, Eli Shechtman, Nuno Vasconcelos, Xiaolong Wang, Taesung Park

In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model.

Dynamic Gaussians Mesh: Consistent Mesh Reconstruction from Monocular Videos

no code implementations 18 Apr 2024 Isabella Liu, Hao Su, Xiaolong Wang

We introduce Gaussian-Mesh Anchoring, which encourages evenly distributed Gaussians, resulting in better mesh reconstruction through mesh-guided densification and pruning of the deformed Gaussians.

Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing

no code implementations 1 Apr 2024 Ri-Zhao Qiu, Ge Yang, Weijia Zeng, Xiaolong Wang

Scene representations using 3D Gaussian primitives have produced excellent results in modeling the appearance of static and dynamic 3D scenes.

Feature Splatting

Visual Whole-Body Control for Legged Loco-Manipulation

no code implementations 25 Mar 2024 Minghuan Liu, Zixuan Chen, Xuxin Cheng, Yandong Ji, Ri-Zhao Qiu, Ruihan Yang, Xiaolong Wang

We propose a framework that can conduct the whole-body control autonomously with visual observations.

Position

Learning Generalizable Feature Fields for Mobile Manipulation

no code implementations 12 Mar 2024 Ri-Zhao Qiu, Yafei Hu, Yuchen Song, Ge Yang, Yang Fu, Jianglong Ye, Jiteng Mu, Ruihan Yang, Nikolay Atanasov, Sebastian Scherer, Xiaolong Wang

An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation.

Novel View Synthesis

DNAct: Diffusion Guided Multi-Task 3D Policy Learning

no code implementations 7 Mar 2024 Ge Yan, Yueh-Hua Wu, Xiaolong Wang

To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion to a 3D space, which provides a comprehensive semantic understanding regarding the scene.

NeRF Neural Rendering

Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models

no code implementations 27 Feb 2024 Xiaolong Wang, Yile Wang, Yuanchi Zhang, Fuwen Luo, Peng Li, Maosong Sun, Yang Liu

Based on the characteristics of the tasks and the strong dialogue-generation capabilities of LLMs, we propose RiC (Reasoning in Conversation), a method that focuses on solving subjective tasks through dialogue simulation.

Dark Humor Detection Dialogue Generation +3

Expressive Whole-Body Control for Humanoid Robots

no code implementations 26 Feb 2024 Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, Xiaolong Wang

Can we enable humanoid robots to generate rich, diverse, and expressive motions in the real world?

Imitation Learning

DEEM: Dynamic Experienced Expert Modeling for Stance Detection

1 code implementation 23 Feb 2024 Xiaolong Wang, Yile Wang, Sijie Cheng, Peng Li, Yang Liu

Recent work has made a preliminary attempt to use large language models (LLMs) to solve the stance detection task, showing promising results.

Stance Detection

Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages

1 code implementation 19 Feb 2024 Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu

While large language models (LLMs) have been pre-trained on multilingual corpora, their performance still lags behind in most languages compared to a few resource-rich languages.

Transfer Learning

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos

no code implementations CVPR 2024 Hongchi Xia, Yang Fu, Sifei Liu, Xiaolong Wang

WildRGB-D comprises large-scale category-level RGB-D object videos, captured by circling each object through 360 degrees with an iPhone.

6D Pose Estimation Camera Pose Estimation +3

DexTouch: Learning to Seek and Manipulate Objects with Tactile Dexterity

no code implementations 23 Jan 2024 Kang-Won Lee, Yuzhe Qin, Xiaolong Wang, Soo-Chul Lim

The sense of touch is an essential ability for skillfully performing a variety of tasks, providing the capacity to search and manipulate objects without relying on visual information.

Pixel-Aligned Language Model

no code implementations CVPR 2024 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modeling Language Modelling +1

Pixel Aligned Language Models

no code implementations 14 Dec 2023 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modeling Language Modelling

Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis

no code implementations 14 Dec 2023 Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali-akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, Yonatan Bisk

Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models (i.e., foundation models) in research fields such as Natural Language Processing (NLP) and Computer Vision (CV), we devote this survey to exploring (i) how these existing foundation models from NLP and CV can be applied to the field of general-purpose robotics, and also exploring (ii) what a robotics-specific foundation model would look like.

COLMAP-Free 3D Gaussian Splatting

1 code implementation CVPR 2024 Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang

While neural rendering has led to impressive advances in scene reconstruction and novel view synthesis, it relies heavily on accurately pre-computed camera poses.

3DGS Camera Pose Estimation +3

Harmonic Mobile Manipulation

no code implementations 11 Dec 2023 Ruihan Yang, Yejin Kim, Rose Hendrix, Aniruddha Kembhavi, Xiaolong Wang, Kiana Ehsani

Recent advancements in robotics have enabled robots to navigate complex scenes or manipulate diverse objects independently.

Navigate

Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing

no code implementations 4 Dec 2023 Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, Xiaolong Wang

In this paper, we introduce a system that leverages visual and tactile sensory inputs to enable dexterous in-hand manipulation.

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

no code implementations 4 Dec 2023 Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang

Given a textual description of a visual task (e.g., "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input.

Colorization Foreground Segmentation +3

TD-MPC2: Scalable, Robust World Models for Continuous Control

2 code implementations 25 Oct 2023 Nicklas Hansen, Hao Su, Xiaolong Wang

TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model.

continuous-control Continuous Control +3

Finetuning Offline World Models in the Real World

no code implementations 24 Oct 2023 Yunhai Feng, Nicklas Hansen, Ziyan Xiong, Chandramouli Rajagopalan, Xiaolong Wang

In this work, we seek to get the best of both worlds: we consider the problem of pretraining a world model with offline data collected on a real robot, and then finetuning the model on online data collected by planning with the learned model.

Offline RL Reinforcement Learning (RL)

Learning to (Learn at Test Time)

1 code implementation 20 Oct 2023 Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, Xinlei Chen

Our inner loop turns out to be equivalent to linear attention when the inner-loop learner is only a linear model, and to self-attention when it is a kernel estimator.
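The linear-attention equivalence stated in this snippet can be checked numerically: with a linear inner-loop learner initialized at zero and one batch gradient step on squared error over (key, value) pairs, the learner's output on a query equals unnormalized linear attention, up to the learning rate. A toy verification with illustrative shapes (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6
K = rng.normal(size=(T, d))          # keys
V = rng.normal(size=(T, d))          # values
q = rng.normal(size=d)               # query
lr = 0.5

# Inner loop: one batch gradient step on 0.5 * ||W k - v||^2, from W = 0.
W0 = np.zeros((d, d))
grad = sum(np.outer(W0 @ k - v, k) for k, v in zip(K, V))
W1 = W0 - lr * grad                  # updated inner-loop linear model

ttt_out = W1 @ q                                        # inner-loop output on the query
attn_out = lr * sum((k @ q) * v for k, v in zip(K, V))  # unnormalized linear attention
```

Since the gradient at `W0 = 0` is `-sum(v k^T)`, the updated model is `W1 = lr * sum(v k^T)`, and `W1 @ q` reproduces the linear-attention sum exactly.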

Generalized Animal Imitator: Agile Locomotion with Versatile Motion Prior

no code implementations 2 Oct 2023 Ruihan Yang, Zhuoqun Chen, Jianhan Ma, Chongyi Zheng, Yiyu Chen, Quan Nguyen, Xiaolong Wang

This paper introduces the Versatile Instructable Motion prior (VIM) - a Reinforcement Learning framework designed to incorporate a range of agile locomotion tasks suitable for advanced robotic applications.

GenSim: Generating Robotic Simulation Tasks via Large Language Models

1 code implementation 2 Oct 2023 Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, Xiaolong Wang

Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data.

Code Generation Diversity

3D Reconstruction with Generalizable Neural Fields using Scene Priors

no code implementations 26 Sep 2023 Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu

NFP not only demonstrates SOTA scene reconstruction performance and efficiency, but it also supports single-image novel-view synthesis, which is underexplored in neural fields.

3D Reconstruction 3D Scene Reconstruction +1

Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf

1 code implementation 9 Sep 2023 Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, Yang Liu

Communication games, which we refer to as incomplete information games that heavily depend on natural language communication, hold significant research value in fields such as economics, social science, and artificial intelligence.

Retrieval

SHAPE: A Sample-adaptive Hierarchical Prediction Network for Medication Recommendation

1 code implementation 9 Sep 2023 Sicen Liu, Xiaolong Wang, Jingcheng Du, Yongshuai Hou, Xianbing Zhao, Hui Xu, Hui Wang, Yang Xiang, Buzhou Tang

Effective medication recommendation under complex multimorbidity conditions is a critical task in healthcare.

PointLLM: Empowering Large Language Models to Understand Point Clouds

3 code implementations 31 Aug 2023 Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin

The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding.

3D Object Captioning 3D Question Answering (3D-QA) +3

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

1 code implementation 31 Aug 2023 Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, Xiaolong Wang

To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel.

Decision Making

Learning Dense Correspondences between Photos and Sketches

no code implementations 24 Jul 2023 Xuanchen Lu, Xiaolong Wang, Judith E Fan

Humans effortlessly grasp the connection between sketches and real-world objects, even when these sketches are far from realistic.

Contrastive Learning

Pluggable Neural Machine Translation Models via Memory-augmented Adapters

1 code implementation 12 Jul 2023 Yuzhuang Xu, Shuo Wang, Peng Li, Xuebo Liu, Xiaolong Wang, Weidong Liu, Yang Liu

Although neural machine translation (NMT) models perform well in the general domain, it remains rather challenging to control their generation behavior to satisfy the requirement of different users.

Machine Translation NMT +1

Test-Time Training on Video Streams

no code implementations 11 Jul 2023 Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction.

Image Reconstruction Panoptic Segmentation

Causal Kripke Models

no code implementations 11 Jul 2023 Yiwen Ding, Krishna Manoorkar, Apostolos Tzimoulis, Ruoding Wang, Xiaolong Wang

This work extends Halpern and Pearl's causal models for actual causality to a possible world semantics environment.

AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System

no code implementations 10 Jul 2023 Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, Dieter Fox

In real-world experiments, AnyTeleop outperforms a previous system designed for specific robot hardware, achieving a higher success rate on the same robot.

Imitation Learning

Elastic Decision Transformer

no code implementations NeurIPS 2023 Yueh-Hua Wu, Xiaolong Wang, Masashi Hamaya

This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants.

Atari Games D4RL +1

Zero-shot Pose Transfer for Unrigged Stylized 3D Characters

1 code implementation CVPR 2023 Jiashun Wang, Xueting Li, Sifei Liu, Shalini De Mello, Orazio Gallo, Xiaolong Wang, Jan Kautz

We present a zero-shot approach that requires only the widely available deformed non-stylized avatars in training, and deforms stylized characters of significantly different shapes at inference.

Pose Transfer

DKINet: Medication Recommendation via Domain Knowledge Informed Deep Learning

1 code implementation 31 May 2023 Sicen Liu, Xiaolong Wang, Xianbing Zhao, Hao Chen

While considering the clinical manifestations of the patient is important, incorporating domain-specific prior knowledge is equally significant in diagnosing the patient's health conditions.

Deep Learning

DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects

1 code implementation CVPR 2023 Chen Bao, Helin Xu, Yuzhe Qin, Xiaolong Wang

On the other hand, operating with a multi-finger robot hand will allow better approximation to human behavior and enable the robot to operate on diverse articulated objects.

Benchmarking Decision Making +2

TUVF: Learning Generalizable Texture UV Radiance Fields

no code implementations 4 May 2023 An-Chieh Cheng, Xueting Li, Sifei Liu, Xiaolong Wang

This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category.

3D Shape Modeling Texture Synthesis

ContactArt: Learning 3D Interaction Priors for Category-level Articulated Object and Hand Poses Estimation

no code implementations 2 May 2023 Zehao Zhu, Jiashun Wang, Yuzhe Qin, Deqing Sun, Varun Jampani, Xiaolong Wang

We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation.

Hand Pose Estimation Object

ActorsNeRF: Animatable Few-shot Human Rendering with Generalizable NeRFs

no code implementations ICCV 2023 Jiteng Mu, Shen Sang, Nuno Vasconcelos, Xiaolong Wang

While NeRF-based human representations have shown impressive novel view synthesis results, most methods still rely on a large number of images / views for training.

NeRF Novel View Synthesis

Efficient bimanual handover and rearrangement via symmetry-aware actor-critic learning

1 code implementation IEEE International Conference on Robotics and Automation (ICRA) 2023 Yunfei Li, Chaoyi Pan, Huazhe Xu, Xiaolong Wang, Yi Wu

We develop a symmetry-aware actor-critic framework that leverages the interchangeable roles of the two manipulators in the bimanual control setting to reduce the policy search space.

Reinforcement Learning (RL)

Neural Volumetric Memory for Visual Locomotion Control

no code implementations CVPR 2023 Ruihan Yang, Ge Yang, Xiaolong Wang

To solve this problem, we follow the paradigm in computer vision that explicitly models the 3D geometry of the scene and propose Neural Volumetric Memory (NVM), a geometric memory architecture that explicitly accounts for the SE(3) equivariance of the 3D world.

3D geometry

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

1 code implementation ICCV 2023 Jianglong Ye, Naiyan Wang, Xiaolong Wang

Recent works on generalizable NeRFs have shown promising results on novel view synthesis from single or few images.

NeRF Neural Rendering +1

Rotating without Seeing: Towards In-hand Dexterity through Touch

no code implementations 20 Mar 2023 Zhao-Heng Yin, Binghao Huang, Yuzhe Qin, Qifeng Chen, Xiaolong Wang

Relying on touch-only sensing, we can directly deploy the policy on a real robot hand and rotate novel objects that were not seen during training.

Object

Dynamic Inference With Grounding Based Vision and Language Models

no code implementations CVPR 2023 Burak Uzkent, Amanmeet Garg, Wentao Zhu, Keval Doshi, Jingru Yi, Xiaolong Wang, Mohamed Omar

For example, recent image and language models with more than 200M parameters have been proposed to learn visual grounding in the pre-training step and show impressive results on downstream vision and language tasks.

Language Modelling Referring Expression +3

Policy Adaptation from Foundation Model Feedback

no code implementations CVPR 2023 Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, Xiaolong Wang

When deploying the trained policy to a new task or a new environment, we first let the policy play with randomly generated instructions to record the demonstrations.

Decision Making model

GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

2 code implementations 13 Dec 2022 Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley, Xiaolong Wang

In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned back to the image features through a transformer decoder.

Decoder Image Classification +6

MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations

1 code implementation 12 Dec 2022 Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran

We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which forms the three phases of our model-based RL framework.

Deep Reinforcement Learning Model-based Reinforcement Learning +2

Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild

1 code implementation 13 Oct 2022 Kaifeng Zhang, Yang Fu, Shubhankar Borse, Hong Cai, Fatih Porikli, Xiaolong Wang

While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from being solved due to the lack of annotations.

6D Pose Estimation 6D Pose Estimation using RGB +2

Visual Reinforcement Learning with Self-Supervised 3D Representations

1 code implementation 13 Oct 2022 Yanjie Ze, Nicklas Hansen, Yinbo Chen, Mohit Jain, Xiaolong Wang

A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state representation using self-supervised methods, which has the potential benefit of improved sample-efficiency and generalization through additional learning signal and inductive biases.

reinforcement-learning Reinforcement Learning +3

MonoNeRF: Learning Generalizable NeRFs from Monocular Videos without Camera Pose

no code implementations 13 Oct 2022 Yang Fu, Ishan Misra, Xiaolong Wang

We propose a generalizable neural radiance field, MonoNeRF, which can be trained on large-scale monocular videos of movement in static scenes without any ground-truth annotations of depth or camera poses.

Camera Pose Estimation Decoder +5

Transformers as Meta-Learners for Implicit Neural Representations

1 code implementation 4 Aug 2022 Yinbo Chen, Xiaolong Wang

Motivated by a generalized formulation of gradient-based meta-learning, we propose a formulation that uses Transformers as hypernetworks for INRs, where it can directly build the whole set of INR weights with Transformers specialized as set-to-set mapping.

Meta-Learning

Graph Inverse Reinforcement Learning from Diverse Videos

no code implementations 28 Jul 2022 Sateesh Kumar, Jonathan Zamora, Nicklas Hansen, Rishabh Jangir, Xiaolong Wang

Research on Inverse Reinforcement Learning (IRL) from third-person videos has shown encouraging results on removing the need for manual reward design for robotic tasks.

Diversity reinforcement-learning +3

Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

1 code implementation 11 Jul 2022 Jianglong Ye, Jiashun Wang, Binghao Huang, Yuzhe Qin, Xiaolong Wang

We first convert the large-scale human-object interaction trajectories into robot demonstrations via motion retargeting, and then use these demonstrations to train CGF.

Human-Object Interaction Detection motion retargeting

Medical Dialogue Response Generation with Pivotal Information Recalling

no code implementations17 Jun 2022 Yu Zhao, Yunxin Li, Yuxiang Wu, Baotian Hu, Qingcai Chen, Xiaolong Wang, Yuxin Ding, Min Zhang

To mitigate this problem, we propose a medical response generation model with Pivotal Information Recalling (MedPIR), which is built on two components, i.e., a knowledge-aware dialogue graph encoder and a recall-enhanced generator.

Dialogue Generation Graph Attention +1

Learning Implicit Feature Alignment Function for Semantic Segmentation

1 code implementation17 Jun 2022 Hanzhe Hu, Yinbo Chen, Jiarui Xu, Shubhankar Borse, Hong Cai, Fatih Porikli, Xiaolong Wang

As such, IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps in arbitrary resolutions.

Segmentation Semantic Segmentation

MSDF: A General Open-Domain Multi-Skill Dialog Framework

no code implementations17 Jun 2022 Yu Zhao, Xinshuo Hu, Yunxin Li, Baotian Hu, Dongfang Li, Sichao Chen, Xiaolong Wang

In this paper, we propose a general Multi-Skill Dialog Framework, namely MSDF, which can be applied to different dialog tasks (e.g., knowledge-grounded dialog and persona-based dialog).

Decoder

CATNet: Cross-event Attention-based Time-aware Network for Medical Event Prediction

no code implementations29 Apr 2022 Sicen Liu, Xiaolong Wang, Yang Xiang, Hui Xu, Hui Wang, Buzhou Tang

It is a time-aware, event-aware and task-adaptive method with the following advantages: 1) it models heterogeneous information and temporal information in a unified way, considering temporally irregular characteristics both locally and globally; and 2) it takes full advantage of correlations among different types of events via cross-event attention.

Time Series Analysis

From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation from Single-Camera Teleoperation

no code implementations26 Apr 2022 Yuzhe Qin, Hao Su, Xiaolong Wang

We propose to perform imitation learning for dexterous manipulation with a multi-finger robot hand from human demonstrations, and transfer the policy to the real robot hand.

Imitation Learning

GIFS: Neural Implicit Function for General Shape Representation

1 code implementation CVPR 2022 Jianglong Ye, Yuntao Chen, Naiyan Wang, Xiaolong Wang

This limitation leads to tedious data processing (converting non-watertight raw data to watertight) as well as the incapability of representing general object shapes in the real world.

3D Shape Reconstruction

Learning Generalizable Dexterous Manipulation from Human Grasp Affordance

no code implementations5 Apr 2022 Yueh-Hua Wu, Jiashun Wang, Xiaolong Wang

In this paper, we propose to learn dexterous manipulation using large-scale demonstrations with diverse 3D objects in a category, which are generated from a human grasp affordance model.

Imitation Learning Representation Learning

Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos

1 code implementation CVPR 2022 Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, Xiaolong Wang

To tackle this task, we first provide an automatic way to collect trajectory and hotspots labels on large-scale data.

Object

CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs

1 code implementation CVPR 2022 Jiteng Mu, Shalini De Mello, Zhiding Yu, Nuno Vasconcelos, Xiaolong Wang, Jan Kautz, Sifei Liu

We represent the correspondence maps of different images as warped coordinate frames transformed from a canonical coordinate frame; i.e., the correspondence map, which describes the structure (e.g., the shape of a face), is controlled via a transformation.

Disentanglement

Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image

1 code implementation CVPR 2022 Xuanchi Ren, Xiaolong Wang

Novel view synthesis from a single image has recently attracted a lot of attention, and it has been primarily advanced by 3D deep learning and rendering techniques.

Novel View Synthesis

Temporal Difference Learning for Model Predictive Control

2 code implementations9 Mar 2022 Nicklas Hansen, Xiaolong Wang, Hao Su

Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as computational budget for planning increases.

continuous-control Continuous Control +2
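The snippet above names the two ingredients of data-driven model predictive control: a learned model for sample-efficient planning, and more compute yielding better plans. A minimal sketch of that loop, with the learned model, reward, and value networks replaced by toy stand-in functions (all names and the 1-D system are hypothetical, not TD-MPC's actual components), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D system: stand-ins for what the real method *learns*.
def model_step(s, a):   # latent dynamics sketch
    return 0.9 * s + a
def reward(s, a):       # reward-model sketch (prefer s near 0)
    return -(s ** 2) - 0.1 * a ** 2
def value(s):           # terminal value-function sketch
    return -(s ** 2)

def plan(s0, horizon=3, n_samples=256, gamma=0.99):
    # Random-shooting MPC: sample action sequences, roll out the model,
    # score by discounted reward plus a bootstrapped terminal value.
    seqs = rng.uniform(-1, 1, (n_samples, horizon))
    returns = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s, g = s0, 1.0
        for a in seq:
            returns[i] += g * reward(s, a)
            s = model_step(s, a)
            g *= gamma
        returns[i] += g * value(s)  # TD-style bootstrap at the horizon
    return seqs[np.argmax(returns)][0]  # execute only the first action

a0 = plan(s0=1.0)
print(-1.0 <= a0 <= 1.0)
```

Increasing `n_samples` or `horizon` spends more planning compute for (usually) better actions, which is the "performance grows with budget" property the abstract refers to; the terminal value term is where temporal-difference learning enters.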

GroupViT: Semantic Segmentation Emerges from Text Supervision

5 code implementations CVPR 2022 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang

With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning.

Object Detection Scene Understanding +3

Multimodal data matters: language model pre-training over structured and unstructured electronic health records

1 code implementation25 Jan 2022 Sicen Liu, Xiaolong Wang, Yongshuai Hou, Ge Li, Hui Wang, Hui Xu, Yang Xiang, Buzhou Tang

As two important textual modalities in electronic health records (EHR), both structured data (clinical codes) and unstructured data (clinical narratives) have recently been increasingly applied to the healthcare domain.

Decision Making Language Modeling +2

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

no code implementations19 Jan 2022 Rishabh Jangir, Nicklas Hansen, Sambaran Ghosal, Mohit Jain, Xiaolong Wang

We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist.

Reinforcement Learning (RL)

NovelD: A Simple yet Effective Exploration Criterion

1 code implementation NeurIPS 2021 Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian

We analyze NovelD thoroughly in MiniGrid and find that, empirically, it helps the agent explore the environment more uniformly, with a focus on exploring beyond the boundary.

Deep Reinforcement Learning Efficient Exploration +2

Learning Continuous Environment Fields via Implicit Functions

no code implementations ICLR 2022 Xueting Li, Shalini De Mello, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz, Sifei Liu

We propose a novel scene representation that encodes reaching distance -- the distance between any position in the scene to a goal along a feasible trajectory.

Position Trajectory Prediction

Online Adaptation for Implicit Object Tracking and Shape Reconstruction in the Wild

1 code implementation24 Nov 2021 Jianglong Ye, Yuntao Chen, Naiyan Wang, Xiaolong Wang

Tracking and reconstructing 3D objects from cluttered scenes are key components of computer vision, robotics, and autonomous driving systems.

3D Shape Reconstruction Autonomous Driving +1

Multi-Person 3D Motion Prediction with Multi-Range Transformers

1 code implementation NeurIPS 2021 Jiashun Wang, Huazhe Xu, Medhini Narasimhan, Xiaolong Wang

Thus, instead of predicting each human pose trajectory in isolation, we introduce a Multi-Range Transformers model which contains a local-range encoder for individual motion and a global-range encoder for social interactions.

Decoder motion prediction +3

Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay Randomization

1 code implementation29 Sep 2021 Chieko Sarah Imai, Minghao Zhang, Yuchen Zhang, Marcin Kierebinski, Ruihan Yang, Yuzhe Qin, Xiaolong Wang

While Reinforcement Learning (RL) provides a promising paradigm for agile locomotion skills with vision inputs in simulation, it is still very challenging to deploy the RL policy in the real world.

Reinforcement Learning (RL)

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

1 code implementation12 Aug 2021 Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, Xiaolong Wang

While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation.

Imitation Learning motion retargeting +1

Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers

1 code implementation ICLR 2022 Ruihan Yang, Minghao Zhang, Nicklas Hansen, Huazhe Xu, Xiaolong Wang

Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead.

Reinforcement Learning (RL)

Sentence-level Online Handwritten Chinese Character Recognition

no code implementations4 Jul 2021 Yunxin Li, Qian Yang, Qingcai Chen, Lin Ma, Baotian Hu, Xiaolong Wang, Yuxin Ding

Single online handwritten Chinese character recognition (single OLHCCR) has achieved prominent performance.

Sentence Word Embeddings

GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph

no code implementations1 Jul 2021 Yunxin Li, Yu Zhao, Baotian Hu, Qingcai Chen, Yang Xiang, Xiaolong Wang, Yuxin Ding, Lin Ma

Previous works indicate that the glyph of Chinese characters contains rich semantic information and has the potential to enhance the representation of Chinese characters.

Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation

3 code implementations NeurIPS 2021 Nicklas Hansen, Hao Su, Xiaolong Wang

Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals.

Data Augmentation Q-Learning +1

Single RGB-D Camera Teleoperation for General Robotic Manipulation

no code implementations28 Jun 2021 Quan Vuong, Yuzhe Qin, Runlin Guo, Xiaolong Wang, Hao Su, Henrik Christensen

We propose a teleoperation system that uses a single RGB-D camera as the human motion capture device.

Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time

1 code implementation CVPR 2021 Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, Xiaolong Wang

Estimating 3D hand and object pose from a single image is an extremely challenging problem: hands and objects are often self-occluded during interactions, and the 3D annotations are scarce as even humans cannot directly label the ground-truths from a single image perfectly.

hand-object pose Object

Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency

no code implementations ICCV 2021 Haiping Wu, Xiaolong Wang

In this paper, we propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.

Action Recognition Contrastive Learning +4

Robust Object Detection via Instance-Level Temporal Cycle Confusion

1 code implementation ICCV 2021 Xin Wang, Thomas E. Huang, Benlin Liu, Fisher Yu, Xiaolong Wang, Joseph E. Gonzalez, Trevor Darrell

Building reliable object detectors that are robust to domain shifts, such as various changes in context, viewpoint, and object appearances, is critical for real-world applications.

Object object-detection +2

A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation

1 code implementation ICCV 2021 Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, Xiaolong Wang

To deal with the large shape variance, we introduce Articulated Signed Distance Functions (A-SDF) to represent articulated shapes with a disentangled latent space, where we have separate codes for encoding shape and articulation.

Test-time Adaptation
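The disentangled-latent idea in the A-SDF snippet above, separate codes for shape identity and articulation fed into one signed-distance network, can be sketched as follows. All sizes and the tiny MLP are illustrative stand-ins, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a tiny SDF-style MLP conditioned on two
# separate latent codes (shape identity vs. articulation).
D_SHAPE, D_ART, HID = 8, 2, 32
W1 = rng.normal(0, 0.3, (3 + D_SHAPE + D_ART, HID))
W2 = rng.normal(0, 0.3, (HID, 1))

def a_sdf(points, shape_code, art_code):
    # Concatenate each 3-D query point with both codes, so shape and
    # articulation can be varied independently at inference time.
    n = len(points)
    x = np.concatenate(
        [points, np.tile(shape_code, (n, 1)), np.tile(art_code, (n, 1))],
        axis=1)
    h = np.maximum(x @ W1, 0.0)
    return (h @ W2).ravel()  # one signed distance per query point

pts = rng.uniform(-1, 1, (4, 3))
shape = rng.normal(size=D_SHAPE)
sdf_open = a_sdf(pts, shape, art_code=np.array([0.0, 1.0]))
print(sdf_open.shape)  # (4,)
```

Because the two codes enter the network separately, one can hold the shape code fixed and sweep only the articulation code to animate the same object, which is the practical payoff of the disentanglement.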

Hand-Object Contact Consistency Reasoning for Human Grasps Generation

no code implementations ICCV 2021 Hanwen Jiang, Shaowei Liu, Jiashun Wang, Xiaolong Wang

Based on the hand-object contact consistency, we design novel objectives in training the human grasp generation model and also a new self-supervised task which allows the grasp generation network to be adjusted even during test time.

Grasp Generation Object +1

Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective

5 code implementations ICCV 2021 Jiarui Xu, Xiaolong Wang

To learn generalizable representations for correspondence at large scale, a variety of self-supervised pretext tasks have been proposed to explicitly perform object-level or patch-level similarity learning.

Contrastive Learning Object +5

Region Similarity Representation Learning

1 code implementation ICCV 2021 Tete Xiao, Colorado J Reed, Xiaolong Wang, Kurt Keutzer, Trevor Darrell

We present Region Similarity Representation Learning (ReSim), a new approach to self-supervised representation learning for localization-based tasks such as object detection and segmentation.

Instance Segmentation Object +5

Solving Compositional Reinforcement Learning Problems via Task Reduction

1 code implementation ICLR 2021 Yunfei Li, Yilin Wu, Huazhe Xu, Xiaolong Wang, Yi Wu

We propose a novel learning paradigm, Self-Imitation via Reduction (SIR), for solving compositional reinforcement learning problems.

continuous-control Continuous Control +3

Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization

2 code implementations ICLR 2021 Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Du, Yu Wang, Yi Wu

We propose a simple, general and effective technique, Reward Randomization for discovering diverse strategic policies in complex multi-agent games.

BeBold: Exploration Beyond the Boundary of Explored Regions

2 code implementations15 Dec 2020 Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian

In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR.

Deep Reinforcement Learning Efficient Exploration +1

Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes

1 code implementation CVPR 2021 Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, Xiaolong Wang

Synthesizing 3D human motion plays an important role in many graphics applications as well as understanding human activity.

Motion Synthesis

Online Adaptation for Consistent Mesh Reconstruction in the Wild

no code implementations NeurIPS 2020 Xueting Li, Sifei Liu, Shalini De Mello, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild.

3D Reconstruction

MedWriter: Knowledge-Aware Medical Text Generation

no code implementations COLING 2020 Youcheng Pan, Qingcai Chen, Weihua Peng, Xiaolong Wang, Baotian Hu, Xin Liu, Junying Chen, Wenxiu Zhou

Exploiting domain knowledge to guarantee the correctness of generated text has been a hot topic in recent years, especially in highly professional domains such as medicine.

Text Generation

Generalization in Reinforcement Learning by Soft Data Augmentation

3 code implementations26 Nov 2020 Nicklas Hansen, Xiaolong Wang

Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation.

Data Augmentation reinforcement-learning +2

Multi-Agent Collaboration via Reward Attribution Decomposition

2 code implementations16 Oct 2020 Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian

In this work, we propose Collaborative Q-learning (CollaQ) that achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play.

Dota 2 Multi-agent Reinforcement Learning +2

Reducing Class Collapse in Metric Learning with Easy Positive Sampling

no code implementations28 Sep 2020 Elad Levi, Tete Xiao, Xiaolong Wang, Trevor Darrell

We theoretically prove and empirically show that, under reasonable noise assumptions, prevalent embedding losses in metric learning, e.g., the triplet loss, tend to project all samples of a class with various modes onto a single point in the embedding space, resulting in a class collapse that usually renders the space ill-sorted for classification or retrieval.

Image Retrieval Metric Learning +2
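The class-collapse mechanism described above is easy to see with the standard triplet loss, and "easy positive" sampling is the mitigation the title refers to: instead of pulling the anchor toward an arbitrary (possibly far-mode) positive, pull it toward the nearest same-class sample. A small self-contained sketch (the toy embeddings are made up for illustration):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet loss: pull anchor toward the positive,
    # push it away from the negative by at least `margin`.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

def easiest_positive(anchor, candidates):
    # Easy positive sampling: choose the same-class sample that is
    # already closest to the anchor, so distinct modes of a class are
    # not forced onto a single point (avoiding class collapse).
    dists = np.sum((candidates - anchor) ** 2, axis=1)
    return candidates[np.argmin(dists)]

anchor = np.array([0.0, 0.0])
positives = np.array([[0.1, 0.0], [2.0, 2.0]])  # two modes of one class
negative = np.array([0.5, 0.5])

pos = easiest_positive(anchor, positives)
print(triplet_loss(anchor, pos, negative))  # 0.0 (easy pair, no pull)
```

With the easy positive the loss is already zero here, so the far mode at `[2.0, 2.0]` is left in place rather than being dragged onto the anchor, which is the intuition behind preserving intra-class structure.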

Hierarchical Style-based Networks for Motion Synthesis

no code implementations ECCV 2020 Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, Xiaolong Wang, Trevor Darrell

Generating diverse and natural human motion is one of the long-standing goals for creating intelligent characters in the animated world.

Motion Synthesis