DataMix: Efficient Privacy-Preserving Edge-Cloud Inference

no code implementations ECCV 2020 Zhijian Liu, Zhanghao Wu, Chuang Gan, Ligeng Zhu, Song Han

Third, our solution is efficient on the edge since the majority of the workload is delegated to the cloud, and our mixing and de-mixing processes introduce very few extra computations.

CoNav: A Benchmark for Human-Centered Collaborative Navigation

1 code implementation 4 Jun 2024 Changhao Li, Xinyu Sun, Peihao Chen, Jugang Fan, Zixu Wang, Yanxia Liu, Jinhui Zhu, Chuang Gan, Mingkui Tan

To achieve this goal, the agent needs to be equipped with a fundamental collaborative navigation ability, where the agent should reason human intention by observing human activities and then navigate to the human's intended destination in advance of the human.


RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

no code implementations 30 May 2024 Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan

In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation.

LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery

1 code implementation 16 May 2024 Pingchuan Ma, Tsun-Hsuan Wang, Minghao Guo, Zhiqing Sun, Joshua B. Tenenbaum, Daniela Rus, Chuang Gan, Wojciech Matusik

Large Language Models have recently gained significant attention in scientific discovery for their extensive knowledge and advanced reasoning capabilities.

STAR: A Benchmark for Situated Reasoning in Real-World Videos

no code implementations NeurIPS 2021 Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, Chuang Gan

This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark).

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

1 code implementation 7 May 2024 Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.

Virtual Foundry Graphnet for Metal Sintering Deformation Prediction

1 code implementation 17 Apr 2024 Rachel Chen, Juheon Lee, Chuang Gan, Zijiang Yang, Mohammad Amin Nabian, Jun Zeng

Metal sintering is a necessary step for metal injection molded parts and for binder jet printing, such as HP's metal 3D printer.

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

no code implementations 16 Apr 2024 Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Chuang Gan

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world.

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

1 code implementation 14 Mar 2024 Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan

This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term easy-to-hard generalization.

3D-VLA: A 3D Vision-Language-Action Generative World Model

no code implementations 14 Mar 2024 Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

no code implementations 9 Feb 2024 Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, Chuang Gan

We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, which shows that the current AI models still lack physical commonsense for the continuum, especially soft-bodies, and illustrates the value of the proposed dataset.

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

no code implementations 30 Jan 2024 Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.

HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

1 code implementation 23 Jan 2024 Qinhong Zhou, Sunli Chen, Yisong Wang, Haozhe Xu, Weihua Du, Hongxin Zhang, Yilun Du, Joshua B. Tenenbaum, Chuang Gan

Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world.

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

no code implementations CVPR 2024 Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world.

DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics

no code implementations NeurIPS 2023 Zhiao Huang, Feng Chen, Yewen Pu, Chunru Lin, Hao Su, Chuang Gan

Combining gradient-based trajectory optimization with differentiable physics simulation is an efficient technique for solving soft-body manipulation problems.


DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning

no code implementations 10 Dec 2023 Kunyang Lin, Yufeng Wang, Peihao Chen, Runhao Zeng, Siyuan Zhou, Mingkui Tan, Chuang Gan

In this paper, we propose a new approach that enables agents to learn whether their behaviors should be consistent with that of other agents by utilizing intrinsic rewards to learn the optimal policy for each agent.

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

no code implementations 6 Nov 2023 Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan

A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far.

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

no code implementations 26 Oct 2023 Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data).

Autonomous Tree-search Ability of Large Language Models

no code implementations 14 Oct 2023 Zheyu Zhang, Zhuorui Ye, Yikang Shen, Chuang Gan

This approach yields a greater improvement compared to the ones fine-tuned on CoT data.

Sparse Universal Transformer

1 code implementation 11 Oct 2023 Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
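
As a rough illustration of what sharing parameters across layers means, the toy sketch below contrasts a stack with one parameter set per layer against a stack that reuses a single layer at every depth step. The scalar "layer", the names `make_layer` and `run_stack`, and all numeric values are assumptions made for this example; it is not the Universal Transformer architecture itself.

```python
# Toy illustration of cross-layer parameter sharing (not the paper's model).

def make_layer(weight, bias):
    """Return a scalar 'layer' computing relu(weight * x + bias)."""
    def layer(x):
        return max(0.0, weight * x + bias)
    return layer


def run_stack(layers, x):
    """Apply a list of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


DEPTH = 4

# Standard stack: one independent (weight, bias) pair per layer.
standard_params = [(0.5, 1.0), (1.5, -0.5), (0.8, 0.2), (1.1, 0.0)]
standard_stack = [make_layer(w, b) for w, b in standard_params]

# Shared stack: a single (weight, bias) pair reused at every depth step.
shared_params = (0.5, 1.0)
shared_layer = make_layer(*shared_params)
shared_stack = [shared_layer] * DEPTH

standard_param_count = 2 * len(standard_params)  # grows with depth
shared_param_count = len(shared_params)          # independent of depth

shared_output = run_stack(shared_stack, 2.0)
```

In this sketch the shared stack keeps 2 parameters regardless of depth, while the standard stack needs 8 for four layers.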

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

no code implementations ICCV 2023 Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques.

SALMON: Self-Alignment with Instructable Reward Models

1 code implementation 9 Oct 2023 Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents.

Generalizable Long-Horizon Manipulations with Large Language Models

no code implementations 3 Oct 2023 Haoyu Zhou, Mingyu Ding, Weikun Peng, Masayoshi Tomizuka, Lin Shao, Chuang Gan

This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations with novel objects and unseen tasks.

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

no code implementations 28 Sep 2023 Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, Liam Paull

We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts.

Aligning Large Multimodal Models with Factually Augmented RLHF

no code implementations 25 Sep 2023 Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context.

$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

no code implementations 15 Aug 2023 Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data.

Learning Vision-and-Language Navigation from YouTube Videos

1 code implementation ICCV 2023 Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it.

Reparameterized Policy Learning for Multimodal Trajectory Optimization

no code implementations 20 Jul 2023 Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su

We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and learned world model to achieve strong exploration capabilities and high data efficiency.

Building Cooperative Embodied Agents Modularly with Large Language Models

1 code implementation 5 Jul 2023 Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan

In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments.

ModuleFormer: Modularity Emerges from Mixture-of-Experts

1 code implementation 7 Jun 2023 Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan

In our experiment, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency, since ModuleFormer only activates a subset of its modules for each input token, thus it could achieve the same performance as dense LLMs with more than two times throughput; 2) Extendability, ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialisation, finetuning ModuleFormer could specialize a subset of modules to the finetuning task and the task-unrelated modules could be easily pruned for a lightweight deployment.
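
The efficiency point above, activating only a subset of modules for each input token, can be sketched with a simple top-k router. This is a minimal illustration under stated assumptions: the function names `top_k_route` and `sparse_forward`, the toy modules, the router scores, and the plain averaging of outputs are all hypothetical and are not taken from ModuleFormer.

```python
# Toy sketch of sparse module activation via top-k routing (illustrative only).

def top_k_route(scores, k):
    """Return the indices of the k highest-scoring modules, in ascending order.

    Ties are broken in favor of the lower index.
    """
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def sparse_forward(x, modules, router_scores, k):
    """Run only the k selected modules and average their outputs.

    Real mixture-of-experts layers usually weight outputs by gate values;
    a plain average is used here to keep the example small.
    """
    active = top_k_route(router_scores, k)
    outputs = [modules[i](x) for i in active]
    return sum(outputs) / len(outputs), active


# Four hypothetical modules; only two are evaluated for this input.
modules = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
]
router_scores = [0.1, 0.7, 0.05, 0.15]  # assumed router output for one token

y, active = sparse_forward(3.0, modules, router_scores, k=2)
```

Here modules 1 and 3 are selected, so half of the modules are never evaluated for this token; that skipped work is where the throughput gain of a sparse architecture would come from.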

SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

no code implementations 31 May 2023 Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, Daniela Rus

Diffusion model-based approaches have shown promise in data-driven planning, but they come with no safety guarantees, which makes them hard to apply in safety-critical applications.


Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

1 code implementation NeurIPS 2023 Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable.

Learning Neural Constitutive Laws From Motion Observations for Generalizable PDE Dynamics

no code implementations 27 Apr 2023 Pingchuan Ma, Peter Yichen Chen, Bolei Deng, Joshua B. Tenenbaum, Tao Du, Chuang Gan, Wojciech Matusik

Many NN approaches learn an end-to-end model that implicitly models both the governing PDE and constitutive models (or material models).

EC^2: Emergent Communication for Embodied Control

no code implementations 19 Apr 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Learning Situation Hyper-Graphs for Video Question Answering

1 code implementation CVPR 2023 Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah

The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction.

Hyper-Decision Transformer for Efficient Online Policy Adaptation

no code implementations 17 Apr 2023 Mengdi Xu, Yuchen Lu, Yikang Shen, Shun Zhang, Ding Zhao, Chuang Gan

To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner.

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

no code implementations 7 Apr 2023 Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

ECL consists of: (i) an instruction parser that translates the natural languages into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program.

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

1 code implementation CVPR 2023 Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.

Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

no code implementations CVPR 2023 Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan

Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound.

DexDeform: Dexterous Deformable Object Manipulation with Human Demonstrations and Differentiable Physics

no code implementations 27 Mar 2023 Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B. Tenenbaum, Chuang Gan

Reinforcement learning approaches for dexterous rigid object manipulation would struggle in this setting due to the complexity of physics interaction with deformable objects.

3D Concept Learning and Reasoning from Multi-View Images

no code implementations CVPR 2023 Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations.

Planning with Large Language Models for Code Generation

no code implementations 9 Mar 2023 Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process.

PAC-NeRF: Physics Augmented Continuum Neural Radiance Fields for Geometry-Agnostic System Identification

no code implementations 9 Mar 2023 Xuan Li, Yi-Ling Qiao, Peter Yichen Chen, Krishna Murthy Jatavallabhula, Ming Lin, Chenfanfu Jiang, Chuang Gan

In this work, we aim to identify parameters characterizing a physical system from a set of multi-view videos without any assumption on object geometry or topology.

FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation

1 code implementation 4 Mar 2023 Zhou Xian, Bo Zhu, Zhenjia Xu, Hsiao-Yu Tung, Antonio Torralba, Katerina Fragkiadaki, Chuang Gan

We identify several challenges for fluid manipulation learning by evaluating a set of reinforcement learning and trajectory optimization methods on our platform.


EC2: Emergent Communication for Embodied Control

no code implementations CVPR 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction

no code implementations ICCV 2023 Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

Without performance loss on Cityscapes, our EfficientViT provides up to 8.8x and 3.8x GPU latency reduction over SegFormer and SegNeXt, respectively.

Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners

no code implementations 15 Dec 2022 Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik Learned-Miller, Chuang Gan

To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad').

CLAWSAT: Towards Both Robust and Accurate Code Models

1 code implementation 21 Nov 2022 Jinghan Jia, Shashank Srikant, Tamara Mitrovska, Chuang Gan, Shiyu Chang, Sijia Liu, Una-May O'Reilly

We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models.

Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation

no code implementations 27 Oct 2022 Xingyu Lin, Carl Qi, Yunchu Zhang, Zhiao Huang, Katerina Fragkiadaki, Yunzhu Li, Chuang Gan, David Held

Effective planning of long-horizon deformable object manipulation requires suitable abstractions at both the spatial and temporal levels.

JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions

1 code implementation 18 Oct 2022 Mo Yu, Yi Gu, Xiaoxiao Guo, Yufei Feng, Xiaodan Zhu, Michael Greenspan, Murray Campbell, Chuang Gan

Hence, in order to achieve higher performance on our tasks, models need to effectively utilize such functional knowledge to infer the outcomes of actions, rather than relying solely on memorizing facts.

Revisiting the Roles of "Text" in Text Games

no code implementations 15 Oct 2022 Yi Gu, Shunyu Yao, Chuang Gan, Joshua B. Tenenbaum, Mo Yu

Text games present opportunities for natural language understanding (NLU) methods to tackle reinforcement learning (RL) challenges.

Learning Active Camera for Multi-Object Navigation

no code implementations 14 Oct 2022 Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

How to make robots perceive the environment as efficiently as humans is a fundamental problem in robotics.

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

1 code implementation 14 Oct 2022 Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan

To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.

On the Forward Invariance of Neural ODEs

no code implementations 10 Oct 2022 Wei Xiao, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Yutong Ban, Chuang Gan, Daniela Rus

We propose a new method to ensure neural ordinary differential equations (ODEs) satisfy output specifications by using invariance set propagation.

Gait Recognition in the Wild with Multi-hop Temporal Switch

1 code implementation 1 Sep 2022 Jinkai Zheng, Xinchen Liu, Xiaoyan Gu, Yaoqi Sun, Chuang Gan, Jiyong Zhang, Wu Liu, Chenggang Yan

Current methods that obtain state-of-the-art performance on in-the-lab benchmarks achieve much worse accuracy on the recently proposed in-the-wild datasets because these methods can hardly model the varied temporal dynamics of gait sequences in unconstrained scenes.

Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

1 code implementation 22 Jul 2022 Hongbin Lin, Yifan Zhang, Zhen Qiu, Shuaicheng Niu, Chuang Gan, Yanxia Liu, Mingkui Tan

2) Prototype-based alignment and replay: based on the identified label prototypes, we align both domains and enforce the model to retain previous knowledge.

3D Concept Grounding on Neural Fields

no code implementations 13 Jul 2022 Yining Hong, Yilun Du, Chunru Lin, Joshua B. Tenenbaum, Chuang Gan

Experimental results show that our proposed framework outperforms unsupervised/language-mediated segmentation models on semantic and instance segmentation tasks, and also outperforms existing models on the challenging 3D-aware visual reasoning tasks.

Weakly Supervised Grounding for VQA in Vision-Language Transformers

1 code implementation 5 Jul 2022 Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels da Vitoria Lobo, Mubarak Shah

Transformers for visual-language representation learning have attracted a lot of interest and have shown tremendous performance on visual question answering (VQA) and grounding.

On-Device Training Under 256KB Memory

1 code implementation 30 Jun 2022 Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors.
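
The idea of skipping gradient computation for some parameters can be sketched as a masked update step. The example below is a simplified illustration, not the paper's Sparse Update scheme: the function `sparse_sgd_step`, the parameter names, the quadratic toy loss, and the hand-picked mask are all assumptions, and how the important layers and sub-tensors are actually chosen is not covered here.

```python
# Toy sketch: update only a chosen subset of parameters (illustrative only).

def sparse_sgd_step(params, grad_fns, update_mask, lr):
    """Apply one SGD step to the parameters whose mask entry is True.

    For masked-out parameters the gradient function is never called,
    which stands in for skipping their backward computation.
    """
    new_params = dict(params)
    evaluated = []
    for name, value in params.items():
        if not update_mask.get(name, False):
            continue  # frozen parameter: no gradient is computed
        grad = grad_fns[name](value)
        evaluated.append(name)
        new_params[name] = value - lr * grad
    return new_params, evaluated


# Hypothetical scalar "parameters" with toy loss (p - 1)^2, gradient 2 * (p - 1).
params = {"layer1.weight": 3.0, "layer2.weight": 5.0, "classifier.bias": 2.0}
grad_fns = {name: (lambda p: 2.0 * (p - 1.0)) for name in params}

# Assumed importance decision: freeze layer1, update the other two.
update_mask = {"layer1.weight": False, "layer2.weight": True, "classifier.bias": True}

new_params, evaluated = sparse_sgd_step(params, grad_fns, update_mask, lr=0.25)
```

In this sketch only two of the three gradients are ever evaluated, and the frozen parameter keeps its original value.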

SNAKE: Shape-aware Neural 3D Keypoint Field

1 code implementation 3 Jun 2022 Chengliang Zhong, Peixing You, Xiaoxue Chen, Hao Zhao, Fuchun Sun, Guyue Zhou, Xiaodong Mu, Chuang Gan, Wenbing Huang

Detecting 3D keypoints from point clouds is important for shape reconstruction, while this work investigates the dual question: can shape reconstruction benefit 3D keypoint detection?

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

5 code implementations 29 May 2022 Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively.

Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics

no code implementations ICLR 2022 Sizhe Li, Zhiao Huang, Tao Du, Hao Su, Joshua B. Tenenbaum, Chuang Gan

Extensive experimental results suggest that: 1) on multi-stage tasks that are infeasible for the vanilla differentiable physics solver, our approach discovers contact points that efficiently guide the solver to completion; 2) on tasks where the vanilla solver performs sub-optimally or near-optimally, our contact point discovery method performs better than or on par with the manipulation performance obtained with handcrafted contact points.

Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction

no code implementations CVPR 2022 Yining Hong, Kaichun Mo, Li Yi, Leonidas J. Guibas, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

Specifically, FixNet consists of a perception module to extract the structured representation from the 3D point cloud, a physical dynamics prediction module to simulate the results of interactions on 3D objects, and a functionality prediction module to evaluate the functionality and choose the correct fix.

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

no code implementations ICLR 2022 Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

In this paper, we take an initial step to highlight the importance of inferring the hidden physical properties not directly observable from visual appearances, by introducing the Compositional Physical Reasoning (ComPhy) dataset.

Learning Neural Acoustic Fields

1 code implementation 4 Apr 2022 Andrew Luo, Yilun Du, Michael J. Tarr, Joshua B. Tenenbaum, Antonio Torralba, Chuang Gan

By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds.
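
Because a linear time-invariant system is fully described by its impulse response, applying that response to an arbitrary sound amounts to a discrete convolution. The sketch below shows this with a hand-written impulse response; the function name `convolve` and the sample values are assumptions for illustration, and in the paper's setting the impulse response would come from the learned field rather than being fixed by hand.

```python
# Applying an impulse response to a sound via discrete convolution (toy values).

def convolve(signal, impulse_response):
    """Return the full discrete convolution of two sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


# Hypothetical impulse response for one emitter/listener pair:
# a direct path plus an echo two samples later at half amplitude.
impulse_response = [1.0, 0.0, 0.5]

dry = [1.0, 2.0, 3.0]                  # arbitrary input sound
wet = convolve(dry, impulse_response)  # sound as heard at the listener

# Linearity: scaling the input scales the output.
wet_scaled = convolve([2.0 * s for s in dry], impulse_response)

# Time invariance: delaying the input by one sample delays the output by one sample.
wet_delayed = convolve([0.0] + dry, impulse_response)
```

The same impulse response can be reused for any input signal, which is what makes a per-location impulse response a convenient quantity to learn.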

FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations

no code implementations ICLR 2022 Lingjie Mei, Jiayuan Mao, Ziqi Wang, Chuang Gan, Joshua B. Tenenbaum

We present a meta-learning framework for learning new visual concepts quickly, from just one or a few examples, guided by multiple naturally occurring data streams: simultaneously looking at images, reading sentences that describe the objects in the scene, and interpreting supplemental sentences that relate the novel concept with other concepts.

Linking Emergent and Natural Languages via Corpus Transfer

1 code implementation ICLR 2022 Shunyu Yao, Mo Yu, Yang Zhang, Karthik R Narasimhan, Joshua B. Tenenbaum, Chuang Gan

In this work, we propose a novel way to establish such a link by corpus transfer, i.e., pretraining on a corpus of emergent language for downstream natural language tasks, which is in contrast to prior work that directly transfers speaker and listener parameters.

AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation

1 code implementation CVPR 2022 Xueyi Liu, Xiaomeng Xu, Anyi Rao, Chuang Gan, Li Yi

To solve the above issues, we propose AutoGPart, a generic method enabling training generalizable 3D part segmentation networks with the task prior considered.

PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning

no code implementations NeurIPS 2021 Yining Hong, Li Yi, Joshua B. Tenenbaum, Antonio Torralba, Chuang Gan

A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies.

Graph Convolutional Module for Temporal Action Localization in Videos

no code implementations 1 Dec 2021 Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, Chuang Gan

To this end, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, including two-stage and one-stage paradigms.

Memory-efficient Patch-based Inference for Tiny Deep Learning

no code implementations NeurIPS 2021 Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

no code implementations NeurIPS 2021 Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

1 code implementation 28 Oct 2021 Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.

Network Augmentation for Tiny Deep Learning

no code implementations ICLR 2022 Han Cai, Chuang Gan, Ji Lin, Song Han

We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks.

OPEn: An Open-ended Physics Environment for Learning Without a Task

1 code implementation 13 Oct 2021 Chuang Gan, Abhishek Bhandwaldar, Antonio Torralba, Joshua B. Tenenbaum, Phillip Isola

We test several existing RL-based exploration methods on this benchmark and find that an agent using unsupervised contrastive learning for representation learning, and impact-driven learning for exploration, achieved the best results.

Inducing Reusable Skills From Demonstrations with Option-Controller Network

no code implementations 29 Sep 2021 Siyuan Zhou, Yikang Shen, Yuchen Lu, Aaron Courville, Joshua B. Tenenbaum, Chuang Gan

With the isolation of information and the synchronous calling mechanism, we can impose a division of works between the controller and options in an end-to-end training regime.

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device

4 code implementations 27 Sep 2021 Ji Lin, Chuang Gan, Kuan Wang, Song Han

Secondly, TSM has high efficiency; it achieves a high frame rate of 74fps and 29fps for online video recognition on Jetson Nano and Galaxy Note8.

Self-supervised Audiovisual Representation Learning for Remote Sensing Data

1 code implementation 2 Aug 2021 Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery.

Certifiably Robust Interpretation via Renyi Differential Privacy

no code implementations 4 Jul 2021 Ao Liu, Xiaoyu Chen, Sijia Liu, Lirong Xia, Chuang Gan

The advantages of our Renyi-Robust-Smooth (RDP-based interpretation method) are threefold.

Global Rhythm Style Transfer Without Text Transcriptions

1 code implementation 16 Jun 2021 Kaizhi Qian, Yang Zhang, Shiyu Chang, JinJun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.

Temporal and Object Quantification Networks

no code implementations 10 Jun 2021 Jiayuan Mao, Zhezheng Luo, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu, Leslie Pack Kaelbling, Tomer D. Ullman

We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events.

Adversarial Option-Aware Hierarchical Imitation Learning

1 code implementation 10 Jun 2021 Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei Li

In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.

Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

1 code implementation CVPR 2021 Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah

In this paper, we focus on a more relaxed setting: the grounding of relevant visual entities in a weakly supervised manner by training on the VQA task alone.

PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics

1 code implementation ICLR 2021 Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B. Tenenbaum, Chuang Gan

Experimental results suggest that 1) RL-based approaches struggle to solve most of the tasks efficiently; 2) gradient-based approaches, by optimizing open-loop control sequences with the built-in differentiable physics engine, can rapidly find a solution within tens of iterations, but still fall short on multi-stage tasks that require long-term planning.

TransCenter: Transformers with Dense Representations for Multiple-Object Tracking

2 code implementations 28 Mar 2021 Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, Xavier Alameda-Pineda

Methodologically, we propose the use of image-related dense detection queries and efficient sparse tracking queries produced by our carefully designed query learning networks (QLN).

Learning Task Decomposition with Ordered Memory Policy Network

no code implementations 19 Mar 2021 Yuchen Lu, Yikang Shen, Siyuan Zhou, Aaron Courville, Joshua B. Tenenbaum, Chuang Gan

The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstructured demonstration.

AGENT: A Benchmark for Core Psychological Reasoning

no code implementations 24 Feb 2021 Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan, Kevin A. Smith, Shari Liu, Dan Gutfreund, Elizabeth Spelke, Joshua B. Tenenbaum, Tomer D. Ullman

For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life.

On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning

1 code implementation ICLR 2021 Ren Wang, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Chuang Gan, Meng Wang

Despite the generalization power of the meta-model, it remains unclear how adversarial robustness can be maintained by MAML in few-shot learning.

Temporal and Object Quantification Nets

no code implementations 1 Jan 2021 Jiayuan Mao, Zhezheng Luo, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu, Leslie Pack Kaelbling, Tomer Ullman

We aim to learn generalizable representations for complex activities by quantifying over both entities and time, as in “the kicker is behind all the other players,” or “the player controls the ball until it moves toward the goal.” Such a structural inductive bias of object relations, object quantification, and temporal orders will enable the learned representation to generalize to situations with varying numbers of agents, objects, and time courses.

Object-Centric Diagnosis of Visual Reasoning

no code implementations 21 Dec 2020 Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

In contrast, symbolic and modular models have a relatively better grounding and robustness, though at the cost of accuracy.

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

3 code implementations 13 Dec 2020 Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

Existing state-of-the-art methods have achieved excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance.

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

1 code implementation 27 Oct 2020 Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition.

Synthetic Training for Monocular Human Mesh Recovery

no code implementations 27 Oct 2020 Yu Sun, Qian Bao, Wu Liu, Wenpeng Gao, Yili Fu, Chuang Gan, Tao Mei

To solve this problem, we design a multi-branch framework to disentangle the regression of different body properties, enabling us to separate each component's training in a synthetic training manner using available unpaired data.

Location-aware Graph Convolutional Networks for Video Question Answering

1 code implementation 7 Aug 2020 Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan

In this work, we propose to represent the contents in the video as a location-aware graph by incorporating the location information of an object into the graph construction.

Noisy Agents: Self-supervised Exploration by Predicting Auditory Events

no code implementations 27 Jul 2020 Chuang Gan, Xiaoyu Chen, Phillip Isola, Antonio Torralba, Joshua B. Tenenbaum

Humans integrate multiple sensory modalities (e.g., visual and audio) to build a causal understanding of the physical world.

TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning

1 code implementation NeurIPS 2020 Han Cai, Chuang Gan, Ligeng Zhu, Song Han

Furthermore, combined with feature extractor adaptation, TinyTL provides 7.3-12.9x memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3.

Foley Music: Learning to Generate Music from Videos

no code implementations ECCV 2020 Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments.

MCUNet: Tiny Deep Learning on IoT Devices

1 code implementation NeurIPS 2020 Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, Song Han

Machine learning on tiny IoT devices based on microcontroller units (MCU) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller than even that of mobile phones.

Generating Visually Aligned Sound from Videos

1 code implementation 14 Jul 2020 Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan

During testing, the audio forwarding regularizer is removed to ensure that REGNET can produce purely aligned sound only from visual features.

Language Guided Networks for Cross-modal Moment Retrieval

no code implementations 18 Jun 2020 Kun Liu, Huadong Ma, Chuang Gan

In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.

A Real-time Action Representation with Temporal Encoding and Deep Compression

no code implementations 17 Jun 2020 Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, Chuang Gan

Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2 times faster inference, with a model that requires less than 5MB of storage.

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

4 code implementations ACL 2020 Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han

To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search.

Once for All: Train One Network and Specialize it for Efficient Deployment

1 code implementation ICLR 2020 Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

Most of the traditional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally expensive and unscalable.

Deep Audio Priors Emerge From Harmonic Convolutional Networks

no code implementations ICLR 2020 Zhoutong Zhang, Yunyun Wang, Chuang Gan, Jiajun Wu, Joshua B. Tenenbaum, Antonio Torralba, William T. Freeman

Dense Regression Network for Video Grounding

1 code implementation CVPR 2020 Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan

The key idea of this paper is to use the distances between the frame within the ground truth and the starting (ending) frame as dense supervisions to improve the video grounding accuracy.
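The dense supervision described above can be illustrated with a minimal sketch. It assumes frames are indexed by integers and the ground-truth segment is an inclusive range; the function name `dense_boundary_targets` and its parameters are hypothetical and are not taken from the paper's released code.

```python
def dense_boundary_targets(start, end, num_frames):
    """Map each frame inside the ground-truth segment to its distances
    from the starting and ending frames.

    Hypothetical helper for illustration: `start` and `end` are inclusive
    frame indices of the ground-truth segment.
    """
    if not (0 <= start <= end < num_frames):
        raise ValueError("segment must lie inside the video")
    # Every frame within [start, end] yields one regression target pair,
    # so a single annotated segment produces many training signals.
    return {t: (t - start, end - t) for t in range(start, end + 1)}


targets = dense_boundary_targets(start=2, end=5, num_frames=8)
for frame, (to_start, to_end) in targets.items():
    print(frame, to_start, to_end)
```

Under this assumption, each frame in the segment contributes its own target pair, which is what makes the supervision dense rather than limited to the two boundary frames.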

Visual Concept-Metaconcept Learning

1 code implementation NeurIPS 2019 Chi Han, Jiayuan Mao, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu

Humans reason with concepts and metaconcepts: we recognize red and green from visual input; we also understand that they describe the same property of objects (i.e., the color).

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

1 code implementation 25 Dec 2019 Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum

In this paper, we attempt to approach the problem of Audio-Visual Embodied Navigation, the task of planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given only raw egocentric visual and audio sensory data.


Cross-channel Communication Networks

1 code implementation NeurIPS 2019 Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh

Self-supervised Moving Vehicle Tracking with Stereo Sound

no code implementations ICCV 2019 Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba

At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera meta-data, without any visual input.

TruNet: Short Videos Generation from Long Videos via Story-Preserving Truncation

no code implementations 14 Oct 2019 Fan Yang, Xiao Liu, Dongliang He, Chuang Gan, Jian Wang, Chao Li, Fu Li, Shilei Wen

In this work, we introduce a new problem, named story-preserving long video truncation, that requires an algorithm to automatically truncate a long-duration video into multiple short and attractive sub-videos, each containing an unbroken story.

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

3 code implementations ICLR 2020 Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum

While these models thrive on the perception-based task (descriptive), they perform poorly on the causal tasks (explanatory, predictive and counterfactual), suggesting that a principled approach for causal reasoning should incorporate the capability of both perceiving complex visual and language inputs, and understanding the underlying dynamics and causal relations.

Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos

1 code implementation 1 Oct 2019 Ji Lin, Chuang Gan, Song Han

With such hardware-aware model design, we are able to scale up training on the Summit supercomputer and reduce the training time on the Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74.0%, which is 1.6x and 2.9x faster than previous 3D video models with higher accuracy.
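As a quick check of the scale of the reported reduction, the two training times quoted above can be converted to seconds and compared. This is plain arithmetic on the figures in the abstract; the helper `to_seconds` is a hypothetical name introduced only for this sketch.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration given in hours, minutes and seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Training times quoted in the abstract.
baseline = to_seconds(hours=49, minutes=55)   # 49 hours 55 minutes
scaled = to_seconds(minutes=14, seconds=13)   # 14 minutes 13 seconds

# Ratio of wall-clock training time before and after scaling up.
speedup = baseline / scaled
print(f"{baseline} s -> {scaled} s, about {speedup:.1f}x shorter")
```

Note that this ratio concerns wall-clock training time after scaling out; it is separate from the 1.6x and 2.9x figures, which compare against previous 3D video models.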

Graph Convolutional Networks for Temporal Action Localization

1 code implementation ICCV 2019 Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, Chuang Gan

Then we apply the GCNs over the graph to model the relations among different proposals and learn powerful representations for the action classification and localization.

Once-for-All: Train One Network and Specialize it for Efficient Deployment

10 code implementations 26 Aug 2019 Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top-1 accuracy improvement over MobileNetV3, or the same accuracy but 1.5x faster than MobileNetV3 and 2.6x faster than EfficientNet w.r.t. measured latency) while reducing GPU hours and CO2 emissions by many orders of magnitude.

Deep Concept-wise Temporal Convolutional Networks for Action Localization

2 code implementations 26 Aug 2019 Xin Li, Tianwei Lin, Xiao Liu, Chuang Gan, WangMeng Zuo, Chao Li, Xiang Long, Dongliang He, Fu Li, Shilei Wen

In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution.

Self-Supervised Audio-Visual Co-Segmentation

no code implementations 18 Apr 2019 Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data.

Defensive Quantization: When Efficiency Meets Robustness

no code implementations ICLR 2019 Ji Lin, Chuang Gan, Song Han

This paper aims to raise awareness about the security of quantized models, and we design a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models.

The Sound of Motions

1 code implementation ICCV 2019 Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba

Interpreting Adversarial Examples by Activation Promotion and Suppression

no code implementations 3 Apr 2019 Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, Xue Lin

It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers.

TSM: Temporal Shift Module for Efficient Video Understanding

13 code implementations ICCV 2019 Ji Lin, Chuang Gan, Song Han

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

8 code implementations 5 Nov 2018 Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Li-Min Wang, Shilei Wen

In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial temporal network