Search Results for author: Chuang Gan

Found 170 papers, 72 papers with code

DataMix: Efficient Privacy-Preserving Edge-Cloud Inference

no code implementations ECCV 2020 Zhijian Liu, Zhanghao Wu, Chuang Gan, Ligeng Zhu, Song Han

Third, our solution is extit{efficient} on the edge since the majority of the workload is delegated to the cloud, and our mixing and de-mixing processes introduce very few extra computations.

Privacy Preserving speech-recognition +1

Virtual Foundry Graphnet for Metal Sintering Deformation Prediction

no code implementations17 Apr 2024 Rachel, Chen, Juheon Lee, Chuang Gan, Zijiang Yang, Mohammad Amin Nabian, Jun Zeng

Metal Sintering is a necessary step for Metal Injection Molded parts and binder jet such as HP's metal 3D printer.

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

no code implementations16 Apr 2024 Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Chuang Gan

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world.

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

1 code implementation14 Mar 2024 Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan

This paper answers this question in the context of tackling hard reasoning tasks (e. g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e. g., level 1-3 MATH problems), which we term as \textit{easy-to-hard generalization}.

Math Reinforcement Learning (RL) +1

3D-VLA: A 3D Vision-Language-Action Generative World Model

no code implementations14 Mar 2024 Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.

Language Modelling Large Language Model +1

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

no code implementations9 Feb 2024 Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, Chuang Gan

We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, which shows that the current AI models still lack physical commonsense for the continuum, especially soft-bodies, and illustrates the value of the proposed dataset.

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

no code implementations30 Jan 2024 Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.

Language Modelling Large Language Model +1

HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

1 code implementation23 Jan 2024 Qinhong Zhou, Sunli Chen, Yisong Wang, Haozhe Xu, Weihua Du, Hongxin Zhang, Yilun Du, Joshua B. Tenenbaum, Chuang Gan

Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world.

Common Sense Reasoning Decision Making +1

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

no code implementations16 Jan 2024 Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world.

Language Modelling Large Language Model

DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics

no code implementations NeurIPS 2023 Zhiao Huang, Feng Chen, Yewen Pu, Chunru Lin, Hao Su, Chuang Gan

Combining gradient-based trajectory optimization with differentiable physics simulation is an efficient technique for solving soft-body manipulation problems.

valid

DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning

no code implementations10 Dec 2023 Kunyang Lin, Yufeng Wang, Peihao Chen, Runhao Zeng, Siyuan Zhou, Mingkui Tan, Chuang Gan

In this paper, we propose a new approach that enables agents to learn whether their behaviors should be consistent with that of other agents by utilizing intrinsic rewards to learn the optimal policy for each agent.

Multi-agent Reinforcement Learning reinforcement-learning +2

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

no code implementations6 Nov 2023 Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan

A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far.

CoLA Question Answering +5

RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

no code implementations2 Nov 2023 YuFei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Zackory Erickson, David Held, Chuang Gan

We present RoboGen, a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation.

Motion Planning

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

no code implementations26 Oct 2023 Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e. g., locally fine-tuning large language models on personalized data).

Privacy Preserving

Autonomous Tree-search Ability of Large Language Models

no code implementations14 Oct 2023 Zheyu Zhang, Zhuorui Ye, Yikang Shen, Chuang Gan

This approach yield a greater improvement compared to the ones fine-tuned on CoT data.

Decision Making

Sparse Universal Transformer

no code implementations11 Oct 2023 Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

no code implementations ICCV 2023 Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

To tackle this problem, we propose a new framework TextPSG consisting of four modules, i. e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques.

Graph Generation Panoptic Scene Graph Generation +1

SALMON: Self-Alignment with Instructable Reward Models

1 code implementation9 Oct 2023 Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents.

In-Context Learning Language Modelling

Generalizable Long-Horizon Manipulations with Large Language Models

no code implementations3 Oct 2023 Haoyu Zhou, Mingyu Ding, Weikun Peng, Masayoshi Tomizuka, Lin Shao, Chuang Gan

This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations with novel objects and unseen tasks.

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

no code implementations28 Sep 2023 Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, Liam Paull

We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts.

Aligning Large Multimodal Models with Factually Augmented RLHF

no code implementations25 Sep 2023 Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context.

Hallucination Image Captioning +1

$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

no code implementations15 Aug 2023 Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data.

Navigate Robot Navigation +1

Learning Vision-and-Language Navigation from YouTube Videos

1 code implementation ICCV 2023 Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it.

Navigate Vision and Language Navigation

Reparameterized Policy Learning for Multimodal Trajectory Optimization

no code implementations20 Jul 2023 Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su

We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and learned world model to achieve strong exploration capabilities and high data efficiency.

Reinforcement Learning (RL)

Building Cooperative Embodied Agents Modularly with Large Language Models

1 code implementation5 Jul 2023 Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan

In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments.

Text Generation

ModuleFormer: Modularity Emerges from Mixture-of-Experts

1 code implementation7 Jun 2023 Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan

In our experiment, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency, since ModuleFormer only activates a subset of its modules for each input token, thus it could achieve the same performance as dense LLMs with more than two times throughput; 2) Extendability, ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialisation, finetuning ModuleFormer could specialize a subset of modules to the finetuning task and the task-unrelated modules could be easily pruned for a lightweight deployment.

Language Modelling

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

5 code implementations1 Jun 2023 Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan, Song Han

Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth).

Common Sense Reasoning Language Modelling +1

SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

no code implementations31 May 2023 Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, Daniela Rus

Diffusion model-based approaches have shown promise in data-driven planning, but there are no safety guarantees, thus making it hard to be applied for safety-critical applications.

Denoising

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

1 code implementation NeurIPS 2023 Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable.

In-Context Learning Language Modelling

Learning Neural Constitutive Laws From Motion Observations for Generalizable PDE Dynamics

no code implementations27 Apr 2023 Pingchuan Ma, Peter Yichen Chen, Bolei Deng, Joshua B. Tenenbaum, Tao Du, Chuang Gan, Wojciech Matusik

Many NN approaches learn an end-to-end model that implicitly models both the governing PDE and constitutive models (or material models).

Out-of-Distribution Generalization

EC^2: Emergent Communication for Embodied Control

no code implementations19 Apr 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Contrastive Learning Language Modelling

Learning Situation Hyper-Graphs for Video Question Answering

1 code implementation CVPR 2023 Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah

The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction.

Ranked #6 on Video Question Answering on AGQA 2.0 balanced (Average Accuracy metric)

Question Answering Video Question Answering +1

Hyper-Decision Transformer for Efficient Online Policy Adaptation

no code implementations17 Apr 2023 Mengdi Xu, Yuchen Lu, Yikang Shen, Shun Zhang, Ding Zhao, Chuang Gan

To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner.

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

no code implementations7 Apr 2023 Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

ECL consists of: (i) an instruction parser that translates the natural languages into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program.

Instruction Following Self-Supervised Learning

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

1 code implementation CVPR 2023 Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.

Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

no code implementations CVPR 2023 Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan

Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound.

DexDeform: Dexterous Deformable Object Manipulation with Human Demonstrations and Differentiable Physics

no code implementations27 Mar 2023 Sizhe Li, Zhiao Huang, Tao Chen, Tao Du, Hao Su, Joshua B. Tenenbaum, Chuang Gan

Reinforcement learning approaches for dexterous rigid object manipulation would struggle in this setting due to the complexity of physics interaction with deformable objects.

Deformable Object Manipulation Object

3D Concept Learning and Reasoning from Multi-View Images

no code implementations CVPR 2023 Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations.

Question Answering Visual Question Answering +1

PAC-NeRF: Physics Augmented Continuum Neural Radiance Fields for Geometry-Agnostic System Identification

no code implementations9 Mar 2023 Xuan Li, Yi-Ling Qiao, Peter Yichen Chen, Krishna Murthy Jatavallabhula, Ming Lin, Chenfanfu Jiang, Chuang Gan

In this work, we aim to identify parameters characterizing a physical system from a set of multi-view videos without any assumption on object geometry or topology.

Neural Rendering Object

Planning with Large Language Models for Code Generation

no code implementations9 Mar 2023 Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process.

Code Generation Language Modelling +1

FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation

1 code implementation4 Mar 2023 Zhou Xian, Bo Zhu, Zhenjia Xu, Hsiao-Yu Tung, Antonio Torralba, Katerina Fragkiadaki, Chuang Gan

We identify several challenges for fluid manipulation learning by evaluating a set of reinforcement learning and trajectory optimization methods on our platform.

Benchmarking

EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction

no code implementations ICCV 2023 Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

Without performance loss on Cityscapes, our EfficientViT provides up to 8. 8x and 3. 8x GPU latency reduction over SegFormer and SegNeXt, respectively.

Autonomous Driving Super-Resolution

EC2: Emergent Communication for Embodied Control

no code implementations CVPR 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Contrastive Learning Language Modelling

Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners

no code implementations15 Dec 2022 Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik Learned-Miller, Chuang Gan

To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad').

Multi-Task Learning

CLAWSAT: Towards Both Robust and Accurate Code Models

1 code implementation21 Nov 2022 Jinghan Jia, Shashank Srikant, Tamara Mitrovska, Chuang Gan, Shiyu Chang, Sijia Liu, Una-May O'Reilly

We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models.

Code Generation Code Summarization +2

Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation

no code implementations27 Oct 2022 Xingyu Lin, Carl Qi, Yunchu Zhang, Zhiao Huang, Katerina Fragkiadaki, Yunzhu Li, Chuang Gan, David Held

Effective planning of long-horizon deformable object manipulation requires suitable abstractions at both the spatial and temporal levels.

Deformable Object Manipulation

JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions

1 code implementation18 Oct 2022 Mo Yu, Yi Gu, Xiaoxiao Guo, Yufei Feng, Xiaodan Zhu, Michael Greenspan, Murray Campbell, Chuang Gan

Hence, in order to achieve higher performance on our tasks, models need to effectively utilize such functional knowledge to infer the outcomes of actions, rather than relying solely on memorizing facts.

Reading Comprehension

Revisiting the Roles of "Text" in Text Games

no code implementations15 Oct 2022 Yi Gu, Shunyu Yao, Chuang Gan, Joshua B. Tenenbaum, Mo Yu

Text games present opportunities for natural language understanding (NLU) methods to tackle reinforcement learning (RL) challenges.

Natural Language Understanding Passage Retrieval +2

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

1 code implementation14 Oct 2022 Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan

To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.

Navigate Vision and Language Navigation

Learning Active Camera for Multi-Object Navigation

no code implementations14 Oct 2022 Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

How to make robots perceive the environment as efficiently as humans is a fundamental problem in robotics.

Navigate Object

On the Forward Invariance of Neural ODEs

no code implementations10 Oct 2022 Wei Xiao, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Yutong Ban, Chuang Gan, Daniela Rus

We propose a new method to ensure neural ordinary differential equations (ODEs) satisfy output specifications by using invariance set propagation.

Autonomous Vehicles Collision Avoidance +2

Gait Recognition in the Wild with Multi-hop Temporal Switch

1 code implementation1 Sep 2022 Jinkai Zheng, Xinchen Liu, Xiaoyan Gu, Yaoqi Sun, Chuang Gan, Jiyong Zhang, Wu Liu, Chenggang Yan

Current methods that obtain state-of-the-art performance on in-the-lab benchmarks achieve much worse accuracy on the recently proposed in-the-wild datasets because these methods can hardly model the varied temporal dynamics of gait sequences in unconstrained scenes.

Gait Recognition in the Wild

Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation

1 code implementation22 Jul 2022 Hongbin Lin, Yifan Zhang, Zhen Qiu, Shuaicheng Niu, Chuang Gan, Yanxia Liu, Mingkui Tan

2) Prototype-based alignment and replay: based on the identified label prototypes, we align both domains and enforce the model to retain previous knowledge.

Unsupervised Domain Adaptation

3D Concept Grounding on Neural Fields

no code implementations13 Jul 2022 Yining Hong, Yilun Du, Chunru Lin, Joshua B. Tenenbaum, Chuang Gan

Experimental results show that our proposed framework outperforms unsupervised/language-mediated segmentation models on semantic and instance segmentation tasks, as well as outperforms existing models on the challenging 3D aware visual reasoning tasks.

Instance Segmentation Question Answering +3

Weakly Supervised Grounding for VQA in Vision-Language Transformers

1 code implementation5 Jul 2022 Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels da Vitoria Lobo, Mubarak Shah

Transformers for visual-language representation learning have been getting a lot of interest and shown tremendous performance on visual question answering (VQA) and grounding.

Question Answering Representation Learning +1

On-Device Training Under 256KB Memory

1 code implementation30 Jun 2022 Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, Song Han

To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors.

Quantization Transfer Learning

SNAKE: Shape-aware Neural 3D Keypoint Field

1 code implementation3 Jun 2022 Chengliang Zhong, Peixing You, Xiaoxue Chen, Hao Zhao, Fuchun Sun, Guyue Zhou, Xiaodong Mu, Chuang Gan, Wenbing Huang

Detecting 3D keypoints from point clouds is important for shape reconstruction, while this work investigates the dual question: can shape reconstruction benefit 3D keypoint detection?

Keypoint Detection

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

5 code implementations29 May 2022 Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

Without performance loss on Cityscapes, our EfficientViT provides up to 13. 9$\times$ and 6. 2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively.

Autonomous Driving Image Classification +7

Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics

no code implementations ICLR 2022 Sizhe Li, Zhiao Huang, Tao Du, Hao Su, Joshua B. Tenenbaum, Chuang Gan

Extensive experimental results suggest that: 1) on multi-stage tasks that are infeasible for the vanilla differentiable physics solver, our approach discovers contact points that efficiently guide the solver to completion; 2) on tasks where the vanilla solver performs sub-optimally or near-optimally, our contact point discovery method performs better than or on par with the manipulation performance obtained with handcrafted contact points.

Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction

no code implementations CVPR 2022 Yining Hong, Kaichun Mo, Li Yi, Leonidas J. Guibas, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

Specifically, FixNet consists of a perception module to extract the structured representation from the 3D point cloud, a physical dynamics prediction module to simulate the results of interactions on 3D objects, and a functionality prediction module to evaluate the functionality and choose the correct fix.

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

no code implementations ICLR 2022 Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

In this paper, we take an initial step to highlight the importance of inferring the hidden physical properties not directly observable from visual appearances, by introducing the Compositional Physical Reasoning (ComPhy) dataset.

Learning Neural Acoustic Fields

1 code implementation4 Apr 2022 Andrew Luo, Yilun Du, Michael J. Tarr, Joshua B. Tenenbaum, Antonio Torralba, Chuang Gan

By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds.

FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations

no code implementations ICLR 2022 Lingjie Mei, Jiayuan Mao, Ziqi Wang, Chuang Gan, Joshua B. Tenenbaum

We present a meta-learning framework for learning new visual concepts quickly, from just one or a few examples, guided by multiple naturally occurring data streams: simultaneously looking at images, reading sentences that describe the objects in the scene, and interpreting supplemental sentences that relate the novel concept with other concepts.

Meta-Learning Novel Concepts +1

Linking Emergent and Natural Languages via Corpus Transfer

1 code implementation ICLR 2022 Shunyu Yao, Mo Yu, Yang Zhang, Karthik R Narasimhan, Joshua B. Tenenbaum, Chuang Gan

In this work, we propose a novel way to establish such a link by corpus transfer, i. e. pretraining on a corpus of emergent language for downstream natural language tasks, which is in contrast to prior work that directly transfers speaker and listener parameters.

Attribute Disentanglement +2

AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation

1 code implementation CVPR 2022 Xueyi Liu, Xiaomeng Xu, Anyi Rao, Chuang Gan, Li Yi

To solve the above issues, we propose AutoGPart, a generic method enabling training generalizable 3D part segmentation networks with the task prior considered.

3D Part Segmentation Domain Generalization +1

PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning

no code implementations NeurIPS 2021 Yining Hong, Li Yi, Joshua B. Tenenbaum, Antonio Torralba, Chuang Gan

A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies.

Instance Segmentation Object +2

STAR: A Benchmark for Situated Reasoning in Real-World Videos

1 code implementation NeurIPS 2021 Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR).

Logical Reasoning Question Answering

Memory-efficient Patch-based Inference for Tiny Deep Learning

no code implementations NeurIPS 2021 Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.

Image Classification Neural Architecture Search +3

Graph Convolutional Module for Temporal Action Localization in Videos

no code implementations1 Dec 2021 Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, Chuang Gan

To this end, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, including two-stage and one-stage paradigms.

Ranked #2 on Temporal Action Localization on THUMOS’14 (mAP IOU@0.1 metric)

Action Recognition Temporal Action Localization

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

no code implementations NeurIPS 2021 Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.

counterfactual Visual Reasoning

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

1 code implementation28 Oct 2021 Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.

Image Classification Neural Architecture Search +3

Network Augmentation for Tiny Deep Learning

no code implementations ICLR 2022 Han Cai, Chuang Gan, Ji Lin, Song Han

We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks.

Data Augmentation Image Classification +2

OPEn: An Open-ended Physics Environment for Learning Without a Task

1 code implementation13 Oct 2021 Chuang Gan, Abhishek Bhandwaldar, Antonio Torralba, Joshua B. Tenenbaum, Phillip Isola

We test several existing RL-based exploration methods on this benchmark and find that an agent using unsupervised contrastive learning for representation learning, and impact-driven learning for exploration, achieved the best results.

Contrastive Learning Representation Learning

Inducing Reusable Skills From Demonstrations with Option-Controller Network

no code implementations29 Sep 2021 Siyuan Zhou, Yikang Shen, Yuchen Lu, Aaron Courville, Joshua B. Tenenbaum, Chuang Gan

With the isolation of information and the synchronous calling mechanism, we can impose a division of works between the controller and options in an end-to-end training regime.

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device

4 code implementations27 Sep 2021 Ji Lin, Chuang Gan, Kuan Wang, Song Han

Secondly, TSM has high efficiency; it achieves a high frame rate of 74fps and 29fps for online video recognition on Jetson Nano and Galaxy Note8.

Video Recognition Video Understanding

Self-supervised Audiovisual Representation Learning for Remote Sensing Data

1 code implementation2 Aug 2021 Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery.

Cross-Modal Retrieval Representation Learning +1

Certifiably Robust Interpretation via Renyi Differential Privacy

no code implementations4 Jul 2021 Ao Liu, Xiaoyu Chen, Sijia Liu, Lirong Xia, Chuang Gan

The advantages of our Renyi-Robust-Smooth (RDP-based interpretation method) are three-folds.

Computational Efficiency

Global Rhythm Style Transfer Without Text Transcriptions

1 code implementation16 Jun 2021 Kaizhi Qian, Yang Zhang, Shiyu Chang, JinJun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions.

Representation Learning Style Transfer

Temporal and Object Quantification Networks

no code implementations10 Jun 2021 Jiayuan Mao, Zhezheng Luo, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu, Leslie Pack Kaelbling, Tomer D. Ullman

We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events.

Object Temporal Sequences

Adversarial Option-Aware Hierarchical Imitation Learning

1 code implementation10 Jun 2021 Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan, Lei LI

In particular, we propose an Expectation-Maximization(EM)-style algorithm: an E-step that samples the options of expert conditioned on the current learned policy, and an M-step that updates the low- and high-level policies of agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent.

Imitation Learning

Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

1 code implementation CVPR 2021 Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah

In this paper, we focus on a more relaxed setting: the grounding of relevant visual entities in a weakly supervised manner by training on the VQA task alone.

Question Answering Visual Question Answering

PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics

1 code implementation ICLR 2021 Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B. Tenenbaum, Chuang Gan

Experimental results suggest that 1) RL-based approaches struggle to solve most of the tasks efficiently; 2) gradient-based approaches, by optimizing open-loop control sequences with the built-in differentiable physics engine, can rapidly find a solution within tens of iterations, but still fall short on multi-stage tasks that require long-term planning.

Reinforcement Learning (RL)

TransCenter: Transformers with Dense Representations for Multiple-Object Tracking

2 code implementations28 Mar 2021 Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, Xavier Alameda-Pineda

Methodologically, we propose the use of image-related dense detection queries and efficient sparse tracking queries produced by our carefully designed query learning networks (QLN).

Ranked #13 on Multi-Object Tracking on MOT20 (using extra training data)

Image Classification Multi-Object Tracking +4

Learning Task Decomposition with Ordered Memory Policy Network

no code implementations19 Mar 2021 Yuchen Lu, Yikang Shen, Siyuan Zhou, Aaron Courville, Joshua B. Tenenbaum, Chuang Gan

The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstruc-tured demonstration.

Inductive Bias

AGENT: A Benchmark for Core Psychological Reasoning

no code implementations24 Feb 2021 Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan, Kevin A. Smith, Shari Liu, Dan Gutfreund, Elizabeth Spelke, Joshua B. Tenenbaum, Tomer D. Ullman

For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life.

Core Psychological Reasoning

On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning

1 code implementation ICLR 2021 Ren Wang, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Chuang Gan, Meng Wang

Despite the generalization power of the meta-model, it remains elusive that how adversarial robustness can be maintained by MAML in few-shot learning.

Adversarial Attack Adversarial Robustness +3

Temporal and Object Quantification Nets

no code implementations1 Jan 2021 Jiayuan Mao, Zhezheng Luo, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu, Leslie Pack Kaelbling, Tomer Ullman

We aim to learn generalizable representations for complex activities by quantifying over both entities and time, as in “the kicker is behind all the other players,” or “the player controls the ball until it moves toward the goal.” Such a structural inductive bias of object relations, object quantification, and temporal orders will enable the learned representation to generalize to situations with varying numbers of agents, objects, and time courses.

Event Detection Inductive Bias +1

Object-Centric Diagnosis of Visual Reasoning

no code implementations21 Dec 2020 Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

In contrast, symbolic and modular models have a relatively better grounding and robustness, though at the cost of accuracy.

Object Question Answering +2

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

3 code implementations13 Dec 2020 Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

Existing state-of-the-art methods have achieved excellent accuracy regardless of the complexity meanwhile efficient spatiotemporal modeling solutions are slightly inferior in performance.

Action Classification Action Recognition +2

Synthetic Training for Monocular Human Mesh Recovery

no code implementations27 Oct 2020 Yu Sun, Qian Bao, Wu Liu, Wenpeng Gao, Yili Fu, Chuang Gan, Tao Mei

To solve this problem, we design a multi-branch framework to disentangle the regression of different body properties, enabling us to separate each component's training in a synthetic training manner using unpaired data available.

Computational Efficiency Human Mesh Recovery

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

1 code implementation27 Oct 2020 Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition.

Representation Learning Retrieval +2

Location-aware Graph Convolutional Networks for Video Question Answering

1 code implementation7 Aug 2020 Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan

In this work, we propose to represent the contents in the video as a location-aware graph by incorporating the location information of an object into the graph construction.

Action Recognition graph construction +3

Noisy Agents: Self-supervised Exploration by Predicting Auditory Events

no code implementations27 Jul 2020 Chuang Gan, Xiaoyu Chen, Phillip Isola, Antonio Torralba, Joshua B. Tenenbaum

Humans integrate multiple sensory modalities (e. g. visual and audio) to build a causal understanding of the physical world.

Atari Games Reinforcement Learning (RL)

TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning

1 code implementation NeurIPS 2020 Han Cai, Chuang Gan, Ligeng Zhu, Song Han

Furthermore, combined with feature extractor adaptation, TinyTL provides 7. 3-12. 9x memory saving without sacrificing accuracy compared to fine-tuning the full Inception-V3.

Transfer Learning

Foley Music: Learning to Generate Music from Videos

no code implementations ECCV 2020 Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments.

Music Generation Translation

MCUNet: Tiny Deep Learning on IoT Devices

1 code implementation NeurIPS 2020 Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, Song Han

Machine learning on tiny IoT devices based on microcontroller units (MCU) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than mobile phones.

BIG-bench Machine Learning Neural Architecture Search +1

Generating Visually Aligned Sound from Videos

1 code implementation14 Jul 2020 Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan

During testing, the audio forwarding regularizer is removed to ensure that REGNET can produce purely aligned sound only from visual features.

Language Guided Networks for Cross-modal Moment Retrieval

no code implementations18 Jun 2020 Kun Liu, Huadong Ma, Chuang Gan

In this paper, we present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.

Moment Retrieval Retrieval +3

A Real-time Action Representation with Temporal Encoding and Deep Compression

no code implementations17 Jun 2020 Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, Chuang Gan

Our method achieves clear improvements on UCF101 action recognition benchmark against state-of-the-art real-time methods by 5. 4% in terms of accuracy and 2 times faster in terms of inference speed with a less than 5MB storage model.

Action Recognition

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

4 code implementations ACL 2020 Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han

To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search.

Machine Translation Neural Architecture Search +1

Deep Audio Priors Emerge From Harmonic Convolutional Networks

no code implementations ICLR 2020 Zhoutong Zhang, Yunyun Wang, Chuang Gan, Jiajun Wu, Joshua B. Tenenbaum, Antonio Torralba, William T. Freeman

We show that networks using Harmonic Convolution can reliably model audio priors and achieve high performance in unsupervised audio restoration tasks.

Once for All: Train One Network and Specialize it for Efficient Deployment

1 code implementation ICLR 2020 Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

Most of the traditional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally expensive and unscalable.

Neural Architecture Search

Dense Regression Network for Video Grounding

1 code implementation CVPR 2020 Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan

The key idea of this paper is to use the distances between the frame within the ground truth and the starting (ending) frame as dense supervisions to improve the video grounding accuracy.

Natural Language Moment Retrieval Natural Language Queries +2

Visual Concept-Metaconcept Learning

1 code implementation NeurIPS 2019 Chi Han, Jiayuan Mao, Chuang Gan, Joshua B. Tenenbaum, Jiajun Wu

Humans reason with concepts and metaconcepts: we recognize red and green from visual input; we also understand that they describe the same property of objects (i. e., the color).

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

1 code implementation25 Dec 2019 Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, Joshua B. Tenenbaum

In this paper, we attempt to approach the problem of Audio-Visual Embodied Navigation, the task of planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given only raw egocentric visual and audio sensory data.

Navigate

Cross-channel Communication Networks

1 code implementation NeurIPS 2019 Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh

Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers.

Self-supervised Moving Vehicle Tracking with Stereo Sound

no code implementations ICCV 2019 Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba

At test time, the stereo-sound student network can work independently to perform object localization us-ing just stereo audio and camera meta-data, without any visual input.

Object Localization Visual Localization

TruNet: Short Videos Generation from Long Videos via Story-Preserving Truncation

no code implementations14 Oct 2019 Fan Yang, Xiao Liu, Dongliang He, Chuang Gan, Jian Wang, Chao Li, Fu Li, Shilei Wen

In this work, we introduce a new problem, named as {\em story-preserving long video truncation}, that requires an algorithm to automatically truncate a long-duration video into multiple short and attractive sub-videos with each one containing an unbroken story.

Highlight Detection Video Summarization

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

3 code implementations ICLR 2020 Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum

While these models thrive on the perception-based task (descriptive), they perform poorly on the causal tasks (explanatory, predictive and counterfactual), suggesting that a principled approach for causal reasoning should incorporate the capability of both perceiving complex visual and language inputs, and understanding the underlying dynamics and causal relations.

counterfactual Descriptive +1

Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos

no code implementations1 Oct 2019 Ji Lin, Chuang Gan, Song Han

With such hardware-aware model design, we are able to scale up the training on Summit supercomputer and reduce the training time on Kinetics dataset from 49 hours 55 minutes to 14 minutes 13 seconds, achieving a top-1 accuracy of 74. 0%, which is 1. 6x and 2. 9x faster than previous 3D video models with higher accuracy.

Video Recognition

Graph Convolutional Networks for Temporal Action Localization

1 code implementation ICCV 2019 Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, Chuang Gan

Then we apply the GCNs over the graph to model the relations among different proposals and learn powerful representations for the action classification and localization.

Ranked #4 on Temporal Action Localization on THUMOS’14 (mAP IOU@0.1 metric)

Action Classification Temporal Action Localization

Deep Concept-wise Temporal Convolutional Networks for Action Localization

2 code implementations26 Aug 2019 Xin Li, Tianwei Lin, Xiao Liu, Chuang Gan, WangMeng Zuo, Chao Li, Xiang Long, Dongliang He, Fu Li, Shilei Wen

In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly ascribing to that all channels of 1D feature map, which generally are highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution.

Action Classification Action Localization

Once-for-All: Train One Network and Specialize it for Efficient Deployment

10 code implementations26 Aug 2019 Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4. 0% ImageNet top1 accuracy improvement over MobileNetV3, or same accuracy but 1. 5x faster than MobileNetV3, 2. 6x faster than EfficientNet w. r. t measured latency) while reducing many orders of magnitude GPU hours and $CO_2$ emission.

Neural Architecture Search

Self-Supervised Audio-Visual Co-Segmentation

no code implementations18 Apr 2019 Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh Mcdermott, Antonio Torralba

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data.

Image Segmentation Segmentation +1

Defensive Quantization: When Efficiency Meets Robustness

no code implementations ICLR 2019 Ji Lin, Chuang Gan, Song Han

This paper aims to raise people's awareness about the security of the quantized models, and we designed a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models.

Adversarial Attack Quantization

The Sound of Motions

1 code implementation ICCV 2019 Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba

Sounds originate from object motions and vibrations of surrounding air.

Interpreting Adversarial Examples by Activation Promotion and Suppression

no code implementations3 Apr 2019 Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, Xue Lin

It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers.

Adversarial Robustness

TSM: Temporal Shift Module for Efficient Video Understanding

13 code implementations ICCV 2019 Ji Lin, Chuang Gan, Song Han

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.

3D Action Recognition Action Classification +6

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

8 code implementations5 Nov 2018 Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Li-Min Wang, Shilei Wen

In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos.

Action Recognition Temporal Action Localization

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

2 code implementations NeurIPS 2018 Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, Joshua B. Tenenbaum

Second, the model is more data- and memory-efficient: it performs well after learning on a small number of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering.

Question Answering Representation Learning +1

Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation

no code implementations9 Aug 2018 Lijie Fan, Wenbing Huang, Chuang Gan, Junzhou Huang, Boqing Gong

The recent advances in deep learning have made it possible to generate photo-realistic images by using neural networks and even to extrapolate video frames from an input video clip.

Facial expression generation Image-to-Image Translation +2

Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning

no code implementations CVPR 2018 Chuang Gan, Boqing Gong, Kun Liu, Hao Su, Leonidas J. Guibas

In addition, we also find that a progressive training strategy can foster a better neural network for the video recognition task than blindly pooling the distinct sources of geometry cues together.

Action Recognition Representation Learning +5

The Sound of Pixels

2 code implementations ECCV 2018 Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh Mcdermott, Antonio Torralba

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel.

End-to-End Learning of Motion Representation for Video Understanding

1 code implementation CVPR 2018 Lijie Fan, Wenbing Huang, Chuang Gan, Stefano Ermon, Boqing Gong, Junzhou Huang

Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks.

Action Recognition Optical Flow Estimation +1

Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency

1 code implementation ECCV 2018 Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, Qi-Xing Huang

In this paper, we introduce a novel unsupervised domain adaptation technique for the task of 3D keypoint prediction from a single depth scan or image.

Keypoint Estimation Unsupervised Domain Adaptation

Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

5 code implementations CVPR 2018 Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen

In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets.

Classification General Classification +1

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

1 code implementation ICCV 2017 Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong

Many seemingly distant annotations (e. g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes --- and even the same set of images (e. g., of COCO).

Language Modelling Multiple-choice +4

Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification

no code implementations12 Aug 2017 Yunlong Bian, Chuang Gan, Xiao Liu, Fu Li, Xiang Long, Yandong Li, Heng Qi, Jie zhou, Shilei Wen, Yuanqing Lin

Experiment results on the challenging Kinetics dataset demonstrate that our proposed temporal modeling approaches can significantly improve existing approaches in the large-scale video recognition tasks.

Action Classification General Classification +2

Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

1 code implementation14 Jul 2017 Fu Li, Chuang Gan, Xiao Liu, Yunlong Bian, Xiang Long, Yandong Li, Zhichao Li, Jie zhou, Shilei Wen

This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge that ranked the 3rd place.

Video Recognition Video Understanding

StyleNet: Generating Attractive Visual Captions With Styles

no code implementations CVPR 2017 Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng

We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles.

Caption Generation

Recurrent Topic-Transition GAN for Visual Paragraph Generation

no code implementations ICCV 2017 Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, Eric P. Xing

The proposed Recurrent Topic-Transition Generative Adversarial Network (RTT-GAN) builds an adversarial framework between a structured paragraph generator and multi-level paragraph discriminators.

Generative Adversarial Network Image Paragraph Captioning +1

Video Captioning with Multi-Faceted Attention

no code implementations TACL 2018 Xiang Long, Chuang Gan, Gerard de Melo

Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval.

Information Retrieval Retrieval +2

Semantic Compositional Networks for Visual Captioning

1 code implementation CVPR 2017 Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng

The degree to which each member of the ensemble is used to generate an image caption is tied to the image-dependent probability of the corresponding tag.

Image Captioning Semantic Composition +1

Strategies for Searching Video Content with Text Queries or Video Examples

no code implementations17 Jun 2016 Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann, Chuang Gan, Xingzhong Du, Xiaojun Chang

The large number of user-generated videos uploaded on to the Internet everyday has led to many commercial video search engines, which mainly rely on text metadata for search.

Event Detection Retrieval +1

You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images

no code implementations CVPR 2016 Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, Tao Mei

The Web images are then filtered by the learnt network and the selected images are additionally fed into the network to enhance the architecture and further trim the videos.

Action Recognition Event Detection +1

Learning Attributes Equals Multi-Source Domain Generalization

no code implementations CVPR 2016 Chuang Gan, Tianbao Yang, Boqing Gong

Attributes possess appealing properties and benefit many computer vision problems, such as object recognition, learning with humans in the loop, and image retrieval.

Attribute Domain Generalization +3

DevNet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting

no code implementations CVPR 2015 Chuang Gan, Naiyan Wang, Yi Yang, Dit-yan Yeung, Alex G. Hauptmann

Taking key frames of videos as input, we first detect the event of interest at the video level by aggregating the CNN features of the key frames.

Action Recognition Event Detection +2

Cannot find the paper you are looking for? You can Submit a new open access paper.