no code implementations • 2 Aug 2024 • Zhenfang Chen, Shilong Dong, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan
The model is evaluated on its ability to infer hidden compositional properties, such as mass and charge, and to use this knowledge to answer a set of questions.
no code implementations • 29 Jul 2024 • Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan
Specifically, a high-resolution image is encoded as both high-resolution and low-resolution tokens; only the low-resolution tokens and a few selected high-resolution tokens are used to compute the attention map, which greatly reduces the computational cost.
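As a rough illustration of this mixed-resolution attention, the sketch below attends over all low-resolution tokens plus only the top-k high-resolution tokens. The tensor shapes and the way high-resolution tokens are scored for selection are assumptions made for illustration, not the paper's implementation.

```python
import torch

def mixed_resolution_attention(q, lo_tokens, hi_tokens, hi_scores, k=64):
    """Attend over all low-res tokens plus only the top-k high-res tokens.

    q:         (B, Nq, D) query tokens
    lo_tokens: (B, Nl, D) low-resolution image tokens (always attended)
    hi_tokens: (B, Nh, D) high-resolution image tokens (mostly skipped)
    hi_scores: (B, Nh)    relevance scores for high-res tokens
                          (how these are produced is an assumption here)
    """
    # Keep only the k most relevant high-resolution tokens.
    idx = hi_scores.topk(k, dim=-1).indices                     # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, hi_tokens.size(-1))  # (B, k, D)
    hi_selected = hi_tokens.gather(1, idx)                      # (B, k, D)

    # Attention now runs over Nl + k tokens instead of Nl + Nh.
    kv = torch.cat([lo_tokens, hi_selected], dim=1)
    attn = torch.softmax(q @ kv.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return attn @ kv
```

The saving comes from shrinking the key/value set: the attention cost scales with Nl + k rather than with the full high-resolution token count Nh.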
no code implementations • NeurIPS 2021 • Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan
This paper introduces Situated Reasoning in Real-World Videos (STAR), a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering on real-world videos.
no code implementations • CVPR 2024 • Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan
Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence.
no code implementations • 9 Feb 2024 • Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, Chuang Gan
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, which shows that current AI models still lack physical commonsense about continuum substances, especially soft bodies, and illustrates the value of the proposed dataset.
no code implementations • 30 Jan 2024 • Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
no code implementations • 8 Nov 2023 • Zhenfang Chen, Rui Sun, Wenjun Liu, Yining Hong, Chuang Gan
If not, we initialize a new module needed by the task and specify its inputs and outputs.
no code implementations • 6 Nov 2023 • Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
The LLM generates a communication token after a visual entity or a relation, informing the detection network to propose regions relevant to the sentence generated so far.
2 code implementations • 11 Oct 2023 • Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan
The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
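The core of this weight tying can be shown in a few lines; the minimal sketch below applies one shared encoder layer repeatedly, and omits the dynamic halting and timestep embeddings that full UT variants typically add.

```python
import torch.nn as nn

class UniversalTransformerSketch(nn.Module):
    """Minimal UT-style encoder: one layer's parameters reused at every depth."""

    def __init__(self, d_model=512, nhead=8, steps=6):
        super().__init__()
        # A single layer reused `steps` times, instead of `steps`
        # independently parameterized layers as in a vanilla Transformer.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True)
        self.steps = steps

    def forward(self, x):                 # x: (batch, seq, d_model)
        for _ in range(self.steps):       # same weights at every step
            x = self.shared_layer(x)
        return x
```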
no code implementations • ICCV 2023 • Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan
To tackle this problem, we propose a new framework, TextPSG, consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques.
1 code implementation • 9 Oct 2023 • Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents.
5 code implementations • NeurIPS 2023 • Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs.
1 code implementation • 7 Jun 2023 • Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan
In our experiments, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) efficiency: since ModuleFormer only activates a subset of its modules for each input token, it can match the performance of dense LLMs with more than twice the throughput; 2) extendability: ModuleFormer is more resistant to catastrophic forgetting than dense LLMs and can easily be extended with new modules to learn knowledge not covered by the training data; 3) specialisation: fine-tuning ModuleFormer can specialize a subset of modules to the fine-tuning task, and the task-unrelated modules can easily be pruned for lightweight deployment.
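The sparse activation behind these properties can be illustrated with a generic mixture-of-modules layer, where a router picks the top-k modules per token; this is a schematic sketch, not ModuleFormer's actual routing code.

```python
import torch
import torch.nn as nn

class SparseModuleLayer(nn.Module):
    """Each token activates only its top-k modules, so per-token compute
    stays constant even as more modules are added."""

    def __init__(self, d_model=512, n_modules=16, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_modules)
        # Trailing underscore avoids clashing with nn.Module.modules().
        self.modules_ = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modules))
        self.k = k

    def forward(self, x):                       # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)   # routing probabilities
        topv, topi = gate.topk(self.k, dim=-1)  # k modules per token
        out = torch.zeros_like(x)
        for j, module in enumerate(self.modules_):
            mask = (topi == j).any(dim=-1)      # tokens routed to module j
            if mask.any():
                w = topv[mask][topi[mask] == j].unsqueeze(-1)
                out[mask] = out[mask] + w * module(x[mask])
        return out
```

Extendability then amounts to appending new modules (and widening the router), while specialisation amounts to dropping modules the router rarely selects for a given task.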
1 code implementation • NeurIPS 2023 • Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable.
no code implementations • 7 Apr 2023 • Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua B. Tenenbaum, Chuang Gan
ECL consists of: (i) an instruction parser that translates natural language instructions into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program.
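Schematically, the four components could compose as below for a single instruction; all names and interfaces here are illustrative assumptions rather than the paper's API.

```python
def ecl_step(instruction, observation,
             parser, concept_learner, map_constructor, executor):
    """One hypothetical pass through the four ECL components."""
    program = parser.parse(instruction)                 # (i)  language -> program
    concepts = concept_learner.ground(observation)      # (ii) ground visual concepts
    semantic_map = map_constructor.update(observation,  # (iii) depth + semantic map
                                          concepts)
    return executor.run(program, semantic_map)          # (iv) deterministic policies
```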
1 code implementation • CVPR 2023 • Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan
When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.
no code implementations • CVPR 2023 • Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan
We suggest that a principled approach to 3D reasoning from multi-view images is to first infer a compact 3D representation of the world, ground it in open-vocabulary semantic concepts, and then execute reasoning on this 3D representation.
1 code implementation • 9 Mar 2023 • Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan
Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process.
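For reference, the sampling baseline the snippet refers to is essentially the loop below; this is a hedged sketch assuming a Hugging Face-style causal LM whose output exposes `.logits`, and it does not show the paper's planning-based alternative.

```python
import torch

@torch.no_grad()
def sample_decode(model, input_ids, max_new_tokens=128, temperature=0.8):
    """Plain temperature sampling: draw one token at a time from the
    softmax of the scaled next-token logits."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]         # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```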
1 code implementation • 12 Jan 2023 • Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, Chuang Gan
The see stage scans the image and grounds the visual concept candidates with a visual perception model.
no code implementations • CVPR 2023 • Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik G. Learned-Miller, Chuang Gan
To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad').
no code implementations • 17 Oct 2022 • Wenqi Yang, Guanying Chen, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong
Unlike existing single-view methods, which can only recover a 2.5D scene representation (i.e., a normal/depth map for the visible surface), our method learns a neural reflectance field to represent the 3D geometry and BRDFs of a scene.
no code implementations • 23 Jul 2022 • Wenqi Yang, Guanying Chen, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong
It then jointly optimizes the surface normals, spatially-varying BRDFs, and lights based on a shadow-aware differentiable rendering layer.
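The generic shape of such a joint optimization is a standard analysis-by-synthesis loop. The sketch below assumes a differentiable, shadow-aware `render_fn` and parameter tensors created with `requires_grad=True`; it is illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def inverse_rendering_step(render_fn, params, target, optimizer):
    """One step: render with current normals/BRDFs/lights, compare to the
    observed image, and backpropagate through the renderer."""
    optimizer.zero_grad()
    pred = render_fn(params["normals"], params["brdf"], params["lights"])
    loss = F.mse_loss(pred, target)
    loss.backward()      # gradients flow through the differentiable renderer
    optimizer.step()
    return loss.item()
```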
no code implementations • ICLR 2022 • Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan
In this paper, we take an initial step toward highlighting the importance of inferring hidden physical properties that are not directly observable from visual appearances by introducing the Compositional Physical Reasoning (ComPhy) dataset.
1 code implementation • 15 Feb 2022 • Shaozhe Hao, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong
We introduce rectification blocks that rectify features extracted by a state-of-the-art recognition model in both the spatial and channel dimensions, minimizing the distance between a masked face and its mask-free counterpart in the rectified feature space.
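One plausible form of such a rectification block is a learned channel gate followed by a spatial gate, in the spirit of attention-style feature reweighting; the sketch below is an assumption about the design, not the paper's exact block.

```python
import torch.nn as nn

class RectificationBlock(nn.Module):
    """Illustrative feature rectification in channel and spatial dimensions."""

    def __init__(self, channels):
        super().__init__()
        # Channel rectification: per-channel gates from pooled statistics.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())
        # Spatial rectification: a per-location gate map.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid())

    def forward(self, feat):                  # feat: (B, C, H, W)
        feat = feat * self.channel_gate(feat)
        return feat * self.spatial_gate(feat)
```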
no code implementations • NeurIPS 2021 • Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, Chuang Gan
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
no code implementations • 2 Sep 2021 • Wenqi Yang, Zhenfang Chen, Chaofeng Chen, Guanying Chen, Kwan-Yee K. Wong
In Stage I, we perform face inpainting in the UV space.
1 code implementation • CVPR 2021 • Yuan Liu, Jingyuan Chen, Zhenfang Chen, Bing Deng, Jianqiang Huang, Hanwang Zhang
The key challenge is how to distinguish segments containing the action of interest from the background, which is unlabelled even at the video level.
no code implementations • ICLR 2021 • Zhenfang Chen, Jiayuan Mao, Jiajun Wu, Kwan-Yee Kenneth Wong, Joshua B. Tenenbaum, Chuang Gan
We study the problem of dynamic visual reasoning on raw videos.
no code implementations • CVPR 2020 • Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu
To bridge the gap, we propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features.
no code implementations • 25 Jan 2020 • Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong
In this paper, we study the problem of weakly-supervised temporal grounding of a sentence in a video.
1 code implementation • ACL 2019 • Zhenfang Chen, Lin Ma, Wenhan Luo, Kwan-Yee K. Wong
In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video.
no code implementations • 10 May 2018 • Xiaoyu Yue, Zhanghui Kuang, Zhaoyang Zhang, Zhenfang Chen, Pan He, Yu Qiao, Wei Zhang
Deep CNNs have achieved great success in text detection.