Search Results for author: Zhenfang Chen

Found 34 papers, 11 papers with code

Compositional Physical Reasoning of Objects and Events from Videos

no code implementations • 2 Aug 2024 • Zhenfang Chen, Shilong Dong, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

The model is evaluated based on its capability to unravel the compositional hidden properties, such as mass and charge, and use this knowledge to answer a set of questions.

counterfactual · Question Answering

FlexAttention for Efficient High-Resolution Vision-Language Models

no code implementations • 29 Jul 2024 • Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan

Specifically, a high-resolution image is encoded as both high-resolution and low-resolution tokens, and only the low-resolution tokens plus a few selected high-resolution tokens are used to compute the attention map, which greatly reduces the computational cost.
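
As a rough illustration of the token-selection scheme described above, here is a minimal PyTorch sketch; the shapes, the mean-query relevance heuristic, and the shared key/value matrix are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def flex_attention_sketch(q, kv_low, kv_high, num_selected=64):
    """Toy sketch: attend over all low-resolution tokens plus only a small,
    dynamically selected subset of high-resolution tokens.

    q:       (T, d)  query tokens
    kv_low:  (L, d)  low-resolution image tokens
    kv_high: (H, d)  high-resolution image tokens, H >> L
    """
    # Score high-res tokens by similarity to the mean query; keep the top-k.
    relevance = kv_high @ q.mean(dim=0)            # (H,)
    top = relevance.topk(min(num_selected, kv_high.size(0))).indices
    kv = torch.cat([kv_low, kv_high[top]], dim=0)  # (L + k, d): reduced token set

    # Standard scaled dot-product attention over the reduced token set.
    attn = F.softmax(q @ kv.T / q.size(-1) ** 0.5, dim=-1)
    return attn @ kv

out = flex_attention_sketch(torch.randn(8, 32), torch.randn(16, 32), torch.randn(256, 32))
```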

STAR: A Benchmark for Situated Reasoning in Real-World Videos

no code implementations • NeurIPS 2021 • Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

This paper introduces Situated Reasoning in Real-World Videos (STAR), a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering on real-world videos.

Logical Reasoning · Question Answering

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

no code implementations • 9 Feb 2024 • Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, Chuang Gan

We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, showing that current AI models lack physical common sense for continuum substances, especially soft bodies, and illustrating the value of the proposed dataset.

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

no code implementations • 30 Jan 2024 • Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.

Language Modelling · Large Language Model · +1
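
The reward model ensemble in the title can be pictured with a minimal sketch like the following, which simply averages the scalar rewards of several linear heads; the architecture and mean aggregation are assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn as nn

class RewardEnsembleSketch(nn.Module):
    """Toy ensemble of reward models: score a response embedding with
    several heads and average their scalar rewards (the mean could be
    swapped for a lower-confidence bound to be more conservative)."""

    def __init__(self, hidden_dim=64, num_members=4):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_members)
        )

    def forward(self, response_embedding):
        rewards = torch.stack([m(response_embedding) for m in self.members])
        return rewards.mean(dim=0)  # aggregate ensemble reward

reward = RewardEnsembleSketch()(torch.randn(64))
```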

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

no code implementations • 6 Nov 2023 • Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan

A communication token is generated by the LLM after a visual entity or a relation, informing the detection network to propose regions relevant to the sentence generated so far.

CoLA · Question Answering · +5
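
A toy sketch of the communicative decoding loop described above might look as follows; `lm_step` and `detector` are hypothetical stand-ins, not CoVLM's actual interfaces.

```python
def generate_with_communication(lm_step, detector, prompt_tokens,
                                comm_token="<visual>", max_len=20):
    """When the language model emits a communication token, query a
    detection module with the sentence so far and splice the returned
    region information back into the context."""
    tokens = list(prompt_tokens)
    for _ in range(max_len):
        nxt = lm_step(tokens)              # next token from the LM
        if nxt == comm_token:
            regions = detector(tokens)     # propose regions for the text so far
            tokens.extend(regions)         # feed region tokens back to the LM
        else:
            tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens

toy = generate_with_communication(
    lm_step=lambda t: "<eos>",             # stub LM
    detector=lambda t: ["<region_0>"],     # stub detector
    prompt_tokens=["a", "dog"],
)
```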

Sparse Universal Transformer

2 code implementations • 11 Oct 2023 • Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers.
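
The layer-sharing idea is easy to sketch: one Transformer encoder layer applied repeatedly. The sketch below omits the paper's sparse mixture-of-experts components and uses illustrative sizes.

```python
import torch
import torch.nn as nn

class UniversalTransformerSketch(nn.Module):
    """Minimal sketch of Universal Transformer weight sharing: the same
    encoder layer (same parameters) is applied at every depth step."""

    def __init__(self, d_model=32, nhead=4, num_steps=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True
        )
        self.num_steps = num_steps

    def forward(self, x):                    # x: (batch, seq, d_model)
        for _ in range(self.num_steps):      # same parameters at every "layer"
            x = self.shared_layer(x)
        return x

out = UniversalTransformerSketch()(torch.randn(2, 10, 32))
```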

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

no code implementations • ICCV 2023 • Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan

To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques.

Graph Generation · Panoptic Scene Graph Generation · +1
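
A hedged sketch of how the four modules might compose into a pipeline; every callable here is a hypothetical stand-in for the corresponding module, not the framework's real interface.

```python
def text_psg_pipeline(image, caption, region_grouper, entity_grounder,
                      segment_merger, label_generator):
    """Illustrative composition of the four TextPSG modules listed above."""
    regions = region_grouper(image)               # group pixels into candidate regions
    grounded = entity_grounder(regions, caption)  # align regions with caption entities
    segments = segment_merger(grounded)           # merge regions into panoptic segments
    return label_generator(segments, caption)     # emit (subject, predicate, object) triples

toy_graph = text_psg_pipeline(
    image=None, caption="a dog on the grass",
    region_grouper=lambda img: ["r0", "r1"],
    entity_grounder=lambda regions, cap: regions,
    segment_merger=lambda grounded: grounded,
    label_generator=lambda segs, cap: [("dog", "on", "grass")],
)
```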

SALMON: Self-Alignment with Instructable Reward Models

1 code implementation • 9 Oct 2023 • Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents.

In-Context Learning · Language Modelling

ModuleFormer: Modularity Emerges from Mixture-of-Experts

1 code implementation • 7 Jun 2023 • Yikang Shen, Zheyu Zhang, Tianyou Cao, Shawn Tan, Zhenfang Chen, Chuang Gan

In our experiments, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) efficiency: since ModuleFormer activates only a subset of its modules for each input token, it can match the performance of dense LLMs with more than twice the throughput; 2) extendability: ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can easily be extended with new modules to learn knowledge not covered by the training data; 3) specialisation: finetuning ModuleFormer can specialize a subset of modules to the finetuning task, and task-unrelated modules can easily be pruned for lightweight deployment.

Language Modelling
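
The efficiency claim rests on sparse routing: each token runs through only its top-k modules. Below is a toy top-k mixture sketch with illustrative sizes; the routing and module design are assumptions, not ModuleFormer's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKModulesSketch(nn.Module):
    """Toy sparse mixture layer: each token activates only its top-k
    modules, so compute grows with k rather than the total module count."""

    def __init__(self, d_model=32, num_modules=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_modules)
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(num_modules)
        )
        self.k = k

    def forward(self, x):                          # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over selected modules
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                 # run only each token's selected modules
            for slot in range(self.k):
                out[t] += weights[t, slot] * self.experts[int(idx[t, slot])](x[t])
        return out

out = TopKModulesSketch()(torch.randn(5, 32))
```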

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

1 code implementation • NeurIPS 2023 • Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable.

Diversity · In-Context Learning · +1

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

no code implementations • 7 Apr 2023 • Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

ECL consists of: (i) an instruction parser that translates natural language into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program.

Instruction Following · Self-Supervised Learning
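
As one hedged illustration of the map constructor's role, the toy grid below stores a grounded concept id per cell and can be queried for a concept's locations; the grid representation and API are assumptions for illustration, not ECL's.

```python
import numpy as np

class SemanticMapSketch:
    """Toy 2-D semantic map: each cell records the concept id observed at
    that location; unknown cells hold a sentinel value."""

    def __init__(self, size=64, unknown=-1):
        self.grid = np.full((size, size), unknown, dtype=int)

    def update(self, x, y, concept_id):
        self.grid[y, x] = concept_id      # record the grounded concept

    def locate(self, concept_id):
        ys, xs = np.nonzero(self.grid == concept_id)
        return list(zip(xs.tolist(), ys.tolist()))

m = SemanticMapSketch()
m.update(3, 4, concept_id=7)
assert m.locate(7) == [(3, 4)]
```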

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

1 code implementation • CVPR 2023 • Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.

3D Concept Learning and Reasoning from Multi-View Images

no code implementations • CVPR 2023 • Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan

We suggest that a principled approach to 3D reasoning from multi-view images is to infer a compact 3D representation of the world from those images, ground it on open-vocabulary semantic concepts, and then execute reasoning on this 3D representation.

Question Answering · Visual Question Answering · +1

Planning with Large Language Models for Code Generation

1 code implementation • 9 Mar 2023 • Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan

Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process.

Code Generation · Language Modelling · +1
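
The paper replaces plain beam search or sampling with planning during decoding; as a much-simplified, hedged approximation of using execution feedback, one can sample several candidates and keep the one passing the most test cases. `sample_program` below is a hypothetical sampler, not the paper's algorithm.

```python
def best_of_n_codegen(sample_program, test_cases, n=8):
    """Sample n candidate programs and return the one that passes the
    most test cases; crashing candidates earn no credit."""
    def score(program):
        passed = 0
        for inputs, expected in test_cases:
            try:
                if program(*inputs) == expected:
                    passed += 1
            except Exception:
                pass  # a crashing candidate earns no credit
        return passed

    candidates = [sample_program() for _ in range(n)]
    return max(candidates, key=score)

add = best_of_n_codegen(
    sample_program=lambda: (lambda a, b: a + b),  # stub sampler
    test_cases=[((1, 2), 3), ((0, 5), 5)],
    n=4,
)
```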

Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners

no code implementations • 15 Dec 2022 • Zitian Chen, Yikang Shen, Mingyu Ding, Zhenfang Chen, Hengshuang Zhao, Erik Learned-Miller, Chuang Gan

To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad').

Multi-Task Learning

S$^3$-NeRF: Neural Reflectance Field from Shading and Shadow under a Single Viewpoint

no code implementations • 17 Oct 2022 • Wenqi Yang, GuanYing Chen, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong

Different from existing single-view methods which can only recover a 2.5D scene representation (i.e., a normal/depth map for the visible surface), our method learns a neural reflectance field to represent the 3D geometry and BRDFs of a scene.

3D geometry · Novel View Synthesis
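
A neural reflectance field can be pictured as an MLP mapping a 3D point to geometry and BRDF parameters, as in the toy sketch below; the layer sizes and output parameterization are illustrative assumptions, not the S$^3$-NeRF architecture.

```python
import torch
import torch.nn as nn

class ReflectanceFieldSketch(nn.Module):
    """Toy neural reflectance field: an MLP mapping a 3-D point to a
    geometry value (here a signed distance) and simple BRDF parameters."""

    def __init__(self, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sdf_head = nn.Linear(hidden, 1)   # geometry: signed distance
        self.brdf_head = nn.Linear(hidden, 4)  # e.g. albedo (3) + roughness (1)

    def forward(self, points):                 # points: (N, 3)
        h = self.backbone(points)
        return self.sdf_head(h), self.brdf_head(h)

sdf, brdf = ReflectanceFieldSketch()(torch.randn(10, 3))
```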

PS-NeRF: Neural Inverse Rendering for Multi-view Photometric Stereo

no code implementations • 23 Jul 2022 • Wenqi Yang, GuanYing Chen, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong

It then jointly optimizes the surface normals, spatially-varying BRDFs, and lights based on a shadow-aware differentiable rendering layer.

Inverse Rendering · Neural Rendering

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

no code implementations • ICLR 2022 • Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

In this paper, we take an initial step to highlight the importance of inferring the hidden physical properties not directly observable from visual appearances, by introducing the Compositional Physical Reasoning (ComPhy) dataset.

A Unified Framework for Masked and Mask-Free Face Recognition via Feature Rectification

1 code implementation • 15 Feb 2022 • Shaozhe Hao, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong

We introduce rectification blocks to rectify features extracted by a state-of-the-art recognition model, in both spatial and channel dimensions, to minimize the distance between a masked face and its mask-free counterpart in the rectified feature space.

Face Recognition
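
A minimal sketch of the rectification idea, assuming a residual block and an L2 objective pulling a masked face's rectified feature toward its mask-free counterpart; the dimensions and residual form are illustrative, not the paper's design.

```python
import torch
import torch.nn as nn

class RectificationBlockSketch(nn.Module):
    """Toy residual rectification block: maps the feature of a masked face
    toward its mask-free counterpart."""

    def __init__(self, dim=128):
        super().__init__()
        self.rectify = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, masked_feat):
        return masked_feat + self.rectify(masked_feat)  # residual correction

block = RectificationBlockSketch()
masked, mask_free = torch.randn(4, 128), torch.randn(4, 128)
loss = (block(masked) - mask_free).pow(2).mean()  # pull rectified feature close
loss.backward()
```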

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

no code implementations • NeurIPS 2021 • Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.

counterfactual · Visual Reasoning

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

no code implementations • 25 Jan 2020 • Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video.

Sentence

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

1 code implementation • ACL 2019 • Zhenfang Chen, Lin Ma, Wenhan Luo, Kwan-Yee K. Wong

In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video.

Diversity · object-detection · +2
