Search Results for author: Peihao Chen

Found 19 papers, 11 papers with code

3D-VLA: A 3D Vision-Language-Action Generative World Model

no code implementations 14 Mar 2024 Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.

Language Modelling Large Language Model +1

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

no code implementations 16 Jan 2024 Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world.

Language Modelling Large Language Model

SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

1 code implementation 14 Dec 2023 Shuailei Ma, Yuefeng Wang, Ying Wei, Jiaqi Fan, Enming Zhang, Xinyu Sun, Peihao Chen

Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects.

Knowledge Distillation Object +3

DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning

no code implementations 10 Dec 2023 Kunyang Lin, Yufeng Wang, Peihao Chen, Runhao Zeng, Siyuan Zhou, Mingkui Tan, Chuang Gan

In this paper, we propose a new approach that enables agents to learn whether their behaviors should be consistent with that of other agents by utilizing intrinsic rewards to learn the optimal policy for each agent.

Multi-agent Reinforcement Learning reinforcement-learning +2
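
The snippet above only names the mechanism, so here is a minimal sketch of what a consistency-based intrinsic reward could look like. It assumes a Jensen-Shannon-style similarity between two agents' action distributions; the function name and the fixed `want_consistency` flag are illustrative assumptions (in the paper, whether to be consistent is learned), not the authors' implementation.

```python
import numpy as np

def dynamic_consistency_reward(own_probs, peer_probs, want_consistency):
    """Toy intrinsic reward: reward behavioral similarity to a peer when
    consistency is desired, and dissimilarity otherwise.
    own_probs, peer_probs: action distributions (1-D arrays summing to 1).
    want_consistency: bool flag (learned in the paper, fixed here)."""
    m = 0.5 * (own_probs + peer_probs)
    kl = lambda p, q: np.sum(p * np.log((p + 1e-8) / (q + 1e-8)))
    divergence = 0.5 * kl(own_probs, m) + 0.5 * kl(peer_probs, m)
    similarity = 1.0 - divergence  # higher = more consistent behavior
    return similarity if want_consistency else -similarity

# Example: two agents choosing among three actions.
r = dynamic_consistency_reward(np.array([0.7, 0.2, 0.1]),
                               np.array([0.6, 0.3, 0.1]),
                               want_consistency=True)
```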

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

no code implementations 6 Nov 2023 Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan

A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far.

CoLA Question Answering +5
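
A rough sketch of the communicative-decoding loop described above, assuming hypothetical stand-ins `llm_step`, `detect_regions`, and `encode_regions` for the paper's LLM and detection network (trivial stubs are included so the sketch runs; none of this is the authors' code):

```python
def llm_step(tokens):            # stub: a real LLM would predict here
    return "<eos>"

def detect_regions(image, text): # stub: detector proposes relevant regions
    return [(0, 0, 1, 1)]

def encode_regions(regions):     # stub: turn boxes into feature tokens
    return [f"<box:{r}>" for r in regions]

COM = "<com>"

def communicative_decode(prompt_tokens, image, max_steps=50):
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        nxt = llm_step(tokens)
        if nxt == COM:
            # The communication token asks the detection network for regions
            # relevant to the sentence generated so far; their encodings are
            # fed back into the LLM's context.
            tokens += encode_regions(detect_regions(image, " ".join(tokens)))
        else:
            tokens.append(nxt)
            if nxt == "<eos>":
                break
    return tokens

print(communicative_decode(["a", "cat"], image=None))
```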

A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

no code implementations 15 Aug 2023 Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H. Li, Gaowen Liu, Mingkui Tan, Chuang Gan

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate following a path described by language instructions without requiring any path-instruction annotation data.

Navigate Robot Navigation +1

Learning Vision-and-Language Navigation from YouTube Videos

1 code implementation ICCV 2023 Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it.

Navigate Vision and Language Navigation

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

1 code implementation 20 Jul 2023 Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu

Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved.

Speech Emotion Recognition

Detecting the open-world objects with the help of the Brain

1 code implementation 21 Mar 2023 Shuailei Ma, Yuefeng Wang, Ying Wei, Peihao Chen, Zhixiang Ye, Jiaqi Fan, Enming Zhang, Thomas H. Li

We propose leveraging the vision-language (VL) model as the "Brain" of the open-world detector by simply generating unknown labels.

Object object-detection +1
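
A toy illustration of the unknown-label generation idea, assuming a hypothetical `vl_score` that returns a vision-language matching score between a region and a class name; everything here is an illustrative assumption, not the paper's pipeline:

```python
def vl_score(region, text):      # stub for a grounded VL model
    return 0.2

KNOWN = ["person", "car", "dog"]

def label_proposals(proposals, thr=0.5):
    labels = []
    for region in proposals:
        scores = {c: vl_score(region, c) for c in KNOWN}
        best = max(scores, key=scores.get)
        # Regions the VL "Brain" cannot confidently match to any known
        # class are labelled as unknown for open-world training.
        labels.append(best if scores[best] >= thr else "unknown")
    return labels

print(label_proposals([(0, 0, 10, 10)]))
```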

Learning Active Camera for Multi-Object Navigation

no code implementations 14 Oct 2022 Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, Chuang Gan

How to make robots perceive the environment as efficiently as humans is a fundamental problem in robotics.

Navigate Object

Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

1 code implementation 14 Oct 2022 Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan

To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both the spatial locations and the semantic information of objects in the environment.

Navigate Vision and Language Navigation
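
One way to picture such a map is a top-down grid where each cell holds occupancy plus a per-class evidence histogram. The sketch below is a minimal toy under that assumption; the grid size, resolution, and update rule are illustrative, not the paper's multi-granularity formulation.

```python
import numpy as np

class SemanticMap:
    def __init__(self, size=200, num_classes=20, cell_m=0.1):
        self.occ = np.zeros((size, size))                # occupancy grid
        self.sem = np.zeros((size, size, num_classes))   # class evidence
        self.cell_m = cell_m                             # metres per cell

    def update(self, x_m, y_m, class_id, conf=1.0):
        """Project one detected object point into the grid."""
        i, j = int(x_m / self.cell_m), int(y_m / self.cell_m)
        self.occ[i, j] = 1.0
        self.sem[i, j, class_id] += conf  # accumulate semantic evidence

m = SemanticMap()
m.update(3.2, 5.7, class_id=4, conf=0.9)  # e.g. a chair seen at (3.2m, 5.7m)
```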

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

1 code implementation 27 Oct 2020 Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition.

Representation Learning Retrieval +2
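
To make the relative-speed idea concrete, here is a simplified stand-in objective: pull together embeddings of clips played at the same speed and push apart different speeds. The margin hinge and cosine similarity are assumptions for illustration; the paper uses its own contrastive losses.

```python
import torch
import torch.nn.functional as F

def relative_speed_loss(feat_a, feat_b, same_speed):
    """feat_*: (batch, dim) clip embeddings from two sampled clips.
    same_speed: (batch,) in {0, 1}, 1 if both clips share a playback speed."""
    sim = F.cosine_similarity(feat_a, feat_b)   # (batch,)
    target = same_speed.float() * 2 - 1         # {0,1} -> {-1,+1}
    return F.relu(0.5 - target * sim).mean()    # margin hinge

a, b = torch.randn(8, 128), torch.randn(8, 128)
loss = relative_speed_loss(a, b, torch.randint(0, 2, (8,)))
```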

Location-aware Graph Convolutional Networks for Video Question Answering

1 code implementation 7 Aug 2020 Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan

In this work, we propose to represent the contents in the video as a location-aware graph by incorporating the location information of an object into the graph construction.

Action Recognition graph construction +3
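
A minimal sketch of location-aware graph construction under simplifying assumptions: each node concatenates an object's appearance feature with its normalized box, and edges come from feature affinity. The similarity-softmax adjacency is an illustrative choice, not the paper's exact construction.

```python
import torch

def build_location_aware_graph(obj_feats, boxes):
    """obj_feats: (N, D) appearance features; boxes: (N, 4) in [0, 1]."""
    nodes = torch.cat([obj_feats, boxes], dim=-1)  # inject location info
    sim = nodes @ nodes.t()                        # (N, N) affinity
    adj = torch.softmax(sim, dim=-1)               # row-normalized edges
    return nodes, adj

# A GCN layer over this graph would then compute H' = adj @ H @ W.
nodes, adj = build_location_aware_graph(torch.randn(5, 256), torch.rand(5, 4))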

Foley Music: Learning to Generate Music from Videos

no code implementations ECCV 2020 Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba

In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments.

Music Generation Translation

Generating Visually Aligned Sound from Videos

1 code implementation 14 Jul 2020 Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan

During testing, the audio forwarding regularizer is removed to ensure that REGNET can produce purely aligned sound only from visual features.
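
The train/test asymmetry can be sketched as a module whose audio branch is only active during training; module names and sizes below are illustrative assumptions, not REGNET's actual architecture.

```python
import torch
import torch.nn as nn

class RegNetSketch(nn.Module):
    """Toy version: an audio-forwarding branch feeds ground-truth audio
    features into the generator during training and is dropped at test
    time, so sound is produced from visual features alone."""
    def __init__(self, d=256):
        super().__init__()
        self.visual_enc = nn.Linear(512, d)
        self.audio_fwd = nn.Linear(128, d)   # the forwarding regularizer
        self.generator = nn.Linear(d, 64)    # predicts spectrogram frames

    def forward(self, visual, audio=None):
        h = self.visual_enc(visual)
        if self.training and audio is not None:
            h = h + self.audio_fwd(audio)    # used only while training
        return self.generator(h)

net = RegNetSketch().eval()
spec = net(torch.randn(2, 512))  # inference: visual features only
```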

Dense Regression Network for Video Grounding

1 code implementation CVPR 2020 Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan

The key idea of this paper is to use the distances from each frame within the ground-truth moment to the starting (ending) frame as dense supervision to improve video grounding accuracy.

Natural Language Moment Retrieval Natural Language Queries +2
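
The dense supervision described above can be written down directly; the sketch below computes per-frame distances to the moment boundaries, with the -1 ignore mask being an illustrative convention rather than the paper's exact encoding.

```python
import numpy as np

def dense_regression_targets(num_frames, start, end):
    """For every frame inside the ground-truth moment [start, end], the
    targets are its distances to the starting and ending frames; frames
    outside the moment are masked out."""
    t = np.arange(num_frames)
    inside = (t >= start) & (t <= end)
    d_start = np.where(inside, t - start, -1)  # -1 marks ignored frames
    d_end = np.where(inside, end - t, -1)
    return d_start, d_end

# e.g. a 10-frame video whose ground-truth moment spans frames 3..6
print(dense_regression_targets(10, 3, 6))
```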

Self-supervised Moving Vehicle Tracking with Stereo Sound

no code implementations ICCV 2019 Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba

At test time, the stereo-sound student network can work independently to perform object localization using just stereo audio and camera metadata, without any visual input.

Object Localization Visual Localization
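
A toy rendering of the teacher-student setup implied above: a vision teacher's localization output supervises an audio student that sees only stereo features plus camera metadata, so the student can run alone at test time. All shapes and module choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * 64 + 4, 128), nn.ReLU(),
                                 nn.Linear(128, 4))  # predicts a bounding box

    def forward(self, stereo_feat, cam_meta):
        return self.net(torch.cat([stereo_feat, cam_meta], dim=-1))

student = AudioStudent()
stereo = torch.randn(1, 128)     # left + right audio features
meta = torch.randn(1, 4)         # camera metadata (e.g. pose)
teacher_box = torch.randn(1, 4)  # from the vision teacher (training only)
loss = ((student(stereo, meta) - teacher_box) ** 2).mean()
```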
