1 code implementation • 21 Mar 2023 • Shuailei Ma, Yuefeng Wang, Ying WEI, Peihao Chen, Zhixiang Ye, Jiaqi Fan, Enming Zhang, Thomas H. Li
We propose leveraging the VL as the ``Brain'' of the open-world detector by simply generating unknown labels.
no code implementations • 14 Oct 2022 • Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, Chuang Gan
How to make robots perceive the environment as efficiently as humans is a fundamental problem in robotics.
1 code implementation • 14 Oct 2022 • Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H. Li, Mingkui Tan, Chuang Gan
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
2 code implementations • 12 Oct 2022 • Xinyu Sun, Peihao Chen, LiangWei Chen, Changhao Li, Thomas H. Li, Mingkui Tan, Chuang Gan
The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions.
Ranked #1 on
Self-Supervised Action Recognition
on UCF101
1 code implementation • 27 Oct 2020 • Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition.
Ranked #7 on
Self-Supervised Action Recognition
on UCF101
1 code implementation • 7 Aug 2020 • Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan
In this work, we propose to represent the contents in the video as a location-aware graph by incorporating the location information of an object into the graph construction.
no code implementations • ECCV 2020 • Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments.
1 code implementation • 14 Jul 2020 • Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan
During testing, the audio forwarding regularizer is removed to ensure that REGNET can produce purely aligned sound only from visual features.
1 code implementation • CVPR 2020 • Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan
The key idea of this paper is to use the distances between the frame within the ground truth and the starting (ending) frame as dense supervisions to improve the video grounding accuracy.
Natural Language Moment Retrieval
Natural Language Queries
+2
no code implementations • ICCV 2019 • Chuang Gan, Hang Zhao, Peihao Chen, David Cox, Antonio Torralba
At test time, the stereo-sound student network can work independently to perform object localization us-ing just stereo audio and camera meta-data, without any visual input.