no code implementations • 16 Jun 2025 • Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL).
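No code accompanies this entry; the following is a minimal sketch of what a Chain-of-Tool-Thought loop could look like, where an agent interleaves reasoning steps with tool calls until it commits to an answer. The tool names (`search_day_log`, `inspect_clip`) and the policy interface are hypothetical, not taken from Ego-R1.

```python
# Hypothetical sketch of a Chain-of-Tool-Thought (CoTT) rollout: the agent
# alternates free-form reasoning with discrete tool calls until it answers.
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Step:
    thought: str                  # natural-language reasoning produced by the policy
    tool: Optional[str] = None    # name of the tool to invoke, or None to stop
    tool_input: str = ""          # argument passed to the tool
    answer: Optional[str] = None  # final answer when the agent decides to stop

def run_cott(policy: Callable[[str], Step],
             tools: Dict[str, Callable[[str], str]],
             question: str,
             max_steps: int = 8) -> str:
    """Roll out one CoTT trajectory; the transcript is what an RL trainer would score."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = policy(transcript)                        # policy reads the running transcript
        transcript += f"Thought: {step.thought}\n"
        if step.tool is None:                            # policy chose to answer directly
            return step.answer or ""
        observation = tools[step.tool](step.tool_input)  # execute the chosen tool
        transcript += f"Tool[{step.tool}]({step.tool_input}) -> {observation}\n"
    return "no answer within budget"

# Toy usage with stubbed tools and a scripted "policy".
tools = {"search_day_log": lambda q: f"3 clips mention '{q}'",
         "inspect_clip": lambda cid: f"clip {cid}: person picks up keys"}
script = iter([
    Step("I should search the weekly log first.", tool="search_day_log", tool_input="keys"),
    Step("Clip 2 looks most relevant.", tool="inspect_clip", tool_input="2"),
    Step("The keys were last seen being picked up.", answer="On the kitchen table, Tuesday."),
])
print(run_cott(lambda _: next(script), tools, "Where did I leave my keys?"))
```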
no code implementations • CVPR 2025 • Muli Yang, Gabriel James Goenawan, Huaiyuan Qin, Kai Han, Xi Peng, Yanhua Yang, Hongyuan Zhu
In this paper, we propose Partial Attribute Assignment (PASS), aiming to automatically select and optimize a small, relevant subset of attributes from a large attribute pool.
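As a rough illustration only (not the paper's PASS implementation), attribute-subset selection can be sketched as learning a relevance score per attribute and keeping the top-k scoring attributes at inference time; the sparsity weight and pool size below are arbitrary placeholders.

```python
# Minimal sketch: learn a relevance score per attribute, gate features softly
# during training, and keep only the top-k attributes at inference time.
import torch
import torch.nn as nn

class AttributeSelector(nn.Module):
    def __init__(self, num_attributes: int, k: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_attributes))  # relevance per attribute
        self.k = k

    def forward(self, attribute_feats: torch.Tensor) -> torch.Tensor:
        # attribute_feats: (batch, num_attributes); soft gate during training
        gate = torch.sigmoid(self.logits)
        return attribute_feats * gate

    def selected(self) -> torch.Tensor:
        # hard top-k subset used at inference time
        return torch.topk(self.logits, self.k).indices

selector = AttributeSelector(num_attributes=312, k=32)   # e.g. a CUB-sized attribute pool
feats = torch.randn(4, 312)
gated = selector(feats)
loss = gated.pow(2).mean() + 1e-3 * torch.sigmoid(selector.logits).sum()  # task loss + sparsity
loss.backward()
print(selector.selected()[:5])
```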
1 code implementation • ACM Multimedia 2024 • Changhao He, Hongyuan Zhu, Peng Hu, Xi Peng
To address these problems, we propose a robust framework, dubbed VariatIonal ConTrAstive Learning (VITAL), designed to learn both common and specific information simultaneously.
no code implementations • 20 Oct 2024 • Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei
In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form.
no code implementations • 28 Mar 2024 • Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu
We observe that the data is challenging, with noisy correspondence caused by the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which makes existing cross-modal matching methods ineffective for PTM.
1 code implementation • 29 Jan 2024 • Zhijing Wan, Zhixiang Wang, Yuran Wang, Zheng Wang, Hongyuan Zhu, Shin'ichi Satoh
Existing methods typically measure both the representation and diversity of data based on similarity metrics, such as L2-norm.
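To make the two criteria concrete, here is an illustrative greedy selection under our own assumptions (not the paper's algorithm): a sample scores high if it is close to the dataset mean (representative) and far from the already-selected samples (diverse), both measured with the L2 norm.

```python
# Illustrative coreset selection balancing representativeness and diversity.
import numpy as np

def select_coreset(features: np.ndarray, budget: int, alpha: float = 0.5) -> list:
    center = features.mean(axis=0)
    repr_score = -np.linalg.norm(features - center, axis=1)   # closer to center = more representative
    selected = [int(np.argmax(repr_score))]
    for _ in range(budget - 1):
        d_to_sel = np.min(
            np.linalg.norm(features[:, None] - features[selected][None], axis=2), axis=1)
        score = alpha * repr_score + (1 - alpha) * d_to_sel    # trade off the two criteria
        score[selected] = -np.inf                              # never re-pick a sample
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))
print(select_coreset(feats, budget=10)[:5])
```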
no code implementations • 12 Jan 2024 • Jialiang Tang, Shuo Chen, Gang Niu, Hongyuan Zhu, Joey Tianyi Zhou, Chen Gong, Masashi Sugiyama
Then, we build a fusion-activation mechanism to transfer the valuable domain-invariant knowledge to the student network, while simultaneously encouraging the adapter within the teacher network to learn the domain-specific knowledge of the target data.
1 code implementation • CVPR 2024 • Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen
Recent progress in Large Multimodal Models (LMM) has opened up great possibilities for various applications in the field of human-machine interactions.
2 code implementations • 17 Dec 2023 • Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen
Furthermore, we establish a new benchmark for assessing the performance of large models in understanding multi-modal 3D prompts.
1 code implementation • 30 Nov 2023 • Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen
However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud representations of the 3D scene.
no code implementations • 10 Oct 2023 • Ke Xu, Jiangtao Wang, Hongyuan Zhu, Dingchang Zheng
We attribute this issue to the inappropriate alignment criteria, which disrupt the semantic distance consistency between the feature space and the input space.
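A minimal sketch of a semantic distance-consistency criterion, reflecting our reading of the sentence above rather than the authors' exact loss: pairwise L2 distances in the input space and in the feature space should agree after normalization.

```python
# Distance-consistency loss between input space and feature space (sketch).
import torch

def distance_consistency_loss(x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """x: (batch, input_dim) raw signals, z: (batch, feat_dim) encoder outputs."""
    d_in = torch.cdist(x, x)                      # pairwise distances in input space
    d_feat = torch.cdist(z, z)                    # pairwise distances in feature space
    d_in = d_in / (d_in.max() + 1e-8)             # scale both to [0, 1] so they are comparable
    d_feat = d_feat / (d_feat.max() + 1e-8)
    return torch.mean((d_in - d_feat) ** 2)

x = torch.randn(16, 128)
z = torch.randn(16, 32, requires_grad=True)
loss = distance_consistency_loss(x, z)
loss.backward()
print(float(loss))
```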
no code implementations • 17 Sep 2023 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Many studies focus on improving pretraining or developing new backbones in text-video retrieval.
1 code implementation • 6 Sep 2023 • Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen
Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture.
Ranked #3 on 3D dense captioning on Nr3D
no code implementations • 19 Jul 2023 • Ke Xu, Jiangtao Wang, Hongyuan Zhu, Dingchang Zheng
Therefore, considerable efforts have been made to address the challenge of insufficient data in deep learning by leveraging SSL algorithms.
no code implementations • 7 Jun 2023 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Text-video retrieval contains various challenges, including biases coming from diverse sources.
no code implementations • 20 Apr 2023 • Haoyang Peng, Baopu Li, Bo Zhang, Xin Chen, Tao Chen, Hongyuan Zhu
Then, a novel multi-view prompt fusion module is developed to effectively fuse information from different views to bridge the gap between 3D point cloud data and 2D pre-trained models.
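As a hedged sketch only (the module name and shapes are ours, not the paper's), multi-view fusion can be illustrated as attention-weighted pooling over per-view 2D features:

```python
# Fuse per-view 2D features into one descriptor with learned attention weights.
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # one relevance score per view

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, dim), e.g. features of rendered views
        weights = torch.softmax(self.score(view_feats), dim=1)   # (batch, num_views, 1)
        return (weights * view_feats).sum(dim=1)                 # weighted fusion -> (batch, dim)

fusion = MultiViewFusion(dim=512)
views = torch.randn(2, 6, 512)    # 6 rendered views of a point cloud
print(fusion(views).shape)        # torch.Size([2, 512])
```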
1 code implementation • 31 Mar 2023 • Chuangguan Ye, Hongyuan Zhu, Bo Zhang, Tao Chen
In recent years, research on few-shot learning (FSL) has been fast-growing in the 2D image domain due to its lower requirement for labeled training data and greater generalization to novel classes.
1 code implementation • 31 Mar 2023 • Chuangguan Ye, Hongyuan Zhu, Yongbin Liao, Yanggang Zhang, Tao Chen, Jiayuan Fan
Due to the emergence of powerful computing resources and large-scale annotated datasets, deep learning has seen wide applications in our daily life.
1 code implementation • CVPR 2023 • Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu
Compared with prior arts, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner.
Ranked #4 on 3D dense captioning on Nr3D
1 code implementation • CVPR 2023 • Yanglin Feng, Hongyuan Zhu, Dezhong Peng, Xi Peng, Peng Hu
Recently, with the advent of the Metaverse and AI-Generated Content, cross-modal retrieval has become popular with the burst of 2D and 3D data.
1 code implementation • ICCV 2023 • Yuwei Yang, Munawar Hayat, Zhao Jin, Hongyuan Zhu, Yinjie Lei
Given only the class-level semantic information for unseen objects, we strive to enhance the correspondence, alignment and consistency between the visual and semantic spaces, to synthesise diverse, generic and transferable visual features.
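A hedged sketch of the general idea of synthesising visual features for unseen classes from class-level semantics plus noise; the architecture and dimensions below are placeholders, not the paper's generator.

```python
# Conditional feature generator: semantic embedding + noise -> visual features.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, sem_dim: int, noise_dim: int, feat_dim: int):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim))

    def forward(self, semantics: torch.Tensor, n_per_class: int) -> torch.Tensor:
        # semantics: (num_classes, sem_dim); returns (num_classes * n_per_class, feat_dim)
        sem = semantics.repeat_interleave(n_per_class, dim=0)
        noise = torch.randn(sem.size(0), self.noise_dim)
        return self.net(torch.cat([sem, noise], dim=1))

gen = FeatureGenerator(sem_dim=300, noise_dim=64, feat_dim=256)
unseen_semantics = torch.randn(5, 300)      # e.g. word embeddings of 5 unseen classes
fake_feats = gen(unseen_semantics, n_per_class=10)
print(fake_feats.shape)                     # torch.Size([50, 256])
```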
no code implementations • CVPR 2023 • Yuanbiao Gou, Peng Hu, Jiancheng Lv, Hongyuan Zhu, Xi Peng
Existing studies have empirically observed that the resolution of the low-frequency region is easier to enhance than that of the high-frequency one.
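To illustrate the observation (this is not the paper's method), an image can be split into low- and high-frequency components with an FFT mask so the two regions can be treated differently; the cutoff radius is arbitrary.

```python
# Split a grayscale image into low- and high-frequency components via FFT.
import numpy as np

def split_frequencies(img: np.ndarray, radius: int = 16):
    """img: (H, W) grayscale; returns (low_freq, high_freq) spatial-domain images."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2   # centred low-pass disk
    low = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    high = img - low                                                # residual = high frequencies
    return low, high

img = np.random.rand(64, 64)
low, high = split_frequencies(img)
print(low.std(), high.std())   # the high-frequency part carries the fine detail
```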
1 code implementation • 29 Jun 2022 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
Ranked #11 on Multi-Instance Retrieval on EPIC-KITCHENS-100
1 code implementation • 26 Jun 2022 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality.
Ranked #12 on Video Retrieval on YouCook2
1 code implementation • 26 Jun 2022 • Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim
With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical.
Ranked #13 on Video Retrieval on YouCook2
no code implementations • 23 May 2022 • Peng Hu, Xi Peng, Hongyuan Zhu, Mohamed M. Sabry Aly, Jie Lin
Numerous network compression methods such as pruning and quantization have been proposed to reduce the model size significantly; the key is to find a suitable compression allocation (e.g., pruning sparsity and quantization codebook) for each layer.
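As a toy illustration of what a per-layer compression allocation means (not the search procedure proposed in the paper), a single global magnitude threshold already induces different pruning sparsities per layer:

```python
# Per-layer pruning sparsity implied by one global magnitude threshold.
import torch
import torch.nn as nn

def allocate_pruning(model: nn.Module, global_sparsity: float) -> dict:
    """Return the per-layer sparsity implied by a single global magnitude threshold."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, global_sparsity)
    allocation = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                       # skip biases / norms
            allocation[name] = float((p.detach().abs() < threshold).float().mean())
    return allocation

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
print(allocate_pruning(model, global_sparsity=0.5))  # per-layer sparsities differ
```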
1 code implementation • CVPR 2022 • Xiuchao Sui, Shaohua Li, Xue Geng, Yan Wu, Xinxing Xu, Yong liu, Rick Goh, Hongyuan Zhu
This is mainly because the correlation volume, the basis of pixel matching, is computed as the dot product of the convolutional features of the two images.
Ranked #11 on Optical Flow Estimation on KITTI 2015 (train)
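The correlation volume mentioned above is a standard construction; a sketch of it (with RAFT-style scaling as an assumption on our part) is the all-pairs dot product between the convolutional features of the two images:

```python
# All-pairs correlation volume between two feature maps.
import torch

def correlation_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """f1, f2: (batch, channels, H, W) conv features; returns (batch, H, W, H, W)."""
    b, c, h, w = f1.shape
    f1 = f1.view(b, c, h * w)                         # (b, c, hw)
    f2 = f2.view(b, c, h * w)
    corr = torch.einsum("bci,bcj->bij", f1, f2)       # dot product of all feature pairs
    return corr.view(b, h, w, h, w) / c ** 0.5        # scale by sqrt(channels)

f1, f2 = torch.randn(1, 256, 32, 40), torch.randn(1, 256, 32, 40)
print(correlation_volume(f1, f2).shape)               # torch.Size([1, 32, 40, 32, 40])
```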
no code implementations • 13 Feb 2022 • En Yen Puang, Hao Zhang, Hongyuan Zhu, Wei Jing
In this paper, we present SA-CNN, a hierarchical and lightweight self-attention-based encoding and decoding architecture for representation learning of point cloud data.
1 code implementation • 30 Nov 2021 • Yongbin Liao, Hongyuan Zhu, Yanggang Zhang, Chuangguan Ye, Tao Chen, Jiayuan Fan
For stage two, the bounding box proposals with SPCR are grouped into some subsets, and the instance masks are mined inside each subset with a novel semantic propagation module and a property consistency graph module.
1 code implementation • CVPR 2021 • Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, Jie Lin
Recently, cross-modal retrieval is emerging with the help of deep multimodal learning.
no code implementations • 8 Mar 2021 • Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, Cheston Tan
This paper aims to provide an encyclopedic survey for the field of embodied AI, from its simulators to its research.
no code implementations • 11 Dec 2019 • Tianying Wang, Hao Zhang, Wei Qi Toh, Hongyuan Zhu, Cheston Tan, Yan Wu, Yong liu, Wei Jing
The proposed method is able to efficiently generalize the previously learned task by model fusion to solve the environment adaptation problem.
1 code implementation • NeurIPS 2019 • Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh
Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers.
no code implementations • 24 Sep 2019 • Yi Cheng, Hongyuan Zhu, Ying Sun, Cihan Acar, Wei Jing, Yan Wu, Liyuan Li, Cheston Tan, Joo-Hwee Lim
To the best of our knowledge, this is the first work to explore effective intra- and inter-modality fusion in 6D pose estimation.
no code implementations • ACL 2019 • Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Meng Fang, Rick Siow Mong Goh, Kenneth Kwok
We propose a new neural transfer method termed Dual Adversarial Transfer Network (DATNet) for addressing low-resource Named Entity Recognition (NER).
no code implementations • 21 May 2019 • Zhao Kang, Honghui Xu, Boyu Wang, Hongyuan Zhu, Zenglin Xu
A key step of graph-based approaches is the construction of the similarity graph.
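For context, here is a minimal sketch of one common similarity-graph construction (a k-NN graph with a Gaussian kernel); the paper studies how to build such graphs, and the kernel bandwidth and k below are arbitrary.

```python
# k-NN similarity graph with a Gaussian kernel (common baseline construction).
import numpy as np

def knn_similarity_graph(X: np.ndarray, k: int = 5, sigma: float = 1.0) -> np.ndarray:
    d2 = np.sum((X[:, None] - X[None]) ** 2, axis=2)          # squared pairwise distances
    W = np.exp(-d2 / (2 * sigma ** 2))                        # Gaussian similarities
    np.fill_diagonal(W, 0.0)
    keep = np.argsort(-W, axis=1)[:, :k]                      # indices of k nearest neighbours
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    W = W * mask
    return np.maximum(W, W.T)                                 # symmetrise the graph

X = np.random.default_rng(0).normal(size=(50, 8))
A = knn_similarity_graph(X, k=5)
print(A.shape, (A > 0).sum(axis=1).mean())                    # approximate degree per node
```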
no code implementations • ICLR 2019 • Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Rick Siow Mong Goh, Kenneth Kwok
We propose a new architecture termed Dual Adversarial Transfer Network (DATNet) for addressing low-resource Named Entity Recognition (NER).
no code implementations • 26 Jan 2019 • Changgong Zhang, Fangneng Zhan, Hongyuan Zhu, Shijian Lu
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique: training deep networks with our synthesized images achieves similar or even better scene text detection and recognition performance compared with using real images.
no code implementations • CVPR 2019 • Fangneng Zhan, Hongyuan Zhu, Shijian Lu
Recent advances in generative adversarial networks (GANs) have shown great potential in realistic image synthesis, whereas most existing works address synthesis realism in either the appearance space or the geometry space, but few in both.
no code implementations • 12 Nov 2018 • Anran Wang, Anh Tuan Luu, Chuan-Sheng Foo, Hongyuan Zhu, Yi Tay, Vijay Chandrasekhar
In this paper, we present the Holistic Multi-modal Memory Network (HMMN) framework which fully considers the interactions between different input sources (multi-modal context, question) in each hop.
no code implementations • 22 Aug 2018 • Xi Peng, Yunnan Li, Ivor W. Tsang, Hongyuan Zhu, Jiancheng Lv, Joey Tianyi Zhou
The second is implementing discrete $k$-means with a differentiable neural network that embraces the advantages of parallel computing, online clustering, and clustering-favorable representation learning.
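A compact sketch of the "differentiable k-means" idea referenced above, under our own simplifying assumptions: replace hard assignments with a softmax over negative distances so gradients flow to both the centroids and the representation.

```python
# Soft (differentiable) k-means objective.
import torch

def soft_kmeans_loss(z: torch.Tensor, centroids: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z: (batch, dim) representations; centroids: (k, dim) learnable cluster centres."""
    d2 = torch.cdist(z, centroids) ** 2                 # (batch, k) squared distances
    assign = torch.softmax(-d2 / tau, dim=1)            # soft, differentiable assignments
    return (assign * d2).sum(dim=1).mean()              # expected within-cluster distance

z = torch.randn(32, 16, requires_grad=True)
centroids = torch.randn(4, 16, requires_grad=True)
loss = soft_kmeans_loss(z, centroids)
loss.backward()                                         # gradients reach z and the centroids
print(float(loss))
```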
no code implementations • ICCV 2017 • Hongyuan Zhu, Romain Vial, Shijian Lu
Recently, regression-based object detectors and long-term recurrent convolutional networks (LRCN) have demonstrated superior performance in human action detection and recognition.
no code implementations • 26 Jun 2017 • Hongyuan Zhu, Romain Vial, Shijian Lu, Yonghong Tian, Xian-Bin Cao
In this paper, we present YoTube, a novel network fusion framework for searching action proposals in untrimmed videos, where each action proposal corresponds to a spatio-temporal video tube that potentially locates one human action.
1 code implementation • 17 Jun 2017 • Zhe Wang, Kingsley Kuan, Mathieu Ravaut, Gaurav Manek, Sibo Song, Yuan Fang, Seokhwan Kim, Nancy Chen, Luis Fernando D'Haro, Luu Anh Tuan, Hongyuan Zhu, Zeng Zeng, Ngai Man Cheung, Georgios Piliouras, Jie Lin, Vijay Chandrasekhar
Beyond that, we extend the original competition by including text information in the classification, making this a truly multi-modal approach with vision, audio and text.
no code implementations • CVPR 2016 • Hongyuan Zhu, Jean-Baptiste Weibel, Shijian Lu
RGBD scene recognition has attracted increasing attention due to the rapid development of depth sensors and their wide application scenarios.
no code implementations • 16 Jul 2015 • Hongyuan Zhu, Shijian Lu, Jianfei Cai, Quangqing Lee
Recently, Hosang et al. conducted the first unified study of existing methods in terms of various image-level degradations.
no code implementations • 3 Feb 2015 • Hongyuan Zhu, Fanman Meng, Jianfei Cai, Shijian Lu
Image segmentation refers to the process of dividing an image into non-overlapping, meaningful regions according to human perception, and it has been a classic topic since the early days of computer vision.