1 code implementation • 20 Feb 2025 • Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang
From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels.
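The Instruction-Subtask-Action hierarchy can be pictured as a simple nested data structure. The sketch below is purely illustrative (the class and action names are hypothetical, not the paper's actual API): a user instruction decomposes into subtasks, each grounded in concrete GUI actions.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    # lowest level: one concrete operation (name/args are hypothetical)
    name: str
    args: dict = field(default_factory=dict)

@dataclass
class Subtask:
    # middle level: a goal achieved by a sequence of actions
    goal: str
    actions: list = field(default_factory=list)

@dataclass
class Instruction:
    # top level: the user's request, decomposed into subtasks
    text: str
    subtasks: list = field(default_factory=list)

def execution_order(instr):
    # Flatten the hierarchy into a linear action schedule.
    return [a.name for st in instr.subtasks for a in st.actions]
```

Flattening a toy instruction shows how decisions at the top level fan out into an ordered action sequence at the bottom.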
1 code implementation • 3 Dec 2024 • Guanghui Zhu, Zipeng Ji, Jingyan Chen, LiMin Wang, Chunfeng Yuan, Yihua Huang
GNAS (Graph Neural Architecture Search) has demonstrated great effectiveness in automatically designing the optimal graph neural architectures for multiple downstream tasks, such as node classification and link prediction.
no code implementations • 22 Nov 2024 • Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, Weiming Hu
Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope.
1 code implementation • 15 Nov 2024 • Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu
The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality.
no code implementations • 21 Oct 2024 • Shizhen Zhao, Xin Wen, Jiahui Liu, Chuofan Ma, Chunfeng Yuan, Xiaojuan Qi
To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset.
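The abstract does not give the exact form of the neighbor-silencing loss, but its intent can be sketched as a cross-entropy in which logits of auxiliary classes are scaled down so that competition concentrates among the target dataset's classes. The scaling scheme below is a hypothetical toy form, not the paper's actual loss.

```python
import numpy as np

def neighbor_silencing_ce(logits, label, target_classes, alpha=0.1):
    # Toy cross-entropy: auxiliary-class logits are scaled by alpha
    # so they interfere less with discrimination among target classes.
    # (Hypothetical formulation; the paper's loss may differ.)
    mask = np.ones_like(logits)
    for c in range(len(logits)):
        if c not in target_classes:
            mask[c] = alpha
    z = logits * mask
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[label]))
```

With a dominant auxiliary logit present, silencing it lowers the loss on the correct target class, which is the qualitative effect the paper describes.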
no code implementations • 21 Jul 2024 • Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images underexplored.
no code implementations • 10 Jul 2024 • Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events.
no code implementations • CVPR 2024 • Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu
Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy, while cross-encoder models offer higher accuracy at the expense of efficiency.
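The efficiency side of that trade-off is easy to see in code: a dual encoder embeds candidates once offline, so scoring a query against the whole gallery reduces to a single matrix-vector product (a generic sketch, not this paper's model).

```python
import numpy as np

def dual_encoder_scores(query_emb, candidate_embs):
    # Dual-encoder retrieval: candidate embeddings are precomputed,
    # so one query costs a single normalized matrix-vector product.
    # A cross-encoder would instead run a joint forward pass per pair.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return c @ q  # cosine similarity per candidate
```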
1 code implementation • 8 Mar 2024 • Zewen Chen, Haina Qin, Juan Wang, Chunfeng Yuan, Bing Li, Weiming Hu, Liang Wang
On the other hand, PromptIQA is trained on a mixed dataset with two proposed data augmentation strategies to learn diverse requirements, thus enabling it to effectively adapt to new requirements.
no code implementations • 1 Mar 2024 • Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment.
no code implementations • 26 Feb 2024 • Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
In this work, we propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics and combines the strengths of latent and lexicon representations for video-text retrieval.
no code implementations • 19 Jan 2024 • Zewen Chen, Juan Wang, Bing Li, Chunfeng Yuan, Weiming Hu, Junxian Liu, Peng Li, Yan Wang, Youqun Zhang, Congxuan Zhang
Due to the subjective nature of image quality assessment (IQA), assessing which image has better quality among a sequence of images is more reliable than assigning an absolute mean opinion score for an image.
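Aggregating pairwise judgments into an ordering can be done in several ways (Bradley-Terry models, Elo-style ratings); the minimal toy version below simply ranks images by win count, just to illustrate the comparison-based alternative to absolute mean opinion scores.

```python
from collections import Counter

def rank_from_pairs(pairs):
    # pairs: (winner, loser) quality comparisons.
    # Rank items by number of wins; ties broken by name for determinism.
    wins = Counter(w for w, _ in pairs)
    items = {i for p in pairs for i in p}
    return sorted(items, key=lambda i: (-wins[i], i))
```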
no code implementations • 25 Dec 2023 • Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Peng Li, Yan Wang, Bing Li, Weiming Hu
Each caption in the set is attached to a concept combination indicating the primary semantic content of the caption and facilitating element alignment in set prediction.
1 code implementation • 15 Aug 2023 • Guanghui Zhu, Mengyu Chen, Chunfeng Yuan, Yihua Huang
To this end, we propose a new method named Partial Graph Attack (PGA), which selects vulnerable nodes as attack targets.
no code implementations • 4 Jul 2023 • Guanghui Zhu, Zhennan Zhu, Hongyang Chen, Chunfeng Yuan, Yihua Huang
Then, we propose a novel framework, HAGNN (Hybrid Aggregation for Heterogeneous GNNs), to comprehensively utilize the rich type-semantic information in heterogeneous graphs.
1 code implementation • 8 Jan 2023 • Guanghui Zhu, Zhennan Zhu, Wenjie Wang, Zhuoer Xu, Chunfeng Yuan, Yihua Huang
Moreover, to improve the performance of the downstream graph learning task, attribute completion and the training of the heterogeneous GNN should be jointly optimized rather than viewed as two separate processes.
no code implementations • ICCV 2023 • Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Yingmin Luo, Zekun Li, Chunfeng Yuan, Bing Li, XiaoHu Qie, Ying Shan, Weiming Hu
This paper proposes a novel generative model, Order-Prompted Tag Sequence Generation (OP-TSG), according to the above characteristics.
no code implementations • CVPR 2023 • Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Weiming Hu, XiaoHu Qie, Jianping Wu
ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words by resorting to image information.
Ranked #45 on Visual Reasoning on Winoground
no code implementations • 21 Jul 2022 • Jingfan Chen, Wenqi Fan, Guanghui Zhu, Xiangyu Zhao, Chunfeng Yuan, Qing Li, Yihua Huang
Recent studies have shown that deep neural network-based recommender systems are vulnerable to adversarial attacks, where attackers can inject carefully crafted fake user profiles (i.e., sets of items that fake users have interacted with) into a target recommender system to achieve malicious purposes, such as promoting or demoting a set of target items.
no code implementations • 6 Jul 2022 • Yifan Lu, Ziqi Zhang, Yuxin Chen, Chunfeng Yuan, Bing Li, Weiming Hu
The task of Dense Video Captioning (DVC) aims to generate captions with timestamps for multiple events in one video.
1 code implementation • CVPR 2022 • Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu
They base the visual grounding on the features from pre-generated proposals or anchors, and fuse these features with the text embeddings to locate the target mentioned by the text.
no code implementations • 31 Mar 2022 • Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing Li, Ying Shan, Weiming Hu
In this paper, we propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.
no code implementations • 12 Mar 2022 • Guanghui Zhu, Haojun Hou, Jingfan Chen, Chunfeng Yuan, Yihua Huang
Specifically, TRASA first converts the session into a graph and then encodes the shortest path between items through a gated recurrent unit (GRU) as their transition relation.
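The graph-construction and shortest-path steps can be sketched directly; the GRU encoding that TRASA applies on top of the path is omitted here, so this is only the preprocessing half of the described pipeline.

```python
from collections import deque

def session_to_graph(session):
    # Consecutive item clicks in the session become directed edges.
    g = {}
    for a, b in zip(session, session[1:]):
        g.setdefault(a, set()).add(b)
    return g

def shortest_path(g, src, dst):
    # BFS finds the shortest transition path between two items;
    # TRASA would then feed this path through a GRU to encode
    # the transition relation (the GRU itself is omitted).
    q, seen = deque([[src]]), {src}
    while q:
        path = q.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(g.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                q.append(path + [nxt])
    return None
```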
1 code implementation • CVPR 2022 • Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, Zheng-Jun Zha
The datasets will be released to facilitate the development of video captioning metrics.
2 code implementations • ICCV 2021 • Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, Weiming Hu
Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition.
Ranked #11 on Skeleton Based Action Recognition on N-UCLA
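For readers unfamiliar with GCNs, the basic building block these skeleton models stack over the joint graph is the standard Kipf-Welling graph convolution, H = ReLU(D^{-1/2}(A + I)D^{-1/2} X W) — a generic layer, not this paper's specific architecture.

```python
import numpy as np

def gcn_layer(A, X, W):
    # One generic graph-convolution layer:
    #   H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)
    # A: adjacency (e.g., skeleton joint connectivity), X: node features.
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)
```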
1 code implementation • 28 Apr 2021 • Li Yang, Yan Xu, Shaoru Wang, Chunfeng Yuan, Ziqi Zhang, Bing Li, Weiming Hu
However, the most suitable positions for inferring different targets, i.e., the object category and boundaries, are generally different.
no code implementations • CVPR 2021 • Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Ying Deng, Weiming Hu
Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.
1 code implementation • 17 Oct 2020 • Guanghui Zhu, Zhuoer Xu, Xu Guo, Chunfeng Yuan, Yihua Huang
Extensive experiments on classification and regression datasets demonstrate that DIFER can significantly improve the performance of various machine learning algorithms and outperform current state-of-the-art AutoFE methods in terms of both efficiency and performance.
1 code implementation • 29 May 2020 • Jingfan Chen, Guanghui Zhu, Chunfeng Yuan, Yihua Huang
Bayesian optimization is a broadly applied methodology to optimize the expensive black-box function.
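A minimal instance of that loop: fit a Gaussian-process surrogate to the points evaluated so far, maximize an upper-confidence-bound acquisition over a grid, evaluate the black-box function there, and repeat. This is a bare-bones sketch (fixed RBF length-scale, grid search instead of continuous acquisition optimization), not the paper's method.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # RBF kernel between two 1-D point sets.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def bayes_opt(f, grid, n_iter=10, beta=2.0):
    # GP surrogate + UCB acquisition, maximized over a fixed grid.
    X = np.array([grid[0], grid[len(grid) // 2], grid[-1]])  # initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        K = rbf(X, X) + 1e-6 * np.eye(len(X))       # jitter for stability
        Ks = rbf(grid, X)
        mu = Ks @ np.linalg.solve(K, y)             # posterior mean
        var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
        ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))
        x_next = grid[int(np.argmax(ucb))]          # most promising point
        X, y = np.append(X, x_next), np.append(y, f(x_next))
    return X[int(np.argmax(y))], float(y.max())
```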
no code implementations • CVPR 2020 • Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zheng-Jun Zha
In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy.
Ranked #9 on Video Captioning on VATEX (using extra training data)
no code implementations • 13 Oct 2019 • Ziqi Zhang, Yaya Shi, Jiutong Wei, Chunfeng Yuan, Bing Li, Weiming Hu
Multi-modal information is essential to describe what has happened in a video.
no code implementations • 8 May 2019 • Liang Sun, Bing Li, Chunfeng Yuan, Zheng-Jun Zha, Weiming Hu
Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning.
no code implementations • ECCV 2018 • Yang Du, Chunfeng Yuan, Bing Li, Lili Zhao, Yangxi Li, Weiming Hu
Furthermore, since different layers in a deep network capture feature maps of different scales, we use these feature maps to construct a spatial pyramid and then utilize the multi-scale information to obtain more accurate attention scores. These scores are used to weight the local features at all spatial positions of the feature maps to compute attention maps.
no code implementations • CVPR 2017 • Yang Du, Chunfeng Yuan, Bing Li, Weiming Hu, Stephen Maybank
In dynamic object detection, it is challenging to construct an effective model to sufficiently characterize the spatial-temporal properties of the background.
1 code implementation • 19 Apr 2016 • Yanghao Li, Cuiling Lan, Junliang Xing, Wen-Jun Zeng, Chunfeng Yuan, Jiaying Liu
In this paper, we study the problem of online action detection from streaming skeleton data.
no code implementations • CVPR 2015 • Shuang Yang, Chunfeng Yuan, Baoxin Wu, Weiming Hu, Fangshi Wang
In this paper, a multi-feature max-margin hierarchical Bayesian model (M3HBM) is proposed for action recognition.
no code implementations • CVPR 2014 • Xinchu Shi, Haibin Ling, Weiming Hu, Chunfeng Yuan, Junliang Xing
In this paper, we model interactions between neighbor targets by pair-wise motion context, and further encode such context into the global association optimization.
no code implementations • CVPR 2014 • Baoxin Wu, Chunfeng Yuan, Weiming Hu
Then, the proposed CGKs are applied to measure the similarity between actions represented by the two-graph model.
no code implementations • CVPR 2013 • Chunfeng Yuan, Weiming Hu, Guodong Tian, Shuang Yang, Haoran Wang
In this paper, we formulate human action recognition as a novel Multi-Task Sparse Learning (MTSL) framework, which aims to reconstruct a test sample with multiple features from as few bases as possible.
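The "few bases" idea is the classic sparse-coding step; a standard stand-in for it is orthogonal matching pursuit, which greedily picks the k dictionary columns that best reconstruct the sample. This is a generic OMP sketch, not the paper's multi-task formulation.

```python
import numpy as np

def omp(D, x, k):
    # Orthogonal matching pursuit: greedily reconstruct x from at most
    # k columns (bases) of dictionary D, refitting coefficients by
    # least squares at every step.
    residual, idx = x.astype(float).copy(), []
    coef = np.array([])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best-correlated atom
        if j not in idx:
            idx.append(j)
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        residual = x - D[:, idx] @ coef
    code = np.zeros(D.shape[1])
    code[idx] = coef
    return code
```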
no code implementations • CVPR 2013 • Chunfeng Yuan, Xi Li, Weiming Hu, Haibin Ling, Stephen Maybank
In this paper, we propose a new global feature to capture the detailed geometrical distribution of interest points.