no code implementations • 20 Dec 2024 • Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, Beibei Li
Given some video-query pairs with untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos.
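TSG predictions are conventionally scored by temporal IoU between the predicted segment and the ground-truth segment; a minimal sketch of that metric (an illustration of the standard evaluation, not code from the paper):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A prediction covering [5, 15] against a ground truth of [10, 20]:
score = temporal_iou((5.0, 15.0), (10.0, 20.0))  # 5 / 15, i.e. 1/3
```

Recall@K at IoU thresholds such as 0.5 and 0.7 is then computed over a test set of query-video pairs.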
1 code implementation • 18 Dec 2024 • Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang
A common parameter-efficient solution is to utilize adapter modules to transfer the vision-language alignment ability of Vision-Language Pretraining (VLP) models from a source language to a target language.
1 code implementation • 24 Aug 2024 • Chen Rao, Guangyuan Li, Zehua Lan, Jiakai Sun, Junsheng Luan, Wei Xing, Lei Zhao, Huaizhong Lin, Jianfeng Dong, Dalong Zhang
Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task.
1 code implementation • 1 Aug 2024 • Xiaoye Qu, Mingyang Song, Wei Wei, Jianfeng Dong, Yu Cheng
In this paper, we make the first attempt to mitigate this important multilingual hallucination in LVLMs.
no code implementations • 1 Aug 2024 • Xiaoye Qu, Qiyuan Chen, Wei Wei, Jishuo Sun, Jianfeng Dong
To assess the capability of our proposed ARA model in reducing hallucination, we employ three widely used LVLMs (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks.
1 code implementation • 3 Apr 2024 • Zhonglin Liu, ShuJie Chen, Jianfeng Dong, Xun Wang, Di Zhou
Achieving high performance in multi-object tracking relies heavily on modeling spatio-temporal relationships during the data association stage.
1 code implementation • 15 Dec 2023 • Zhe Ma, Jianfeng Dong, Shouling Ji, Zhenguang Liu, Xuhong Zhang, Zonghui Wang, Sifeng He, Feng Qian, Xiaobo Zhang, Lei Yang
Instead of crafting a new method to pursue further accuracy gains, in this paper we propose Whiten-MTD, a multi-teacher distillation framework that transfers knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
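A core difficulty in multi-teacher distillation is that different teachers emit similarity scores on incompatible scales. As a rough, hypothetical illustration of the idea (simple score standardization here stands in for the paper's whitening, and all names are invented):

```python
import numpy as np

def whiten(scores):
    """Standardize one teacher's similarity scores to zero mean, unit variance."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-8)

def fused_target(teacher_scores):
    """Average the whitened scores of several teachers into one soft target
    for the student to regress against."""
    return np.mean([whiten(s) for s in teacher_scores], axis=0)

# Two hypothetical teachers scoring the same four candidates on very
# different scales; after whitening their votes become directly comparable.
t1 = [0.9, 0.1, 0.5, 0.3]        # e.g. cosine similarities
t2 = [120.0, 20.0, 60.0, 40.0]   # e.g. unnormalized logits
target = fused_target([t1, t2])
```

Without such normalization, a naive average would be dominated by whichever teacher happens to produce the largest raw scores.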
no code implementations • 14 Dec 2023 • Yabing Wang, Fan Wang, Jianfeng Dong, Hao Luo
Cross-lingual cross-modal retrieval, which aims to achieve alignment between vision and a target language (V-T) without using any annotated V-T data pairs, has garnered increasing attention recently.
1 code implementation • 6 Nov 2023 • Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun Yang, Xun Wang, Meng Wang
In this manner, our framework is able to learn the unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases.
1 code implementation • 13 Sep 2023 • Zhenguang Liu, Xinyang Yu, Ruili Wang, Shuai Ye, Zhe Ma, Jianfeng Dong, Sifeng He, Feng Qian, Xiaobo Zhang, Roger Zimmermann, Lei Yang
We theoretically analyzed the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature.
no code implementations • 11 Sep 2023 • Yabing Wang, Shuhui Wang, Hao Luo, Jianfeng Dong, Fan Wang, Meng Han, Xun Wang, Meng Wang
Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
1 code implementation • 17 May 2023 • Jianfeng Dong, Xiaoman Peng, Zhe Ma, Daizong Liu, Xiaoye Qu, Xun Yang, Jixiang Zhu, Baolong Liu
As the attribute-specific similarity typically corresponds to the specific subtle regions of images, we propose a Region-to-Patch Framework (RPF) that consists of a region-aware branch and a patch-aware branch to extract fine-grained attribute-related visual features for precise retrieval in a coarse-to-fine manner.
no code implementations • 6 May 2023 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Zichuan Xu, Haozhao Wang, Xing Di, Weining Lu, Yu Cheng
This paper addresses temporal sentence grounding (TSG).
1 code implementation • ICCV 2023 • Jianfeng Dong, Minsong Zhang, Zheng Zhang, Xianke Chen, Daizong Liu, Xiaoye Qu, Xun Wang, Baolong Liu
During the knowledge distillation, an inheritance student branch is devised to absorb the knowledge from the teacher model.
1 code implementation • 5 Dec 2022 • Jianfeng Dong, Shengkai Sun, Zhonglin Liu, ShuJie Chen, Baolong Liu, Xun Wang
This paper targets unsupervised skeleton-based action representation learning and proposes a new Hierarchical Contrast (HiCo) framework.
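HiCo's hierarchical contrast builds on standard contrastive objectives applied at multiple granularities. As a hedged illustration of one such level (a generic InfoNCE loss in NumPy, not the paper's exact formulation):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: each anchor should match its own positive view against
    all other positives in the batch (diagonal = matching pairs)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Perfectly aligned views should score a much lower loss than shuffled ones.
x = np.eye(4)
aligned_loss = info_nce(x, x)
shuffled_loss = info_nce(x, np.roll(x, 1, axis=0))
```

In a hierarchical setup, a loss of this form would be applied to representations pooled at several levels (e.g. joint, limb, body), though the exact levels and augmentations are specific to the paper.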
Tasks: Action Recognition, Few-Shot Skeleton-Based Action Recognition, +4
1 code implementation • 26 Aug 2022 • Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang
In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
1 code implementation • 26 Aug 2022 • Jianfeng Dong, Xianke Chen, Minsong Zhang, Xun Yang, ShuJie Chen, Xirong Li, Xun Wang
To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR).
Ranked #1 on Partially Relevant Video Retrieval on TVR
1 code implementation • 23 Jan 2022 • Jianfeng Dong, Yabing Wang, Xianke Chen, Xiaoye Qu, Xirong Li, Yuan He, Xun Wang
In this work, we concentrate on video representation learning, an essential component for text-to-video retrieval.
1 code implementation • 3 Dec 2021 • Fan Hu, Aozhu Chen, Ziyue Wang, Fangming Zhou, Jianfeng Dong, Xirong Li
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval.
Ranked #1 on Ad-hoc video search on TRECVID-AVS20 (V3C1) (using extra training data)
no code implementations • EMNLP 2021 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou
However, the performance of the bottom-up model is inferior to that of its top-down counterpart, as it fails to exploit segment-level interaction.
1 code implementation • 6 Apr 2021 • Jianfeng Dong, Zhe Ma, Xiaofeng Mao, Xun Yang, Yuan He, Richang Hong, Shouling Ji
In this similarity paradigm, one should pay more attention to the similarity in terms of a specific design/attribute between fashion items.
no code implementations • CVPR 2021 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, Yulai Xie
This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
1 code implementation • 18 Feb 2021 • Zhe Ma, Fenghao Liu, Jianfeng Dong, Xiaoye Qu, Yuan He, Shouling Ji
In this paper, we focus on the cross-modal similarity measurement, and propose a novel Hierarchical Similarity Learning (HSL) network.
no code implementations • 2 Feb 2021 • Qi Zheng, Jianfeng Dong, Xiaoye Qu, Xun Yang, Yabing Wang, Pan Zhou, Baolong Liu, Xun Wang
The language-based setting of this task allows for an open set of target activities, resulting in a large variation of the temporal lengths of video moments.
no code implementations • COLING 2020 • Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou
In this paper, we propose a novel deep rectification-modulation network (RMN), transforming this task into a multi-step reasoning process by repeating rectification and modulation.
1 code implementation • 10 Sep 2020 • Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, Meng Wang
In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own.
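The dual-encoding idea, mapping both modalities into one dense space and comparing them there, can be sketched as follows (the encoders themselves are assumed; only the shared-space ranking step is shown):

```python
import numpy as np

def rank_videos(query_vec, video_vecs):
    """Rank videos by cosine similarity to a query, both already encoded
    into the same dense space by their respective (assumed) encoders."""
    q = query_vec / np.linalg.norm(query_vec)
    v = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    sims = v @ q
    return np.argsort(-sims), sims

# Three toy video embeddings; the query points closest to video 1.
videos = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
order, sims = rank_videos(np.array([0.5, 0.9]), videos)
```

The heavy lifting is in learning encoders that place relevant query-video pairs near each other; once trained, retrieval reduces to this kind of nearest-neighbor search.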
Ranked #3 on Ad-hoc video search on TRECVID-AVS16 (IACC.3) (using extra training data)
no code implementations • 6 Aug 2020 • Xiaoye Qu, Pengwei Tang, Zhikang Zhou, Yu Cheng, Jianfeng Dong, Pan Zhou
In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
1 code implementation • 4 Aug 2020 • Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou, Zichuan Xu
To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.
no code implementations • 6 Jul 2020 • Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, Tat-Seng Chua
To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos.
1 code implementation • 8 Apr 2020 • Jianfeng Dong, Xun Wang, Leimin Zhang, Chaoxi Xu, Gang Yang, Xirong Li
Predicting the relevance between two given videos with respect to their visual content is a key component for content-based video recommendation and retrieval.
1 code implementation • 7 Feb 2020 • Zhe Ma, Jianfeng Dong, Yao Zhang, Zhongzi Long, Yuan He, Hui Xue, Shouling Ji
This paper strives to learn fine-grained fashion similarity.
1 code implementation • CVPR 2019 • Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang
This paper attacks the challenging problem of zero-example video retrieval.
no code implementations • 19 Sep 2017 • Tingting Qiao, Jianfeng Dong, Duanqing Xu
Since there is a lack of human attention data, we first propose a Human Attention Network (HAN) to generate human-like attention maps, trained on the recently released Human ATtention Dataset (VQA-HAT).
1 code implementation • 5 Sep 2017 • Jianfeng Dong, Xirong Li, Duanqing Xu
To quantify the current progress, we propose a simple text2image method, representing a novel test query by a set of images selected from a large-scale query log.
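The text2image representation can be sketched as follows (a hedged toy version: how the log images are selected for a query is assumed, and all feature values are invented):

```python
import numpy as np

def text2image_query(selected_image_feats):
    """Represent a text query by the L2-normalized mean feature of the
    images selected for it from a query log (selection step assumed)."""
    m = np.asarray(selected_image_feats, dtype=float).mean(axis=0)
    return m / np.linalg.norm(m)

# Hypothetical features of two log images selected for a query:
q = text2image_query([[1.0, 0.0], [0.0, 1.0]])

# Candidate images are then scored against the image-based query vector:
candidates = np.array([[0.7, 0.7], [1.0, -1.0]])
scores = candidates @ q
```

The query thus never needs its own text encoder: it is grounded entirely in the visual features of previously logged images.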
1 code implementation • 5 Sep 2017 • Jianfeng Dong, Xirong Li, Cees G. M. Snoek
This paper strives to find amidst a set of sentences the one best describing the content of a given image or video.
1 code implementation • 15 Aug 2017 • Weiyu Lan, Xirong Li, Jianfeng Dong
The framework comprises a module to automatically estimate the fluency of the sentences and another module to utilize the estimated fluency scores to effectively train an image captioning model for the target language.
1 code implementation • 28 Nov 2016 • Jianfeng Dong, Xiao-Jiao Mao, Chunhua Shen, Yu-Bin Yang
In this paper, we investigate convolutional denoising auto-encoders to show that unsupervised pre-training can still improve the performance of high-level image related tasks such as image classification and semantic segmentation.
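The denoising pre-training objective, reconstructing clean inputs from corrupted ones, can be illustrated with a toy linear auto-encoder trained by gradient descent (a NumPy stand-in for the convolutional version; dimensions, noise level, and learning rate are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples lying on a 1-D manifold embedded in 5-D.
t = rng.normal(size=(200, 1))
X = t @ rng.normal(size=(1, 5))

# Single-hidden-layer linear denoising auto-encoder.
W_enc = rng.normal(scale=0.1, size=(5, 2))
W_dec = rng.normal(scale=0.1, size=(2, 5))

def loss_and_grads(W_enc, W_dec, X, noise_std=0.1):
    """Encode a noise-corrupted input, decode, and compare to the CLEAN
    target; return MSE loss and gradients for both weight matrices."""
    X_noisy = X + rng.normal(scale=noise_std, size=X.shape)
    H = X_noisy @ W_enc
    X_hat = H @ W_dec
    err = X_hat - X
    loss = np.mean(err ** 2)
    g_dec = H.T @ err * (2 / err.size)
    g_enc = X_noisy.T @ (err @ W_dec.T) * (2 / err.size)
    return loss, g_enc, g_dec

lr = 0.01
first_loss = None
for step in range(200):
    loss, g_enc, g_dec = loss_and_grads(W_enc, W_dec, X)
    if first_loss is None:
        first_loss = loss
    W_enc -= lr * g_enc
    W_dec -= lr * g_dec
```

Because the target is the clean input, the model cannot simply copy its (noisy) input and is pushed to capture the underlying structure of the data, which is what makes the learned features useful for downstream tasks.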
no code implementations • 23 Apr 2016 • Jianfeng Dong, Xirong Li, Cees G. M. Snoek
This paper strives to find the sentence best describing the content of an image or video.