1 code implementation • 4 Apr 2024 • Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL).
1 code implementation • 29 Dec 2023 • Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, JianGuo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly.
no code implementations • ICCV 2023 • Baoshuo Kan, Teng Wang, Wenpeng Lu, XianTong Zhen, Weili Guan, Feng Zheng
Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated a great capacity for transfer learning.
1 code implementation • ICCV 2023 • Junjie Fei, Teng Wang, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng
In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios.
1 code implementation • ICCV 2023 • Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, Feng Zheng
Vision-language pre-training (VLP) models have shown vulnerability to adversarial examples in multimodal tasks.
1 code implementation • 26 Jun 2023 • Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan
Art forms such as movies and television (TV) dramas are reflections of the real world and have recently attracted much attention from the multimodal learning community.
1 code implementation • 17 Jun 2023 • Yunlong Tang, Jinrui Zhang, Xiangchen Wang, Teng Wang, Feng Zheng
This paper proposes an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions.
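To make the video-adapter idea above concrete, here is a minimal PyTorch sketch that compresses frozen video features into a fixed set of prefix tokens in an LLM's embedding space; all dimensions, module choices, and names (VideoAdapter, num_prefix) are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VideoAdapter(nn.Module):
    """Compresses frame features into a fixed number of LLM prefix tokens."""

    def __init__(self, video_dim: int = 768, llm_dim: int = 4096,
                 num_prefix: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_prefix, video_dim))
        self.attn = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T, video_dim) features from a frozen video encoder
        q = self.queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, video_feats, video_feats)
        return self.proj(pooled)  # (B, num_prefix, llm_dim) prefix embeddings
```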
1 code implementation • 4 May 2023 • Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao
Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to human intent, $\textit{e.g.}$, looking at specified regions or describing in a particular text style.
1 code implementation • 27 Apr 2023 • Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo
Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.
1 code implementation • CVPR 2023 • Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo
FLM frees the prediction rate from its tie to the corruption rate, while allowing the corruption spans to be customized for each token to be predicted.
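A toy sketch of this decoupling, under the assumption of a simple per-target span corruption (the rates, span size, and function names are illustrative, not FLM's actual recipe): the set of tokens to predict is sampled independently of how much context is corrupted for each prediction.

```python
import random

def flm_style_views(tokens: list[str], predict_rate: float = 0.5,
                    span: int = 2) -> list[tuple[list[str], int]]:
    """For each sampled target position, build its own corrupted view.

    The fraction of predicted tokens (predict_rate) is chosen independently
    of how much context is corrupted (span) for each prediction.
    """
    n = len(tokens)
    targets = [i for i in range(n) if random.random() < predict_rate]
    views = []
    for t in targets:
        corrupted = list(tokens)
        for j in range(max(0, t - span), min(n, t + span + 1)):
            corrupted[j] = "[MASK]"  # corrupt a span around the target only
        views.append((corrupted, t))
    return views

print(flm_style_views("a cat sat on the mat".split()))
```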
1 code implementation • CVPR 2023 • Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng
To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
Ranked #1 on Audio-Visual Event Localization on UnAV-100
1 code implementation • 11 Mar 2023 • Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
1 code implementation • 2 Mar 2023 • Xiaoguang Chang, Teng Wang, Shaowei Cai, Changyin Sun
In addition, representation-level unbiasing strategies give LANDMARK the advantage of compatibility with other methods.
1 code implementation • 25 Sep 2022 • Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng
The existing method performs well at the video segmentation stage but depends on extra cumbersome models and performs poorly at the segment assemblage stage.
1 code implementation • 3 Jul 2022 • Jinrui Zhang, Teng Wang, Feng Zheng, Ran Cheng, Ping Luo
Previous methods process the information of only a single boundary at a time and thus underutilize video context.
1 code implementation • 17 Jun 2022 • Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo
Existing vision-language pre-training (VLP) methods rely primarily on paired image-text datasets, which are either annotated with enormous human labor or crawled from the internet and then subjected to elaborate data cleaning.
no code implementations • 21 Apr 2022 • Teng Wang, Shujuan Fan, Daikun Liu, Changyin Sun
Furthermore, we design a dual-branch Transformer head network to combine image features from multi-scale windows in order to improve details of the global feature representation.
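A minimal PyTorch sketch of one way such a dual-branch head could combine multi-scale window features; the window sizes, cross-scale attention, and class name are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchHead(nn.Module):
    """Pools backbone features over two window sizes and fuses the branches."""

    def __init__(self, dim: int, num_classes: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(2 * dim, num_classes)

    @staticmethod
    def window_tokens(feat: torch.Tensor, k: int) -> torch.Tensor:
        # split (B, C, H, W) into k x k windows, average each into a token
        pooled = F.avg_pool2d(feat, kernel_size=k, stride=k)
        return pooled.flatten(2).transpose(1, 2)  # (B, num_windows, C)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        fine = self.window_tokens(feat, 2)        # small-window tokens
        coarse = self.window_tokens(feat, 4)      # large-window tokens
        fused, _ = self.attn(coarse, fine, fine)  # cross-scale attention
        return self.fc(torch.cat([fused.mean(1), fine.mean(1)], dim=-1))
```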
no code implementations • 13 Apr 2022 • Teng Wang, Zhu Liu, Feng Zheng, Zhichao Lu, Ran Cheng, Ping Luo
This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021.
1 code implementation • 17 Mar 2022 • Xiaoguang Chang, Teng Wang, Changyin Sun, Wenzhe Cai
Scene graph generation is a sophisticated task because there is no specific recognition pattern (e.g., "looking at" and "near" show no conspicuous visual difference, whereas "near" can occur between entities of different morphology).
Ranked #1 on Predicate Classification on Visual Genome (mean Recall@20 metric)
1 code implementation • 5 Jan 2022 • Teng Wang, Bolun Sun, Yijie Tong
In this paper, we propose a method that uses an auxiliary sentence about the aspects a sentence contains to help sentiment prediction.
Aspect-Based Sentiment Analysis (ABSA) +2
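A minimal sketch of the auxiliary-sentence idea, assuming a BERT-style sentence-pair classifier downstream; the aspect set and question template are hypothetical.

```python
ASPECTS = ["food", "service", "price"]  # hypothetical aspect set

def build_auxiliary(aspect: str) -> str:
    """Turn an aspect into a simple auxiliary question sentence."""
    return f"What do you think of the {aspect}?"

def make_pairs(sentence: str) -> list[tuple[str, str]]:
    """Pair the review sentence with one auxiliary sentence per aspect;
    each pair can then be fed to a standard sentence-pair classifier
    (e.g., a BERT-style model) to predict the aspect's sentiment."""
    return [(sentence, build_auxiliary(a)) for a in ASPECTS]

for s, aux in make_pairs("The food was great but the service was slow."):
    print(f"[CLS] {s} [SEP] {aux} [SEP]")
```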
1 code implementation • Pattern Recognition 2021 • Teng Wang, Lin Wu, Changyin Sun
Using the coarse predicted image, we explicitly infer a more accurate dynamic mask to identify both dynamic objects and their shadows, so that the task can be effectively converted into an image inpainting problem.
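An illustrative sketch of the mask-then-inpaint conversion, where a simple per-pixel difference threshold and an off-the-shelf OpenCV inpainter stand in for the paper's learned components.

```python
import numpy as np
import cv2

def dynamic_mask(frame: np.ndarray, coarse_pred: np.ndarray,
                 thresh: float = 25.0) -> np.ndarray:
    """Threshold the per-pixel difference between the input frame and the
    coarse static prediction to flag dynamic regions (both uint8 HxWx3)."""
    diff = np.abs(frame.astype(np.float32) - coarse_pred.astype(np.float32))
    return (diff.mean(axis=2) > thresh).astype(np.uint8) * 255

def remove_dynamic(frame: np.ndarray, coarse_pred: np.ndarray) -> np.ndarray:
    """Inpaint the masked (dynamic) pixels of the frame."""
    mask = dynamic_mask(frame, coarse_pred)
    return cv2.inpaint(frame, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```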
2 code implementations • ICCV 2021 • Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo
Dense video captioning aims to generate multiple associated captions with their temporal locations from the video.
Ranked #5 on Dense Video Captioning on YouCook2
1 code implementation • 23 Jun 2021 • Libo Wang, Rui Li, Dongzhi Wang, Chenxi Duan, Teng Wang, Xiaoliang Meng
Specifically, the dependency path is built upon ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on stacked convolution operations.
Ranked #4 on Semantic Segmentation on UAVid
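As a rough sketch of this two-path layout, the snippet below pairs a stacked-convolution texture path with a coarse self-attention dependency path and fuses them for segmentation; all layer sizes are assumptions, and the attention shown is standard rather than ResT's memory-efficient variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathSeg(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 64, num_classes: int = 8):
        super().__init__()
        self.texture = nn.Sequential(                    # local-detail path
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.embed = nn.Conv2d(in_ch, dim, 8, stride=8)  # coarse token grid
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Conv2d(2 * dim, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tex = self.texture(x)                # (B, D, H, W)
        g = self.embed(x)                    # (B, D, H/8, W/8)
        b, d, h, w = g.shape
        tok = g.flatten(2).transpose(1, 2)   # (B, N, D) coarse tokens
        ctx, _ = self.attn(tok, tok, tok)    # long-range dependencies
        ctx = ctx.transpose(1, 2).reshape(b, d, h, w)
        ctx = F.interpolate(ctx, size=tex.shape[-2:], mode="bilinear",
                            align_corners=False)
        return self.head(torch.cat([tex, ctx], dim=1))
```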
no code implementations • 17 May 2021 • Lin Wu, Teng Wang, Changyin Sun
In this letter, we explore for the first time the use of multi-modal fusion of semantic and visual modalities in a dynamics-invariant space to improve place recognition in dynamic environments.
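A toy sketch of the fusion step, assuming each place is already summarized by one semantic and one visual descriptor; the weighting and cosine matching below are illustrative choices, not the letter's method.

```python
import numpy as np

def fuse(semantic_desc: np.ndarray, visual_desc: np.ndarray,
         alpha: float = 0.5) -> np.ndarray:
    """L2-normalize each modality, then weight and concatenate."""
    s = semantic_desc / (np.linalg.norm(semantic_desc) + 1e-8)
    v = visual_desc / (np.linalg.norm(visual_desc) + 1e-8)
    return np.concatenate([alpha * s, (1.0 - alpha) * v])

def best_match(query: np.ndarray, database: list[np.ndarray]) -> int:
    """Return the index of the most similar fused descriptor (cosine)."""
    sims = [float(query @ d) /
            (np.linalg.norm(query) * np.linalg.norm(d) + 1e-8)
            for d in database]
    return int(np.argmax(sims))
```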
1 code implementation • 21 Jun 2020 • Teng Wang, Huicheng Zheng, Mingjing Yu
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Ranked #5 on Dense Video Captioning on ActivityNet Captions
no code implementations • 19 Apr 2020 • Yang Zhao, Jun Zhao, Mengmeng Yang, Teng Wang, Ning Wang, Lingjuan Lyu, Dusit Niyato, Kwok-Yan Lam
To avoid privacy threats and reduce communication costs, in this paper we propose to integrate federated learning and local differential privacy (LDP) to enable crowdsourcing applications to train machine learning models.
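A minimal sketch of the client-side idea, assuming per-coordinate Laplace perturbation of a clipped local update before upload; the clipping bound, noise scale, and privacy accounting are simplifications, not the paper's protocol.

```python
import numpy as np

def ldp_perturb(update: np.ndarray, epsilon: float, clip: float) -> np.ndarray:
    """Clip each coordinate to [-clip, clip], then add Laplace noise.

    With per-coordinate sensitivity 2*clip, Laplace scale 2*clip/epsilon
    gives epsilon-LDP for each coordinate in isolation (a simplification;
    vector-level accounting would compose across coordinates).
    """
    clipped = np.clip(update, -clip, clip)
    scale = 2.0 * clip / epsilon
    return clipped + np.random.laplace(0.0, scale, size=clipped.shape)
```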
no code implementations • 27 Nov 2019 • Jun Zhao, Teng Wang, Tao Bai, Kwok-Yan Lam, Zhiying Xu, Shuyu Shi, Xuebin Ren, Xinyu Yang, Yang Liu, Han Yu
Although both classical Gaussian mechanisms [1, 2] assume $0 < \epsilon \leq 1$, our review finds that many studies in the literature have used them with values of $\epsilon$ and $\delta$ for which the added noise amounts of [1, 2] do not achieve $(\epsilon,\delta)$-DP.
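For context, the classical Gaussian mechanism calibrates noise as $\sigma = \Delta \sqrt{2\ln(1.25/\delta)} / \epsilon$, an analysis proven only for $0 < \epsilon \leq 1$; the sketch below makes that assumption explicit.

```python
import math
import random

def classical_gaussian_mechanism(value: float, sensitivity: float,
                                 epsilon: float, delta: float) -> float:
    """Add Gaussian noise calibrated by the classical analysis.

    The bound sigma = sensitivity * sqrt(2 * ln(1.25/delta)) / epsilon is
    only proven for 0 < epsilon <= 1; using it with epsilon > 1 (as the
    review above observes many studies do) does not guarantee
    (epsilon, delta)-DP.
    """
    if not (0 < epsilon <= 1):
        raise ValueError("classical analysis requires 0 < epsilon <= 1")
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + random.gauss(0.0, sigma)

print(classical_gaussian_mechanism(10.0, sensitivity=1.0,
                                   epsilon=0.5, delta=1e-5))
```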
no code implementations • 4 Jun 2019 • Teng Wang, Jun Zhao, Han Yu, Jinyan Liu, Xinyu Yang, Xuebin Ren, Shuyu Shi
To investigate such ethical dilemmas, recent studies have adopted preference aggregation, in which each voter expresses her/his preferences over decisions for the possible ethical dilemma scenarios, and a centralized system aggregates these preferences to obtain the winning decision.
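A toy example of centralized preference aggregation using a Borda count over voters' rankings; the ballot format and scoring rule are illustrative, since the paper treats aggregation more generally.

```python
from collections import defaultdict

def borda_winner(ballots: list[list[str]]) -> str:
    """Aggregate ranked ballots: the top choice on an n-item ballot earns
    n-1 points, the next n-2, and so on; the highest total wins."""
    scores = defaultdict(int)
    for ranking in ballots:
        n = len(ranking)
        for pos, decision in enumerate(ranking):
            scores[decision] += n - 1 - pos
    return max(scores, key=scores.get)

ballots = [["swerve", "stay"], ["stay", "swerve"], ["swerve", "stay"]]
print(borda_winner(ballots))  # -> "swerve"
```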
no code implementations • 25 Dec 2018 • Teng Wang, Chao Wang, Xuehai Zhou, Huaping Chen
With the rapid development of deep learning, neural networks and deep learning algorithms have been widely applied in various fields, e.g., image, video, and speech processing.