1 code implementation • ECCV 2020 • Shaoxiang Chen, Yu-Gang Jiang
Temporal Activity Localization via Language (TALL) in video is a recently proposed, challenging vision task. Tackling it requires fine-grained understanding of the video content; however, most existing works overlook this.
1 code implementation • 12 Mar 2024 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
This adaptation leads to convenient development of such LMMs with minimal modifications; however, it overlooks the intrinsic characteristics of diverse visual tasks and hinders the learning of perception capabilities.
no code implementations • 29 Jan 2024 • Shaoxiang Chen, Zequn Jie, Lin Ma
To address this issue, we propose to apply an efficient Mixture of Experts (MoE) design, which is a sparse Mixture of LoRA Experts (MoLE) for instruction finetuning MLLMs.
no code implementations • 13 Dec 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin Ma, Yu-Gang Jiang
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
no code implementations • 6 Jun 2023 • Wenfeng Yan, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang
The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query.
1 code implementation • CVPR 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images (referred to as seeds) into 3D space, and then incorporate 2D semantics via cross-modal interaction or fusion techniques.
no code implementations • 11 Jul 2022 • Shaoxiang Chen, Zequn Jie, Xiaolin Wei, Lin Ma
In this technical report, we introduce our submission to the Waymo 3D Detection leaderboard.
1 code implementation • 10 Mar 2022 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
3D dense captioning is a recently proposed task in which point clouds contain more geometric information than their 2D counterparts.
no code implementations • 23 Sep 2021 • Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
no code implementations • 10 Aug 2021 • Junke Wang, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang
Blind face inpainting refers to the task of reconstructing visual contents without explicitly indicating the corrupted regions in a face image.
no code implementations • CVPR 2021 • Shaoxiang Chen, Yu-Gang Jiang
Dense Event Captioning (DEC) aims to jointly localize and describe multiple events of interest in untrimmed videos, which is an advancement of the conventional video captioning task (generating a single sentence description for a trimmed video).
no code implementations • ICCV 2021 • Shaoxiang Chen, Yu-Gang Jiang
In this paper, we aim to design a spatial information extraction and aggregation method for video captioning without the need for external object detectors.
no code implementations • ECCV 2020 • Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang
Inspired by the fact that there exist cross-modal interactions in the human brain, we propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos and thus improve performance on both tasks.
no code implementations • 10 Apr 2019 • Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey, Yu-Gang Jiang
Using three benchmark video datasets, we demonstrate that V-BAD can craft both untargeted and targeted attacks to fool two state-of-the-art deep video recognition models.
no code implementations • 29 Sep 2018 • Yongyi Tang, Xing Zhang, Jingwen Wang, Shaoxiang Chen, Lin Ma, Yu-Gang Jiang
This paper describes our solution for the 2nd YouTube-8M video understanding challenge organized by Google AI.
no code implementations • 4 Jul 2017 • Shaoxiang Chen, Xi Wang, Yongyi Tang, Xinpeng Chen, Zuxuan Wu, Yu-Gang Jiang
This paper introduces the system we developed for the Google Cloud & YouTube-8M Video Understanding Challenge, which can be considered a multi-label classification problem defined on top of the large-scale YouTube-8M Dataset.