1 code implementation • ECCV 2020 • Shaoxiang Chen, Yu-Gang Jiang
Temporal Activity Localization via Language (TALL) in video is a recently proposed challenging vision task, and tackling it requires fine-grained understanding of the video content, however, this is overlooked by most of the existing works.
no code implementations • 25 Jun 2025 • Jiahao Lin, Weixuan Peng, Bojia Zi, Yifeng Gao, Xianbiao Qi, Xingjun Ma, Yu-Gang Jiang
Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models.
no code implementations • 15 Jun 2025 • Jiaming Zhang, Xin Wang, Xingjun Ma, Lingyu Qiu, Yu-Gang Jiang, Jitao Sang
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces.
no code implementations • 11 Jun 2025 • Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo wang, Xingjun Ma, Yu-Gang Jiang
Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts.
no code implementations • 6 Jun 2025 • Jingshun Huang, Haitao Lin, Tianyu Wang, Yanwei Fu, Yu-Gang Jiang, xiangyang xue
This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks.
1 code implementation • 4 Jun 2025 • Feng Han, Yang Jiao, Shaoxiang Chen, Junhao Xu, Jingjing Chen, Yu-Gang Jiang
The field of controllable image generation has seen significant advancements, with various architectures improving generation layout consistency with control signals.
no code implementations • 25 May 2025 • HUI ZHANG, Dexiang Hong, Maoke Yang, Yutao Chen, Zhao Zhang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment.
no code implementations • 24 May 2025 • Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu-Gang Jiang, Nicu Sebe, DaCheng Tao, Luc van Gool, Xuming Hu
These findings highlight the need for balanced training strategies and model architectures to better integrate multiple modalities in MLLMs.
1 code implementation • 24 May 2025 • Jiayu Wang, Yang Jiao, Yue Yu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation.
no code implementations • 24 May 2025 • Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs).
no code implementations • 17 May 2025 • Yixu Wang, Jiaxin Song, Yifeng Gao, Xin Wang, Yang Yao, Yan Teng, Xingjun Ma, Yingchun Wang, Yu-Gang Jiang
SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning.
1 code implementation • 15 May 2025 • Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling.
no code implementations • 22 Apr 2025 • Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, Yue Liu, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Zhaoxin Fan, Kai Wang, Yi Ding, Donghai Hong, Jiaming Ji, Yingxin Lai, Zitong Yu, Xinfeng Li, Yifan Jiang, Yanhui Li, Xinyu Deng, Junlin Wu, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Qiufeng Wang, Xiaolong Jin, Wenxuan Wang, Dongrui Liu, Yanwei Yue, Wenke Huang, Guancheng Wan, Heng Chang, Tianlin Li, Yi Yu, Chenghao Li, Jiawei Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Jiaheng Zhang, Tianwei Zhang, Xingjun Ma, Jindong Gu, Liang Pang, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Lingjuan Lyu, Yuval Elovici, Bhavya Kailkhura, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, XiaoFeng Wang, DaCheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu
Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e. g., deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs.
1 code implementation • 15 Apr 2025 • Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecure modifications.
no code implementations • 8 Apr 2025 • Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of their resolution independence and editability.
1 code implementation • 6 Apr 2025 • Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks.
1 code implementation • 24 Mar 2025 • Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications.
no code implementations • 12 Mar 2025 • Xinghan Li, Jingjing Chen, Yue Yu, Xue Song, Haijun Shan, Yu-Gang Jiang
Furthermore, we design a new pipeline that pioneers the use of noise patterns, derived from a noise-based imprint extractor, alongside other visual features for AI-generated image detection, resulting in a significant improvement in performance.
no code implementations • 24 Feb 2025 • Mengtian Li, Shengxiang Yao, Chen Kai, Zhifeng Xie, Keyu Chen, Yu-Gang Jiang
Recent advancements in Gaussian-based human body reconstruction have achieved notable success in creating animatable avatars.
no code implementations • 16 Feb 2025 • Ming Xie, Chenjie Cao, Yunuo Cai, xiangyang xue, Yu-Gang Jiang
In this paper, we present a novel Left-Prompt-Guided (LPG) paradigm to address a diverse range of reference-based vision tasks.
1 code implementation • 2 Feb 2025 • Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, DaCheng Tao, James Bailey, Yu-Gang Jiang
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI).
1 code implementation • 2 Jan 2025 • Feng Han, Kai Chen, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image.
no code implementations • 2 Jan 2025 • Teng Li, Xingjun Ma, Yu-Gang Jiang
In this work, we focus on generative approaches for targeted transferable attacks.
no code implementations • 30 Dec 2024 • Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang, Yu-Gang Jiang, Philip H. S. Torr
Dynamic 3D scene representation and novel view synthesis from captured videos are crucial for enabling immersive experiences required by AR/VR and metaverse applications.
no code implementations • 28 Dec 2024 • Zhangxun Li, Mengyang Zhao, Xuan Yang, Yang Liu, Jiamu Sheng, Xinhua Zeng, Tian Wang, Kewei Wu, Yu-Gang Jiang
Within this module, the Spatial-Temporal Fusion Block (STFB) is proposed to fuse the spatial and temporal features into a unified feature space, and the memory bank is utilized to store spatial-temporal prototypes of normal patterns, restricting the model's ability to represent anomalies.
no code implementations • 24 Dec 2024 • Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu
General-purposed embodied agents are designed to understand the users' natural instructions or intentions and act precisely to complete universal tasks.
1 code implementation • 23 Dec 2024 • Yitong Chen, Wenhao Yao, Lingchen Meng, Sihong Wu, Zuxuan Wu, Yu-Gang Jiang
Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection.
Ranked #11 on
Open Vocabulary Object Detection
on LVIS v1.0
(using extra training data)
no code implementations • 5 Dec 2024 • HUI ZHANG, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
To Inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities.
1 code implementation • 4 Dec 2024 • Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Building upon this pipeline, we proposed Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning.
Ranked #3 on
Visual Question Answering
on ViP-Bench
(using extra training data)
no code implementations • 3 Dec 2024 • Junqiu Yu, Xinlin Ren, Yongchong Gu, Haitao Lin, Tianyu Wang, Yi Zhu, Hang Xu, Yu-Gang Jiang, xiangyang xue, Yanwei Fu
Language-guided robotic grasping is a rapidly advancing field where robots are instructed using human language to grasp specific objects.
2 code implementations • 2 Dec 2024 • Zhixiang Wang, Guangnan Ye, Xiaosen Wang, Siheng Chen, Zhibo Wang, Xingjun Ma, Yu-Gang Jiang
However, most existing adversarial patch generation methods prioritize attack effectiveness over stealthiness, resulting in patches that are aesthetically unpleasing.
1 code implementation • 29 Nov 2024 • Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt, which includes analysis and reasoning text to upgrade the image manipulation detection task.
no code implementations • 28 Nov 2024 • Xue Song, Jiequan Cui, Hanwang Zhang, Jiaxin Shi, Jingjing Chen, Chi Zhang, Yu-Gang Jiang
Furthermore, generalizable models for image editing with visual instructions typically require quad data, i. e., a before-after image pair, along with query and target images.
no code implementations • 25 Nov 2024 • Zhiheng Xi, Dingwen Yang, Jixuan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, wei he, Boyang Hong, Shihan Do, WenYu Zhan, Xiao Wang, Rui Zheng, Tao Ji, Xiaowei Shi, Yitao Zhai, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Zuxuan Wu, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang
Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model.
1 code implementation • 24 Nov 2024 • Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang
In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
1 code implementation • 20 Nov 2024 • Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access.
no code implementations • 19 Nov 2024 • Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Parameter-efficient fine-tuning multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features that limit fine-grained detail comprehension, and data conflicts that arise from task complexity.
no code implementations • 13 Nov 2024 • Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients.
no code implementations • 5 Nov 2024 • Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang
In this paper, we propose a novel learning approach based on domain expansion and boundary growth to expand the scarce source samples and enlarge the boundaries across the known classes that indirectly broaden the boundary between the known and unknown classes.
no code implementations • 29 Oct 2024 • Ruofan Wang, Bo wang, Xiaosen Wang, Xingjun Ma, Yu-Gang Jiang
Specifically, IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model.
no code implementations • 28 Oct 2024 • Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang
In this work, we focus on black-box defense for VLMs against jailbreak attacks.
1 code implementation • 27 Oct 2024 • Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3. 1-8B-Base model, with 32K and 128K features.
1 code implementation • 25 Oct 2024 • Yige Li, Hanxun Huang, Jiaming Zhang, Xingjun Ma, Yu-Gang Jiang
Specifically, EBYD first exposes the backdoor functionality in the backdoored model through a model preprocessing step called backdoor exposure, and then applies detection and removal methods to the exposed model to identify and eliminate the backdoor features.
1 code implementation • 13 Oct 2024 • Ye Sun, Hao Zhang, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang
In this work, we exploit the concept of unlearnable examples to make images unusable to model training by generating and adding unlearnable noise into the original images.
no code implementations • 25 Sep 2024 • Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
To effectively instruct an MLLM, in addition to conventional language expressions, the practice of referring to objects by painting with brushes on images has emerged as a prevalent tool (referred to as "referring visual prompts") due to its efficacy in aligning the user's intention with specific image regions.
no code implementations • 11 Sep 2024 • Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei
Technically, FreeEnhance is a two-stage process that firstly adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i. e., Latent Diffusion Models) to denoise and enhance the image details.
no code implementations • 11 Sep 2024 • Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei
In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials.
1 code implementation • 27 Aug 2024 • Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang
In this paper, we aim to investigate whether such priors derived from a generative process are suitable for video recognition, and eventually joint optimization of generation and recognition.
1 code implementation • 11 Aug 2024 • Shuai Zhao, Yongkun Du, Zhineng Chen, Yu-Gang Jiang
Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training.
no code implementations • 10 Aug 2024 • Ziyi Gao, Kai Chen, Zhipeng Wei, Tingshu Mou, Jingjing Chen, Zhiyu Tan, Hao Li, Yu-Gang Jiang
However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos.
no code implementations • 7 Aug 2024 • Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang
As Large Language Models (LLMs) are increasingly being deployed in safety-critical applications, their vulnerability to potential jailbreaks -- malicious prompts that can disable the safety mechanism of LLMs -- has attracted growing research attention.
1 code implementation • 4 Aug 2024 • Xin Wang, Kai Chen, Xingjun Ma, Zhineng Chen, Jingjing Chen, Yu-Gang Jiang
During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space.
no code implementations • 3 Aug 2024 • Weijie Zheng, Xingjun Ma, Hanxun Huang, Zuxuan Wu, Yu-Gang Jiang
With the advancement of vision transformers (ViTs) and self-supervised learning (SSL) techniques, pre-trained large ViTs have become the new foundation models for computer vision applications.
1 code implementation • 17 Jul 2024 • Yongkun Du, Zhineng Chen, Caiyan Jia, Xieping Gao, Yu-Gang Jiang
In this paper, we term this task Out of Length (OOL) text recognition.
1 code implementation • 17 Jul 2024 • Chao Gong, Kai Chen, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
In this work, we introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
no code implementations • 11 Jul 2024 • Mengtian Li, Chengshuo Zhai, Shengxiang Yao, Zhifeng Xie, Keyu Chen, Yu-Gang Jiang
We further demonstrate the versatility and practical utility of "Infinite Motion" through three specific applications: natural language interactive editing, motion sequence editing within long sequences and splicing of independent motion sequences.
no code implementations • 7 Jul 2024 • Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
1 code implementation • 4 Jul 2024 • Qian Feng, Hanbin Zhao, Chao Zhang, Jiahua Dong, Henghui Ding, Yu-Gang Jiang, Hui Qian
Prompt-fixed methods only learn a single set of prompts on one of the incremental tasks and can not handle all the incremental tasks effectively.
1 code implementation • 1 Jul 2024 • Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun
Moreover, 33. 2% of the questions are cross-page questions requiring evidence across multiple pages.
1 code implementation • 20 Jun 2024 • Xincheng Shuai, Henghui Ding, Xingjun Ma, RongCheng Tu, Yu-Gang Jiang, DaCheng Tao
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
no code implementations • 18 Jun 2024 • Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang
In this work, we aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
no code implementations • 17 Jun 2024 • Jiaqi Wang, Yuhang Zang, Pan Zhang, Tao Chu, Yuhang Cao, Zeyi Sun, Ziyu Liu, Xiaoyi Dong, Tong Wu, Dahua Lin, Zeming Chen, Zhi Wang, Lingchen Meng, Wenhao Yao, Jianwei Yang, Sihong Wu, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou, Boning Wang, Jiaqi Huang, Zunnan Xu, Xiu Li, Kehong Yuan, Yanyan Zu, Jiayao Ha, Qiong Gao, Licheng Jiao
2) Open Vocabulary Object Detection: This track goes a step further, requiring algorithms to detect objects from an open set of categories, including unknown objects.
1 code implementation • 13 Jun 2024 • Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang
To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics.
Ranked #11 on
Video Prediction
on Kinetics-600 12 frames, 64x64
3 code implementations • 11 Jun 2024 • Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, Yu-Gang Jiang, Jose M. Alvarez
We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model.
Ranked #6 on
NavSim
on OpenScene
no code implementations • 11 Jun 2024 • Xing Zhang, Jiaxi Gu, Haoyu Zhao, Shicong Wang, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu, Yu-Gang Jiang
Temporal Video Grounding (TVG) aims to localize a moment from an untrimmed video given the language description.
no code implementations • 10 Jun 2024 • Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation.
no code implementations • 6 Jun 2024 • Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang
The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer.
Ranked #19 on
Zero-Shot Video Question Answer
on NExT-QA
1 code implementation • 6 Jun 2024 • Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, wei he, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
Building generalist agents that can handle diverse tasks and evolve themselves across different environments is a long-term goal in the AI community.
1 code implementation • 30 May 2024 • Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing.
1 code implementation • 28 May 2024 • Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang
Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
no code implementations • 25 May 2024 • Yifeng Gao, Yuhua Sun, Xingjun Ma, Zuxuan Wu, Yu-Gang Jiang
This paper presents a novel model protection paradigm ModelLock that locks (destroys) the performance of a model on normal clean data so as to make it unusable or unextractable without the right key.
1 code implementation • 24 May 2024 • Yuankun Yang, Li Zhang, Ziyang Xie, Zhiyuan Yuan, Jianfeng Feng, Xiatian Zhu, Yu-Gang Jiang
Conceptually, we reformulate this task as a {\em fMRI conditioned 3D object generation} problem.
no code implementations • 23 May 2024 • Haoran Chen, Micah Goldblum, Zuxuan Wu, Yu-Gang Jiang
A common problem in continual learning is the classification layer's bias towards the most recent task.
no code implementations • 20 May 2024 • Liuzhi Zhou, Yu He, Kun Zhai, Xiang Liu, Sen Liu, Xingjun Ma, Guangnan Ye, Yu-Gang Jiang, Hongfeng Chai
This comparative analysis revealed that due to the limited information contained within client models from other clients during the initial stages of federated learning, more substantial constraints need to be imposed on the parameters of the adaptive algorithm.
no code implementations • 21 Apr 2024 • Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang
Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity.
no code implementations • 19 Apr 2024 • Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Tianwen Qian, Bin Zhu, Na Zhao, Yu-Gang Jiang
Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge.
no code implementations • 18 Apr 2024 • Kun Zhai, Yifeng Gao, Difan Zou, Guangnan Ye, Siheng Chen, Xingjun Ma, Yu-Gang Jiang
Federated Learning (FL) holds great potential for diverse applications owing to its privacy-preserving nature.
1 code implementation • CVPR 2024 • Yang Luo, Zhineng Chen, Peng Zhou, Zuxuan Wu, Xieping Gao, Yu-Gang Jiang
The results demonstrate that LTRP outperforms both supervised and other self-supervised methods due to the fair assessment of image content.
1 code implementation • CVPR 2024 • Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang
The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.
1 code implementation • 15 Mar 2024 • Pagnarasmey Pit, Xingjun Ma, Mike Conway, Qingyu Chen, James Bailey, Henry Pit, Putrasmey Keo, Watey Diep, Yu-Gang Jiang
Large Language Models (LLMs) have gained significant popularity for their application in various everyday tasks such as text generation, summarization, and information retrieval.
no code implementations • 15 Mar 2024 • Qijun Feng, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang
We introduce GeoGS3D, a novel two-stage framework for reconstructing detailed 3D objects from single-view images.
no code implementations • 12 Mar 2024 • Guoshan Liu, Yang Jiao, Jingjing Chen, Bin Zhu, Yu-Gang Jiang
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
1 code implementation • 12 Mar 2024 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
1 code implementation • CVPR 2024 • Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, Yu-Gang Jiang
Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning.
1 code implementation • 31 Jan 2024 • Yongkun Du, Zhineng Chen, Yuchen Su, Caiyan Jia, Yu-Gang Jiang
We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e. g., character frequency, position, etc.
1 code implementation • 30 Jan 2024 • Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
This technique introduces a fusion network to unify the processing of outputs from different visual experts, while bridging the gap between image encoders and pre-trained LLMs.
Ranked #124 on
Visual Question Answering
on MM-Vet
1 code implementation • 27 Jan 2024 • Yige Li, Jiabo He, Hanxun Huang, Jun Sun, Xingjun Ma, Yu-Gang Jiang
Backdoor attacks have become a significant threat to the pre-training and deployment of deep neural networks (DNNs).
1 code implementation • 11 Jan 2024 • Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data.
no code implementations • 22 Dec 2023 • Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo
In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain.
no code implementations • 13 Dec 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
no code implementations • 30 Nov 2023 • Zhen Xing, Qi Dai, Zihao Zhang, HUI ZHANG, Han Hu, Zuxuan Wu, Yu-Gang Jiang
Our model can edit and translate the desired results within seconds based on user instructions.
1 code implementation • CVPR 2024 • Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance.
1 code implementation • 29 Nov 2023 • Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Qingping Zheng, Zuxuan Wu, Hang Xu, Yu-Gang Jiang
The high-fidelity alignment is developed to further enhance the fidelity of both video generation and editing by taking the subject image as an additional model input.
Ranked #1 on
Video Generation
on MSR-VTT
no code implementations • 24 Nov 2023 • HUI ZHANG, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on textual conditions.
1 code implementation • 24 Nov 2023 • Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang
In-context segmentation aims at segmenting novel images using a few labeled example images, termed as "in-context examples", exploring content similarities between examples and the target.
1 code implementation • 19 Nov 2023 • Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, Jitao Sang
With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities.
2 code implementations • 13 Nov 2023 • Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang
Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data.
Ranked #109 on Visual Question Answering on MM-Vet
1 code implementation • 10 Nov 2023 • Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang
The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in the evaluation of safety.
1 code implementation • 16 Oct 2023 • Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
However, existing surveys mainly focus on diffusion models in the context of image generation, with few up-to-date reviews on their application in the video domain.
1 code implementation • 8 Oct 2023 • Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.
1 code implementation • 7 Sep 2023 • Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei zhang, Yu-Gang Jiang, Hang Xu
Conditioned on an initial video clip with a small number of frames, additional frames are iteratively generated by reusing the original latent features and following the previous diffusion process.
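The autoregressive extension loop described here can be sketched abstractly. The paper's actual method reuses latent features inside a diffusion process; below, `step` is only a hypothetical stand-in for one generation call, and all names and sizes are illustrative:

```python
import numpy as np

def extend_video(init_frames, n_new, step, context_len=4):
    # Autoregressively append frames: each new frame is generated from the
    # most recent context frames, so earlier results are reused as conditioning.
    frames = list(init_frames)
    for _ in range(n_new):
        context = np.stack(frames[-context_len:])
        frames.append(step(context))
    return np.stack(frames)

rng = np.random.default_rng(0)
init = rng.normal(size=(4, 8))            # 4 seed "latent frames" of dimension 8
toy_step = lambda ctx: ctx.mean(axis=0)   # toy generator: average the context
video = extend_video(init, n_new=12, step=toy_step)
print(video.shape)
```

In the real system, `step` would be a conditioned denoising pass; the loop structure is what makes the clip length unbounded.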
no code implementations • CVPR 2024 • Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
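A parameter-efficient adapter of this general kind can be sketched as follows. This is not SimDA's actual architecture, just a minimal residual bottleneck adapter with illustrative dimensions, zero-initialized so it starts as a no-op:

```python
import numpy as np

def gelu(h):
    # tanh approximation of GELU
    return 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))

def adapter_forward(x, W_down, W_up):
    # Residual bottleneck adapter: x + up(gelu(down(x))).
    return x + gelu(x @ W_down) @ W_up

dim, bottleneck = 512, 64
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(dim, bottleneck))
W_up = np.zeros((bottleneck, dim))   # zero init: the adapter is initially an identity map
x = rng.normal(size=(4, dim))
out = adapter_forward(x, W_down, W_up)

adapter_params = W_down.size + W_up.size   # 65,536 trainable parameters here
print(np.allclose(out, x), adapter_params)
```

Only `W_down` and `W_up` would be trained; the frozen backbone (about 1.1B parameters in the paper's setting) is left untouched, which is the sense in which the adaptation is parameter-efficient.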
no code implementations • 14 Aug 2023 • Yilun Zhang, Yuqian Fu, Xingjun Ma, Lizhe Qi, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information.
2 code implementations • 23 Jul 2023 • Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, Yu-Gang Jiang
We first present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
Ranked #1 on Scene Text Recognition on CUTE80 (using extra training data)
2 code implementations • 27 Jun 2023 • Yuchen Su, Zhineng Chen, Zhiwen Shao, Yuning Du, Zhilong Ji, Jinfeng Bai, Yong Zhou, Yu-Gang Jiang
Next, we propose a dual assignment scheme for speed acceleration.
no code implementations • 6 Jun 2023 • Wenfeng Yan, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang
The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query.
1 code implementation • 24 May 2023 • Yige Li, Xixiang Lyu, Xingjun Ma, Nodens Koren, Lingjuan Lyu, Bo Li, Yu-Gang Jiang
Specifically, RNP first unlearns the neurons by maximizing the model's error on a small subset of clean samples and then recovers the neurons by minimizing the model's error on the same data.
1 code implementation • ICCV 2023 • Tianlun Zheng, Zhineng Chen, Bingchen Huang, Wei zhang, Yu-Gang Jiang
In this paper, we propose the Incremental MLTR (IMLTR) task in the context of incremental learning (IL), where different languages are introduced in batches.
Ranked #1 on Incremental Learning on MLT17
1 code implementation • 24 May 2023 • Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues.
1 code implementation • 9 May 2023 • Tianlun Zheng, Zhineng Chen, Jinfeng Bai, Hongtao Xie, Yu-Gang Jiang
In this work, we introduce TPS++, an attention-enhanced TPS transformation that incorporates the attention mechanism into text rectification for the first time.
Ranked #1 on Scene Text Recognition on SVT-P
no code implementations • 27 Apr 2023 • Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang
Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios.
1 code implementation • ICCV 2023 • Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang
While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already provides enough essence without temporal attention.
Ranked #21 on Action Classification on Kinetics-400
no code implementations • 21 Mar 2023 • Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Xiyang Dai, Lu Yuan, Yu-Gang Jiang
Object tracking (OT) aims to estimate the positions of target objects in a video sequence.
2 code implementations • 15 Mar 2023 • HUI ZHANG, Zheng Wang, Dan Zeng, Zuxuan Wu, Yu-Gang Jiang
We introduce DiffusionAD, a novel anomaly detection pipeline comprising a reconstruction sub-network and a segmentation sub-network.
Ranked #1 on Unsupervised Anomaly Detection on DAGM2007
1 code implementation • 13 Mar 2023 • Haoran Chen, Zuxuan Wu, Xintong Han, Menglin Jia, Yu-Gang Jiang
Current research on continual learning mainly focuses on relieving catastrophic forgetting, and much of this success comes at the cost of limiting the performance on newly incoming tasks.
2 code implementations • CVPR 2023 • Yuqian Fu, Yu Xie, Yanwei Fu, Yu-Gang Jiang
Thus, inspired by vanilla adversarial learning, a novel model-agnostic meta Style Adversarial training (StyleAdv) method together with a novel style adversarial attack method is proposed for CD-FSL.
Ranked #1 on Cross-Domain Few-Shot on Plantae
1 code implementation • 1 Feb 2023 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization.
1 code implementation • 3 Jan 2023 • Yanwei Fu, Xiaomei Wang, Hanze Dong, Yu-Gang Jiang, Meng Wang, xiangyang xue, Leonid Sigal
Despite significant progress in object categorization in recent years, a number of important challenges remain, chiefly the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels.
no code implementations • CVPR 2023 • Kexin Sun, Zhineng Chen, Gongwei Wang, Jun Liu, Xiongjun Ye, Yu-Gang Jiang
In order to eliminate the square effect, we design a bi-directional feature fusion generative adversarial network (BFF-GAN) with a global branch and a local branch.
1 code implementation • CVPR 2023 • Jiaming Zhang, Xingjun Ma, Qi Yi, Jitao Sang, Yu-Gang Jiang, YaoWei Wang, Changsheng Xu
Furthermore, we propose to leverage Vision-and-Language Pre-trained Models (VLPMs) like CLIP as the surrogate model to improve the transferability of the crafted UCs to diverse domains.
no code implementations • CVPR 2023 • Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Chuanxin Tang, Xiyang Dai, Yucheng Zhao, Yujia Xie, Lu Yuan, Yu-Gang Jiang
Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
Ranked #1 on Semi-Supervised Video Object Segmentation on Long Video Dataset (using extra training data)
no code implementations • 12 Dec 2022 • Junke Wang, Zhenxin Li, Chao Zhang, Jingjing Chen, Zuxuan Wu, Larry S. Davis, Yu-Gang Jiang
Online media data, in the form of images and videos, are becoming mainstream communication channels.
4 code implementations • CVPR 2023 • Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks.
Ranked #1 on Self-Supervised Action Recognition on HMDB51
no code implementations • CVPR 2023 • HUI ZHANG, Zuxuan Wu, Zheng Wang, Zhineng Chen, Yu-Gang Jiang
Anomaly detection and localization are widely used in industrial manufacturing for their efficiency and effectiveness.
Ranked #4 on Supervised Anomaly Detection on MVTec AD (using extra training data)
1 code implementation • CVPR 2023 • Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang
We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of mostly unseen testing resolutions.
no code implementations • 29 Nov 2022 • Huiyan Qi, Lechao Cheng, Jingjing Chen, Yue Yu, Xue Song, Zunlei Feng, Yu-Gang Jiang
Transfer learning aims to improve the performance of target tasks by transferring knowledge acquired in source tasks.
1 code implementation • CVPR 2023 • Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
In this paper, we investigate the use of transformer models under the SSL setting for action recognition.
1 code implementation • 11 Oct 2022 • Linhai Zhuo, Yuqian Fu, Jingjing Chen, Yixin Cao, Yu-Gang Jiang
The proposed TGDM framework contains a Mixup-3T network for learning classifiers and a dynamic ratio generation network (DRGN) for learning the optimal mix ratio.
1 code implementation • 11 Oct 2022 • Yuqian Fu, Yu Xie, Yanwei Fu, Jingjing Chen, Yu-Gang Jiang
Concretely, to solve the data imbalance problem between the source data with sufficient examples and the auxiliary target data with limited examples, we build our model under the umbrella of multi-expert learning.
no code implementations • 6 Oct 2022 • Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang
Specifically, appearance and motion components are provided by the image and caption separately.
no code implementations • 5 Oct 2022 • Tianwen Qian, Ran Cui, Jingjing Chen, Pai Peng, Xiaowei Guo, Yu-Gang Jiang
Considering the fact that the question often remains concentrated in a short temporal range, we propose to first locate the question to a segment in the video and then infer the answer using the located segment only.
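The locate-then-answer idea can be sketched as a relevance-scored window search over frame features. The feature construction below is synthetic and all names are illustrative, not the paper's actual grounding module:

```python
import numpy as np

def locate_segment(frame_feats, question_feat, window=4):
    # Cosine relevance of each frame to the question, then pick the best
    # contiguous window by summed relevance.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    scores = f @ q
    sums = np.convolve(scores, np.ones(window), mode="valid")
    start = int(np.argmax(sums))
    return start, start + window

rng = np.random.default_rng(0)
T, d = 20, 16
q_dir = rng.normal(size=d)
frames = 0.1 * rng.normal(size=(T, d))
frames[7:11] += q_dir            # make frames 7..10 relevant by construction
start, end = locate_segment(frames, q_dir)
print(start, end)                # the answering stage then uses frames[start:end] only
```

Restricting the answerer to the located segment is what keeps the inference focused on the short temporal range where the question is concentrated.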
1 code implementation • NeurIPS 2023 • Haoran Chen, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
Most existing methods for unsupervised domain adaptation (UDA) rely on a shared network to extract domain-invariant features.
Multi-Source Unsupervised Domain Adaptation • Prompt Learning
1 code implementation • 30 Sep 2022 • Zhen Xing, Hengduo Li, Zuxuan Wu, Yu-Gang Jiang
In particular, we introduce an attention-guided prototype shape prior module for guiding realistic object reconstruction.
no code implementations • 15 Sep 2022 • Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan
This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
Ranked #4 on Cross-Modal Retrieval on Flickr30k (using extra training data)
1 code implementation • CVPR 2023 • Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
Our new attack method is proposed based on the observation that highly universal adversarial perturbations tend to be more transferable for targeted attacks.
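Crafting a single input-agnostic (universal) perturbation toward a chosen target class can be sketched on a toy linear classifier. This is a generic illustration of the concept the observation rests on, not the paper's attack; the model, sizes, and step sizes are all assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32))          # toy 10-class linear classifier
X = rng.normal(size=(16, 32))          # batch the perturbation must fool
target = 3
onehot = np.eye(10)[target]

delta = np.zeros(32)                   # one perturbation shared by all inputs
for _ in range(100):
    p = softmax((X + delta) @ W.T)
    g = ((p - onehot) @ W).mean(axis=0)          # grad of mean targeted cross-entropy
    delta = np.clip(delta - 0.1 * g, -0.5, 0.5)  # projected step in an L-inf ball

before = ((X @ W.T).argmax(axis=1) == target).mean()
after = ((X + delta) @ W.T).argmax(axis=1) == target
print(before, after.mean())
```

Because `delta` is optimized over the whole batch at once, it captures directions that fool many inputs simultaneously, the "universal" property the snippet links to targeted transferability.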
1 code implementation • CVPR 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images (referred to as seeds) into 3D space, and then incorporating 2D semantics via cross-modal interaction or fusion techniques.
no code implementations • 25 Aug 2022 • Rui Wang, Zuxuan Wu, Dongdong Chen, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Luowei Zhou, Lu Yuan, Yu-Gang Jiang
To avoid the significant computational cost incurred by computing self-attention between the large number of local patches in videos, we propose to use very few global tokens (e.g., 6) for a whole video in Transformers to exchange information with 3D-CNNs via a cross-attention mechanism.
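The asymmetry that makes this cheap is easy to see in a minimal cross-attention sketch: a handful of global query tokens attend to thousands of local patch features, so the attention matrix is G×N rather than N×N. The token counts and dimensions below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # queries: (G, d) global tokens; keys_values: (N, d) local patch features.
    # Attention weights have shape (G, N), i.e. cost O(G*N) instead of O(N*N).
    d = queries.shape[1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

rng = np.random.default_rng(0)
G, N, d = 6, 1568, 64                 # 6 global tokens vs. ~1.5k patch tokens
global_tokens = rng.normal(size=(G, d))
patches = rng.normal(size=(N, d))

updated = cross_attention(global_tokens, patches)
print(updated.shape)
```

The updated global tokens then carry video-level context back to the other branch, which is the information-exchange role described above.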
1 code implementation • CVPR 2022 • Jianggang Zhu, Zheng Wang, Jingjing Chen, Yi-Ping Phoebe Chen, Yu-Gang Jiang
In this paper, we focus on representation learning for imbalanced data.
1 code implementation • 30 Jun 2022 • Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, Yu-Gang Jiang
3D object detection in autonomous driving aims to reason "what" and "where" the objects of interest present in a 3D world.
Ranked #2 on Robust Camera Only 3D Object Detection on nuScenes-C
no code implementations • CVPR 2023 • Lingchen Meng, Xiyang Dai, Yinpeng Chen, Pengchuan Zhang, Dongdong Chen, Mengchen Liu, JianFeng Wang, Zuxuan Wu, Lu Yuan, Yu-Gang Jiang
Detection Hub further achieves SoTA performance on the UODB benchmark with a wide variety of datasets.
4 code implementations • 30 Apr 2022 • Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, Yu-Gang Jiang
Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription.
Ranked #16 on Scene Text Recognition on ICDAR2013
no code implementations • 26 Apr 2022 • Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu-Gang Jiang
With Vision Transformers (ViTs) making great advances in a variety of computer vision tasks, the recent literature has proposed various variants of vanilla ViTs to achieve better efficiency and efficacy.
1 code implementation • 26 Apr 2022 • Zixuan Su, Hao Zhang, Jingjing Chen, Lei Pang, Chong-Wah Ngo, Yu-Gang Jiang
Neural networks for visual content understanding have recently evolved from convolutional ones (CNNs) to transformers.
1 code implementation • 20 Apr 2022 • Ran Cui, Tianwen Qian, Pai Peng, Elena Daskalaki, Jingjing Chen, Xiaowei Guo, Huyang Sun, Yu-Gang Jiang
Weakly supervised methods only rely on the paired video and query, but the performance is relatively poor.
no code implementations • CVPR 2022 • Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, Yu-Gang Jiang
Recent advances in image editing techniques have posed serious challenges to the trustworthiness of multimedia data, which drives the research of image tampering detection.
1 code implementation • 15 Mar 2022 • Yuqian Fu, Yu Xie, Yanwei Fu, Jingjing Chen, Yu-Gang Jiang
The key challenge of CD-FSL lies in the huge data shift between source and target domains, which is typically in the form of totally different visual styles.
Ranked #3 on Cross-Domain Few-Shot on CUB
no code implementations • 10 Mar 2022 • Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Recently, one-stage visual grounders have attracted considerable attention due to their comparable accuracy but significantly higher efficiency than two-stage grounders.
1 code implementation • 10 Mar 2022 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
3D dense captioning is a recently proposed task, where point clouds contain more geometric information than their 2D counterpart.
Ranked #7 on 3D dense captioning on ScanRefer Dataset
no code implementations • CVPR 2022 • Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
This paper investigates the transferability of adversarial perturbation across different modalities, i.e., leveraging adversarial perturbation generated on white-box image models to attack black-box video models.
no code implementations • 10 Dec 2021 • Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model, and a feasible way to improve both tasks is to use more data.
1 code implementation • CVPR 2022 • Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, Lu Yuan
This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often computationally intensive if trained from scratch; 2) discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations.
Ranked #9 on Action Recognition on Diving-48
1 code implementation • CVPR 2022 • Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, Ser-Nam Lim
To this end, we introduce AdaViT, an adaptive computation framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use throughout the backbone on a per-input basis, aiming to improve inference efficiency of vision transformers with a minimal drop of accuracy for image recognition.
1 code implementation • 23 Nov 2021 • Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang
Video transformers have achieved impressive results on major video recognition benchmarks, but suffer from high computational cost.
3 code implementations • 22 Nov 2021 • Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, Yu-Gang Jiang
In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding.
Ranked #12 on Scene Text Recognition on ICDAR2015
1 code implementation • 22 Nov 2021 • Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang
Surprisingly, we show Vision Transformers perform significantly worse than Convolutional Neural Networks when only a small set of labeled data is available.
1 code implementation • 29 Oct 2021 • Kai Chen, Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
On both UCF-101 and HMDB-51 datasets, our BSC attack method can achieve about a 90% fooling rate when attacking three mainstream video recognition models, while occluding less than 8% of the area in the video.
Adversarial Attack • Adversarial Attack on Video Classification
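The area budget of such occlusion attacks is straightforward to check with a boolean mask. The strip sizes below are illustrative placements, not the paper's actual bullet-screen comment layout:

```python
import numpy as np

def occlusion_fraction(h, w, boxes):
    # boxes: list of (top, left, height, width) occluders; overlaps counted once.
    mask = np.zeros((h, w), dtype=bool)
    for t, l, bh, bw in boxes:
        mask[t:t + bh, l:l + bw] = True
    return mask.mean()

# Three thin horizontal "comment" strips on a 224x224 frame (illustrative sizes).
boxes = [(30, 10, 12, 90), (100, 60, 12, 90), (170, 20, 12, 90)]
frac = occlusion_fraction(224, 224, boxes)
print(round(frac * 100, 2), "% of the frame occluded")
```

Keeping this fraction under the 8% budget is what makes the occlusions plausible as ordinary on-screen comments rather than obvious tampering.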
1 code implementation • 18 Oct 2021 • Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
To this end, we propose to boost the transferability of video adversarial examples for black-box attacks on video recognition models.
1 code implementation • 9 Oct 2021 • Yang Jiao, Zequn Jie, Weixin Luo, Jingjing Chen, Yu-Gang Jiang, Xiaolin Wei, Lin Ma
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred by one given natural language expression.
no code implementations • 23 Sep 2021 • Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
2 code implementations • 9 Sep 2021 • Zhipeng Wei, Jingjing Chen, Micah Goldblum, Zuxuan Wu, Tom Goldstein, Yu-Gang Jiang
We evaluate the transferability of attacks on state-of-the-art ViTs, CNNs and robustly trained CNNs.
no code implementations • 29 Aug 2021 • Zejia Weng, Lingchen Meng, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
There is a growing trend in placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively.
1 code implementation • ICCV 2021 • Bojia Zi, Shihao Zhao, Xingjun Ma, Yu-Gang Jiang
We empirically demonstrate the effectiveness of our RSLAD approach over existing adversarial training and distillation methods in improving the robustness of small models against state-of-the-art attacks including the AutoAttack.
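The core distillation signal can be sketched as a KL divergence between the teacher's soft labels and the student's predictions on adversarial inputs. The random logits below are placeholders and this is only one term of a loss in the RSLAD spirit, not the full objective:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q):
    # KL(p || q), summed over classes and averaged over the batch.
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 10))      # robust teacher on clean inputs
student_adv_logits = rng.normal(size=(8, 10))  # student on adversarial inputs

loss = kl_div(softmax(teacher_logits), softmax(student_adv_logits))
print(loss)
# Minimizing this pulls the student's adversarial predictions toward the
# teacher's soft labels instead of hard one-hot ground truth.
```

Using soft labels from a robust teacher, rather than hard labels, is what transfers the teacher's robustness to the small student.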
no code implementations • 10 Aug 2021 • Junke Wang, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang
Blind face inpainting refers to the task of reconstructing visual contents without explicitly indicating the corrupted regions in a face image.
1 code implementation • 26 Jul 2021 • Yuqian Fu, Yanwei Fu, Yu-Gang Jiang
Secondly, a novel disentangle module together with a domain classifier is proposed to extract the disentangled domain-irrelevant and domain-specific features.
no code implementations • 25 Jul 2021 • Yuqian Fu, Yanwei Fu, Yu-Gang Jiang
To achieve this, we propose a novel Mesh-based Video Action Imitation (M-VAI) method.
no code implementations • CVPR 2021 • Shaoxiang Chen, Yu-Gang Jiang
Dense Event Captioning (DEC) aims to jointly localize and describe multiple events of interest in untrimmed videos, which is an advancement of the conventional video captioning task (generating a single sentence description for a trimmed video).
1 code implementation • 10 Jun 2021 • Rui Wang, Zuxuan Wu, Zejia Weng, Jingjing Chen, Guo-Jun Qi, Yu-Gang Jiang
Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a fully-labeled source domain to a different unlabeled target domain.
1 code implementation • ICCV 2021 • Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang Jiang, Larry Davis
In this paper, we introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition.
no code implementations • 20 Apr 2021 • Zejia Weng, Zuxuan Wu, Hengduo Li, Jingjing Chen, Yu-Gang Jiang
Conventional video recognition pipelines typically fuse multimodal features for improved performance.
3 code implementations • 20 Apr 2021 • Junke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong Han, Jingjing Chen, Ser-Nam Lim, Yu-Gang Jiang
The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images.
no code implementations • 18 Jan 2021 • Shihao Zhao, Xingjun Ma, Yisen Wang, James Bailey, Bo Li, Yu-Gang Jiang
In this paper, we focus on image classification and propose a method to visualize and understand the class-wise knowledge (patterns) learned by DNNs under three different settings including natural, backdoor and adversarial.
1 code implementation • 5 Jan 2021 • Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, Yu-Gang Jiang
WildDeepfake is a small dataset that can be used, in addition to existing datasets, to develop and test the effectiveness of deepfake detectors against real-world deepfakes.
no code implementations • ICCV 2021 • Shaoxiang Chen, Yu-Gang Jiang
In this paper, we aim at designing a spatial information extraction and aggregation method for video captioning without the need for external object detectors.
no code implementations • 31 Dec 2020 • Zhi-Qin Zhan, Huazhu Fu, Yan-Yao Yang, Jingjing Chen, Jie Liu, Yu-Gang Jiang
However, there are several issues between the image-based training and video-based inference, including domain differences, lack of positive samples, and temporal smoothness.
1 code implementation • 20 Oct 2020 • Yuqian Fu, Li Zhang, Junke Wang, Yanwei Fu, Yu-Gang Jiang
Humans can easily recognize actions with only a few examples given, while existing video recognition models still rely heavily on large-scale labeled data.
Ranked #2 on Few Shot Action Recognition on Kinetics-100
no code implementations • 28 Sep 2020 • Linxi Jiang, Xingjun Ma, Zejia Weng, James Bailey, Yu-Gang Jiang
Evaluating the robustness of a defense model is a challenging task in adversarial robustness research.
no code implementations • 20 Aug 2020 • Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yu-Gang Jiang, Tat-Seng Chua
Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing the temporal workflow of the recipe.
no code implementations • ECCV 2020 • Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang
Inspired by the fact that there exist cross-modal interactions in the human brain, we propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos and thus improve performances on both tasks.
1 code implementation • 24 Jun 2020 • Xingjun Ma, Linxi Jiang, Hanxun Huang, Zejia Weng, James Bailey, Yu-Gang Jiang
Evaluating the robustness of a defense model is a challenging task in adversarial robustness research.
no code implementations • 26 May 2020 • Xuelin Qian, Wenxuan Wang, Li Zhang, Fangrui Zhu, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, xiangyang xue
Specifically, we consider that under clothing changes, soft biometrics such as body shape would be more reliable.
1 code implementation • CVPR 2020 • Hangyu Lin, Yanwei Fu, Yu-Gang Jiang, xiangyang xue
Unfortunately, the representation learned by SketchRNN is primarily for the generation tasks, rather than the other tasks of recognition and retrieval of sketches.
1 code implementation • CVPR 2020 • Shihao Zhao, Xingjun Ma, Xiang Zheng, James Bailey, Jingjing Chen, Yu-Gang Jiang
We propose the use of a universal adversarial trigger as the backdoor trigger to attack video recognition models, a situation where backdoor attacks are likely to be challenged by the above 4 strict conditions.
no code implementations • 17 Jan 2020 • Wenxuan Wang, Yanwei Fu, Qiang Sun, Tao Chen, Chenjie Cao, Ziqi Zheng, Guoqiang Xu, Han Qiu, Yu-Gang Jiang, xiangyang xue
Since uneven data distribution and a lack of samples are common in real-world scenarios, we further evaluate several few-shot expression learning tasks on our F2ED, which recognize facial expressions given only a few training instances.
Facial Expression Recognition (FER)
no code implementations • NeurIPS 2019 • Zuxuan Wu, Caiming Xiong, Yu-Gang Jiang, Larry S. Davis
This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource efficient video recognition, suitable for both online and offline scenarios.
1 code implementation • 21 Nov 2019 • Zhipeng Wei, Jingjing Chen, Xingxing Wei, Linxi Jiang, Tat-Seng Chua, Fengfeng Zhou, Yu-Gang Jiang
To overcome this challenge, we propose a heuristic black-box attack model that generates adversarial perturbations only on the selected frames and regions.
no code implementations • 25 Sep 2019 • Qiang Sun, Zhinan Cheng, Yanwei Fu, Wenxuan Wang, Yu-Gang Jiang, xiangyang xue
Instead of learning the cross features directly, DeepEnFM adopts the Transformer encoder as a backbone to align the feature embeddings with the clues of other fields.
no code implementations • 10 Apr 2019 • Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey, Yu-Gang Jiang
Using three benchmark video datasets, we demonstrate that V-BAD can craft both untargeted and targeted attacks to fool two state-of-the-art deep video recognition models.
1 code implementation • 21 Dec 2018 • Guoyun Tu, Yanwei Fu, Boyang Li, Jiarui Gao, Yu-Gang Jiang, xiangyang xue
However, the sparsity of emotional expressions in the videos poses an obstacle to visual emotion analysis.
no code implementations • 28 Nov 2018 • Peng Lu, Hangyu Lin, Yanwei Fu, Shaogang Gong, Yu-Gang Jiang, xiangyang xue
Additionally, to study the task of sketch-based hairstyle retrieval, this paper contributes a new instance-level photo-sketch dataset, the Hairstyle Photo-Sketch dataset, which is composed of 3600 sketches and photos and 2400 sketch-photo pairs.
no code implementations • 16 Nov 2018 • You Qiaoben, Zheng Wang, Jianguo Li, Yinpeng Dong, Yu-Gang Jiang, Jun Zhu
Binary neural networks have great resource and computing efficiency, but suffer from long training procedures and non-negligible accuracy drops compared to their full-precision counterparts.
no code implementations • 29 Sep 2018 • Yongyi Tang, Xing Zhang, Jingwen Wang, Shaoxiang Chen, Lin Ma, Yu-Gang Jiang
This paper describes our solution for the 2nd YouTube-8M video understanding challenge organized by Google AI.
1 code implementation • 25 Sep 2018 • Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, xiangyang xue
Thus, a better solution to handle these critical problems is to train object detectors from scratch, which motivates our proposed method.
3 code implementations • 19 Sep 2018 • Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, Tat-Seng Chua
As such, the key to an item-based CF method is in the estimation of item similarities.
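The scoring rule that makes item similarities central can be sketched directly: an item-based CF method predicts a user's affinity for a candidate item by aggregating its similarities to the items in the user's history. The toy similarity matrix below is illustrative; the paper's contribution is *learning* these similarities (with attention in NAIS), which this sketch does not do:

```python
import numpy as np

def score(user_history, target, sim):
    # Item-based CF prediction: sum of similarities between the target item
    # and each item the user has interacted with.
    return float(sim[target, user_history].sum())

# Toy 4-item similarity matrix (symmetric, ones on the diagonal).
sim = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])
history = [0, 1]                 # the user consumed items 0 and 1
candidates = [2, 3]
scores = {c: score(history, c, sim) for c in candidates}
print(scores)                    # item 2 outranks item 3 for this user
```

Better similarity estimates reorder these sums, which is why similarity estimation is described as the key to the whole method family.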
no code implementations • ECCV 2018 • Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, Tong Zhang
In this paper, in order to exploit the complementary information from multiple encoders, we propose a novel Recurrent Fusion Network (RFNet) for tackling image captioning.
no code implementations • ECCV 2018 • Minjun Li, Hao-Zhi Huang, Lin Ma, Wei Liu, Tong Zhang, Yu-Gang Jiang
Recent studies on unsupervised image-to-image translation have made remarkable progress by training a pair of generative adversarial networks with a cycle-consistent loss.
no code implementations • ACL 2018 • Minlong Peng, Qi Zhang, Yu-Gang Jiang, Xuanjing Huang
We also introduce a small amount of labeled target-domain data for learning domain-specific information.
1 code implementation • 15 Apr 2018 • Zitian Chen, Yanwei Fu, yinda zhang, Yu-Gang Jiang, xiangyang xue, Leonid Sigal
In semantic space, we search for related concepts, which are then projected back into the image feature spaces by the decoder portion of the TriNet.
no code implementations • 12 Apr 2018 • Jinhui Tang, Xiangbo Shu, Zechao Li, Yu-Gang Jiang, Qi Tian
Recent approaches simultaneously explore visual, user and tag information to improve the performance of image retagging by constructing and exploring an image-tag-user graph.
5 code implementations • ECCV 2018 • Nanyang Wang, yinda zhang, Zhuwen Li, Yanwei Fu, Wei Liu, Yu-Gang Jiang
We propose an end-to-end deep learning architecture that produces a 3D shape in triangular mesh from a single color image.
Ranked #3 on 3D Object Reconstruction on Data3D−R2N2 (Avg F1 metric)
1 code implementation • 8 Feb 2018 • Chengming Xu, Yanwei Fu, Bing Zhang, Zitian Chen, Yu-Gang Jiang, xiangyang xue
This paper targets learning to score figure skating sports videos.