1 code implementation • 10 Mar 2025 • Ouxiang Li, YuAn Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations.
no code implementations • 31 Jan 2025 • Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, Yanbin Hao
Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next. However, they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, so the quality of generated content declines sharply as caching intensity increases.
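A minimal sketch of the reuse-and-skip caching idea is given below; it only illustrates timestep-level feature reuse in a diffusion transformer, and the wrapper class, the fixed `cache_every` schedule, and the sampler placeholder are assumptions, not the paper's actual module-selection strategy.

```python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Wraps a transformer block and optionally reuses its last output."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.cache = None

    def forward(self, x: torch.Tensor, reuse: bool) -> torch.Tensor:
        if reuse and self.cache is not None:
            return self.cache          # skip computation, reuse previous-step features
        out = self.block(x)
        self.cache = out.detach()      # store features for the next timestep
        return out

def denoise(blocks, x, timesteps, cache_every=2):
    """Recompute on every `cache_every`-th step and reuse cached features otherwise."""
    for i, t in enumerate(timesteps):
        reuse = (i % cache_every != 0)
        h = x
        for blk in blocks:
            h = blk(h, reuse=reuse)
        x = h  # stand-in for the actual sampler update at step t
    return x
```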
no code implementations • 15 Jan 2025 • YuAn Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang
These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images.
1 code implementation • 9 Dec 2024 • YuAn Wang, Ouxiang Li, Tingting Mu, Yanbin Hao, Kuien Liu, Xiang Wang, Xiangnan He
The success of text-to-image generation enabled by diffusion models has imposed an urgent need to erase unwanted concepts, e.g., copyrighted, offensive, and unsafe ones, from the pre-trained models in a precise, timely, and low-cost manner.
no code implementations • 25 Oct 2024 • Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
no code implementations • 28 Aug 2024 • Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao
The training of Hand1000 is divided into three stages, with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representations.
1 code implementation • 13 Aug 2024 • Ouxiang Li, Jiayin Cai, Yanbin Hao, XiaoLong Jiang, Yao Hu, Fuli Feng
In this paper, we re-examine the SID problem and identify two prevalent biases in current training paradigms, i.e., weakened artifact features and overfitted artifact features.
1 code implementation • 24 Jul 2024 • Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang
Vision-language models such as CLIP are capable of mapping the different modality data into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts.
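A minimal zero-shot inference sketch with the open-source OpenAI `clip` package illustrates this similarity-based matching; the image path, class names, and prompt template are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
classes = ["cat", "dog", "car"]
texts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    # Cosine similarity in the shared feature space serves as zero-shot logits.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```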
1 code implementation • 19 Jul 2024 • Jinda Lu, Shuo Wang, Yanbin Hao, Haifeng Liu, Xiang Wang, Meng Wang
However, these adaptation methods usually operate on the global view of an input image and thus bias the perception of local details of the image.
1 code implementation • 16 Jul 2024 • Ouxiang Li, Yanbin Hao, Zhicai Wang, Bin Zhu, Shuo Wang, Zaixi Zhang, Fuli Feng
To alleviate these issues, leveraging diffusion models' remarkable synthesis capabilities, we propose Diffusion-based Model Inversion (Diff-MI) attacks.
1 code implementation • 3 Jul 2024 • Yanbin Hao, Diansong Zhou, Zhicai Wang, Chong-Wah Ngo, Meng Wang
In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks.
1 code implementation • 27 Jun 2024 • Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, XiaoLong Jiang, Yao Hu, Weidi Xie
This effectively enables the model to discern AI-generated images based on semantics or contextual information. Secondly, we select the highest-frequency and lowest-frequency patches in the image and compute their low-level patch-wise features, aiming to detect AI-generated images from low-level artifacts such as noise patterns and anti-aliasing.
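A rough sketch of how patches could be ranked by frequency energy and the extremes kept is shown below; the FFT-based energy measure, patch size, and cutoff are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def patch_frequency_energy(img: np.ndarray, patch: int = 32):
    """img: HxW grayscale array; returns high-frequency energy per patch."""
    H, W = img.shape
    energies, coords = [], []
    yy, xx = np.mgrid[0:patch, 0:patch]
    radius = np.sqrt((yy - patch // 2) ** 2 + (xx - patch // 2) ** 2)
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            spec = np.abs(np.fft.fftshift(np.fft.fft2(img[y:y + patch, x:x + patch])))
            # Energy outside a small central disc approximates high-frequency content.
            energies.append(spec[radius > patch // 8].sum())
            coords.append((y, x))
    return np.array(energies), coords

def select_extreme_patches(img: np.ndarray, k: int = 8, patch: int = 32):
    energies, coords = patch_frequency_energy(img, patch)
    order = np.argsort(energies)
    low = [coords[i] for i in order[:k]]     # smoothest (lowest-frequency) patches
    high = [coords[i] for i in order[-k:]]   # most textured (highest-frequency) patches
    return low, high
```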
1 code implementation • 6 May 2024 • Haihong Hao, Shuo Wang, Huixia Ben, Yanbin Hao, Yansong Wang, Weiwei Wang
Specifically, we first process ME video frames and special frames or data in parallel with our cascaded Unimodal Space-Time Attention (USTA) to establish connections between subtle facial movements and specific facial areas.
Micro Expression Recognition
Micro-Expression Recognition
1 code implementation • CVPR 2024 • Zhicai Wang, Longhui Wei, Tan Wang, Heyu Chen, Yanbin Hao, Xiang Wang, Xiangnan He, Qi Tian
Text-to-image (T2I) generative models have recently emerged as a powerful tool, enabling the creation of photo-realistic images and giving rise to a multitude of applications.
no code implementations • 23 Mar 2024 • Xingyu Zhu, Shuo Wang, Jinda Lu, Yanbin Hao, Haifeng Liu, Xiangnan He
Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor.
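A minimal sketch of the underlying mixing operation, assuming a standard mixup-style blend with a Beta-distributed factor; the manifold-regularized variant studied in the paper is more involved.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 0.5, num_classes: int = 10):
    """Blend two samples (x1, y1) and (x2, y2) with a Beta-distributed factor."""
    lam = np.random.beta(alpha, alpha)   # blending factor in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2      # mixed input
    y = np.zeros(num_classes)
    y[y1] += lam                         # soft label follows the same mix
    y[y2] += 1.0 - lam
    return x, y
```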
no code implementations • 30 Jan 2024 • Pengyuan Zhou, Lin Wang, Zhi Liu, Yanbin Hao, Pan Hui, Sasu Tarkoma, Jussi Kangasharju
This paper offers an insightful examination of how currently top-trending AI technologies, i.e., generative artificial intelligence (Generative AI) and large language models (LLMs), are reshaping the field of video technology, including video generation, understanding, and streaming.
no code implementations • 2 Jan 2024 • Qinglong Huang, Haoran Li, Yong Liao, Yanbin Hao, Pengyuan Zhou
Neural Radiance Field (NeRF) has been proposed as an innovative advancement in 3D reconstruction techniques.
1 code implementation • 8 Dec 2023 • Fangzhou Song, Bin Zhu, Yanbin Hao, Shuo Wang
Leveraging the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to their counterparts.
no code implementations • 18 Nov 2023 • Haoran Li, Long Ma, Haolin Shi, Yanbin Hao, Yong Liao, Lechao Cheng, Pengyuan Zhou
First, we segment the objects and the background in a multi-object image.
1 code implementation • 18 Sep 2023 • Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Tao Mei
In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos.
1 code implementation • 23 Aug 2023 • Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, Xiangnan He
Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance.
1 code implementation • 15 May 2023 • Fangwen Wu, Jingxuan He, Yufei Yin, Yanbin Hao, Gang Huang, Lechao Cheng
This study introduces an efficacious approach, Masked Collaborative Contrast (MCC), to highlight semantic regions in weakly supervised semantic segmentation.
Contrastive Learning
Weakly supervised Semantic Segmentation
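To make the contrastive component concrete, a generic InfoNCE-style loss is sketched below; it does not reproduce MCC's masking strategy or its pair construction.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07):
    """anchor, positive: (N, D) embeddings; the i-th rows form a positive pair."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                      # (N, N) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```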
no code implementations • 17 Mar 2023 • Haoran Li, Pengyuan Zhou, Yihang Lin, Yanbin Hao, Haiyong Xie, Yong Liao
Video prediction is a complex time-series forecasting task with great potential in many use cases.
1 code implementation • CVPR 2023 • Zhicai Wang, Yanbin Hao, Tingting Mu, Ouxiang Li, Shuo Wang, Xiangnan He
It is well-known that zero-shot learning (ZSL) can suffer severely from the problem of domain shift, where the true and learned data distributions for the unseen classes do not match.
1 code implementation • CVPR 2023 • Zhenhua Tang, Zhaofan Qiu, Yanbin Hao, Richang Hong, Ting Yao
On this basis, we devise STCFormer by stacking multiple STC blocks and further integrate a new Structure-enhanced Positional Embedding (SPE) into STCFormer to take the structure of human body into consideration.
Ranked #8 on 3D Human Pose Estimation on MPI-INF-3DHP
1 code implementation • 15 Jul 2022 • Zhicai Wang, Yanbin Hao, Xingyu Gao, Hao Zhang, Shuo Wang, Tingting Mu, Xiangnan He
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
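A minimal MLP-Mixer-style block illustrates what a token-mixing layer does; the dimensions and hidden sizes are placeholders rather than the architectures studied in the paper.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(              # mixes information across tokens
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(            # mixes information across channels
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                            # x: (B, tokens, dim)
        y = self.norm1(x).transpose(1, 2)            # (B, dim, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)    # cross-token interaction
        x = x + self.channel_mlp(self.norm2(x))      # per-token channel mixing
        return x
```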
1 code implementation • 12 Jul 2022 • Hao Zhang, Lechao Cheng, Yanbin Hao, Chong-Wah Ngo
By replacing a vanilla 2D attention with the LAPS, we could adapt a static transformer into a video one, with zero extra parameters and negligible computation overhead (~2.6%).
1 code implementation • 20 Apr 2022 • Yanbin Hao, Shuo Wang, Pei Cao, Xinjian Gao, Tong Xu, Jinmeng Wu, Xiangnan He
Attention mechanisms have significantly boosted the performance of video classification neural networks thanks to the utilization of perspective contexts.
1 code implementation • CVPR 2022 • Yanbin Hao, Hao Zhang, Chong-Wah Ngo, Xiangnan He
By utilizing calibrators to embed feature with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities.
Ranked #3 on Egocentric Activity Recognition on EGTEA
3 code implementations • 5 Aug 2021 • Hao Zhang, Yanbin Hao, Chong-Wah Ngo
It is worth noting that our TokShift transformer is a pilot pure convolution-free video transformer with computational efficiency for video understanding.
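A minimal temporal token-shift sketch conveys the parameter-free idea; the choice of which tokens and channels to shift is simplified relative to TokShift.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """x: (B, T, N, C) video token features; shifts a fraction of channels across frames."""
    B, T, N, C = x.shape
    fold = int(C * shift_ratio)
    out = x.clone()
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    return out
```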
no code implementations • 17 Mar 2021 • Zhenguang Liu, Kedi Lyu, Shuang Wu, Haipeng Chen, Yanbin Hao, Shouling Ji
Our method is compelling in that it enables manipulable motion prediction across activity types and allows customization of the human movement in a variety of fine-grained ways.
1 code implementation • ICCV 2021 • Zhenguang Liu, Pengxiang Su, Shuang Wu, Xuanjing Shen, Haipeng Chen, Yanbin Hao, Meng Wang
Predicting human motion from a historical pose sequence is at the core of many applications in computer vision.
no code implementations • LREC 2020 • Jinmeng Wu, Yanbin Hao
In addition to the context information captured at each word position, we incorporate a new quantity, the context-information jump, to facilitate the attention weight formulation.