no code implementations • 27 Feb 2025 • Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, dynamically interacts with the historical observation-action sequence, and consolidates it into fixed-length behavior tokens; and (2) an MLLM that aligns the behavior tokens with open-ended language instructions to predict actions auto-regressively.
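The fixed-length consolidation step can be pictured as a Perceiver-style cross-attention resampler. The sketch below is an assumed illustration (token count, dimensions, and the single-layer design are hypothetical), not GOAP's actual encoder:

```python
# Minimal sketch: compress a variable-length observation-action history into a
# fixed number of "behavior tokens" via cross-attention from learned queries.
# All sizes here are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class BehaviorTokenizer(nn.Module):
    def __init__(self, n_tokens=32, dim=512, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, dim))  # learned slots
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, history):
        # history: (batch, seq_len, dim) fused observation-action features
        q = self.queries.unsqueeze(0).expand(history.shape[0], -1, -1)
        tokens, _ = self.attn(q, history, history)  # (batch, n_tokens, dim)
        return tokens  # fixed length regardless of seq_len
```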
no code implementations • 25 Jan 2025 • Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, Zheng Zhang
Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length.
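As a concrete picture of the kernel trick involved (a generic sketch, not this paper's specific method): with a positive feature map phi, the O(N²) softmax score matrix is avoided by reassociating the matrix products.

```python
# Generic kernelized linear attention: phi(Q) @ (phi(K)^T V), normalized by
# phi(Q) @ sum(phi(K)). Cost is O(N * d^2) instead of O(N^2 * d).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim)
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1        # common positive feature map
    kv = torch.einsum("bnd,bne->bde", phi_k, v)      # (batch, dim, dim)
    z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1))  # normalizer
    return torch.einsum("bnd,bde->bne", phi_q, kv) / (z.unsqueeze(-1) + eps)
```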
1 code implementation • 20 Jan 2025 • Zheng Chong, Wenqing Zhang, Shiyue Zhang, Jun Zheng, Xiao Dong, Haoxiang Li, Yiling Wu, Dongmei Jiang, Xiaodan Liang
Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
2 code implementations • 16 Dec 2024 • Wenyun Li, Zheng Zhang, Xiangyuan Lan, Dongmei Jiang
Extensive experiments on two high-resolution face recognition datasets validate that our TCA² method can generate natural text-guided adversarial impersonation faces with high transferability.
no code implementations • 1 Dec 2024 • Yan Li, Yifei Xing, Xiangyuan Lan, Xin Li, Haifeng Chen, Dongmei Jiang
Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
no code implementations • 19 Nov 2024 • Zhehan Kan, Ce Zhang, Zihan Liao, Yapeng Tian, Wenming Yang, Junyuan Xiao, Xu Li, Dongmei Jiang, YaoWei Wang, Qingmin Liao
Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems.
no code implementations • 8 Oct 2024 • Yifei Xing, Xiangyuan Lan, Ruiping Wang, Dongmei Jiang, Wenjun Huang, Qingfang Zheng, YaoWei Wang
In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information.
no code implementations • 4 Sep 2024 • Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua
Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion.
Facial Expression Recognition (FER)
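A hypothetical prompt template illustrating the three-stage CoT structure described above (the wording and label set are assumptions, not the paper's actual prompts):

```python
# Illustrative three-stage CoT prompt for facial expression recognition.
COT_PROMPT = """You are analyzing a facial expression.
Step 1 - Key observations: describe the visible facial cues
(eyebrows, eyes, mouth, overall muscle tension).
Step 2 - Overall emotional interpretation: reason about what these
cues jointly suggest about the person's emotional state.
Step 3 - Conclusion: output a single expression label from {labels}."""

prompt = COT_PROMPT.format(
    labels="happy, sad, angry, surprised, fearful, disgusted, neutral"
)
```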
1 code implementation • 7 Aug 2024 • Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and Experience-Driven Reflector, enabling better planning and reflection on long-horizon tasks in Minecraft.
1 code implementation • 10 Jul 2024 • Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, YaoWei Wang, Xiangyuan Lan, Xiaodan Liang
To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
Ranked #5 on Zero-Shot Object Detection on MSCOCO (AP metric, using extra training data)
1 code implementation • 28 Apr 2024 • Yong Dai, Xiaopeng Hong, Yabin Wang, Zhiheng Ma, Dongmei Jiang, YaoWei Wang
In contrast to conventional methods that employ hard prompt selection, PGM assigns different coefficients to prompts from a fixed-size pool and generates tailored prompts.
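A minimal sketch of the coefficient-weighted prompt pooling idea (pool size, coefficient network, and dimensions are assumed for illustration):

```python
# Soft prompt mixing: instead of hard-selecting prompts from the pool,
# a query feature produces coefficients that blend all pooled prompts.
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    def __init__(self, pool_size=10, prompt_len=8, dim=768):
        super().__init__()
        self.pool = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.to_coeff = nn.Linear(dim, pool_size)  # query -> mixing weights

    def forward(self, query):
        # query: (batch, dim), e.g. a frozen encoder's [CLS] feature
        coeffs = self.to_coeff(query).softmax(dim=-1)           # (batch, pool_size)
        return torch.einsum("bp,pld->bld", coeffs, self.pool)   # tailored prompts
```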
1 code implementation • CVPR 2024 • Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, YaoWei Wang, Chun Yuan
Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations.
1 code implementation • 25 Feb 2024 • Feng Lu, Shuting Dong, Lijun Zhang, Bingxi Liu, Xiangyuan Lan, Dongmei Jiang, Chun Yuan
Moreover, we design a loss based on the re-projection error of inliers to train the DHE network without additional homography labels; the network can also be trained jointly with the backbone to help it extract features better suited to local matching.
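The inlier re-projection error can be sketched as follows (a generic formulation; variable names and any robust weighting are assumptions):

```python
# Re-projection error of inliers for a predicted homography H:
# project inlier keypoints with H and penalize the distance to their matches.
import torch

def inlier_reprojection_loss(H, pts, pts_ref, eps=1e-8):
    # H: (batch, 3, 3); pts, pts_ref: (batch, n, 2) matched inlier coordinates
    ones = torch.ones_like(pts[..., :1])
    homog = torch.cat([pts, ones], dim=-1)           # homogeneous coordinates
    proj = torch.einsum("bij,bnj->bni", H, homog)    # apply homography
    proj = proj[..., :2] / (proj[..., 2:3] + eps)    # back to Euclidean
    return (proj - pts_ref).norm(dim=-1).mean()      # mean L2 re-projection error
```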
no code implementations • 12 Jan 2024 • Zaijing Li, Gongwei Chen, Rui Shao, Yuquan Xie, Dongmei Jiang, Liqiang Nie
In this paper, we propose the Emotional Chain-of-Thought (ECoT), a plug-and-play prompting method that enhances the performance of LLMs on various emotional generation tasks by aligning with human emotional intelligence guidelines.
no code implementations • 13 Aug 2023 • Yi Zhang, Jitao Sang, Junyang Wang, Dongmei Jiang, YaoWei Wang
To this end, we propose Shortcut Debiasing, which first transfers the target task's learning of bias attributes from bias features to shortcut features, and then employs causal intervention to eliminate the shortcut features during inference.
1 code implementation • ICCV 2023 • Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, YaoWei Wang, JianGuo Zhang
Finally, building on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to strengthen token interactions within local regions.
no code implementations • 20 Mar 2023 • Fan Cui, Liyong Guo, Lang He, Jiyao Liu, Ercheng Pei, Yujun Wang, Dongmei Jiang
Electroencephalography (EEG) plays a vital role in detecting how the brain responds to different stimuli.
no code implementations • 13 Oct 2021 • Jiyao Liu, Yanxi Zhao, Hao Wu, Dongmei Jiang
The proposed module, denoted PST-Attention, consists of Positional, Spectral, and Temporal Attention modules that explore more discriminative EEG features.
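One way to picture the three-way decomposition (a simplified, assumed squeeze-and-excitation-style gating; the paper's attention blocks may be more elaborate):

```python
# Axis-wise attention over EEG features shaped (batch, positions, bands, time):
# each module learns a gate over one axis by pooling over the other two.
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    def __init__(self, size, axis):
        super().__init__()
        self.axis = axis
        self.gate = nn.Sequential(nn.Linear(size, size), nn.Sigmoid())

    def forward(self, x):  # x: (batch, n_pos, n_spec, n_time)
        pool_dims = [d for d in (1, 2, 3) if d != self.axis]
        g = self.gate(x.mean(dim=pool_dims))   # (batch, size)
        shape = [x.shape[0], 1, 1, 1]
        shape[self.axis] = x.shape[self.axis]
        return x * g.view(shape)

class PSTAttention(nn.Module):
    def __init__(self, n_pos, n_spec, n_time):
        super().__init__()
        self.blocks = nn.Sequential(
            AxisAttention(n_pos, 1),   # positional (electrode) attention
            AxisAttention(n_spec, 2),  # spectral (frequency-band) attention
            AxisAttention(n_time, 3),  # temporal attention
        )

    def forward(self, x):
        return self.blocks(x)
```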
no code implementations • 20 Mar 2021 • Congqi Cao, Yue Lu, Yifan Zhang, Dongmei Jiang, Yanning Zhang
Inspired by the 2D criss-cross attention used in segmentation tasks, we propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range spatiotemporal context in video for action recognition.
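A simplified sketch of the criss-cross idea in 3D (this generic version computes attention along each axis separately and sums the results; the paper's affinity computation and recurrence details may differ):

```python
# Criss-cross-style 3D attention: each position attends only to positions that
# share its temporal, height, or width axis; recurring the step lets
# information propagate across the whole volume.
import torch
import torch.nn as nn

def axis_attention(q, k, v, axis):
    # q, k, v: (batch, d, T, H, W); attend along axis 2 (T), 3 (H), or 4 (W)
    qm, km, vm = (t.movedim(axis, -1) for t in (q, k, v))
    b, d, *rest, L = qm.shape
    qf = qm.reshape(b, d, -1, L).permute(0, 2, 3, 1)   # (b, rest, L, d)
    kf = km.reshape(b, d, -1, L).permute(0, 2, 3, 1)
    vf = vm.reshape(b, d, -1, L).permute(0, 2, 3, 1)
    attn = torch.softmax(qf @ kf.transpose(-1, -2) / d ** 0.5, dim=-1)
    out = (attn @ vf).permute(0, 3, 1, 2).reshape(b, d, *rest, L)
    return out.movedim(-1, axis)

class RCCA3D(nn.Module):
    def __init__(self, channels, steps=2):
        super().__init__()
        self.q = nn.Conv3d(channels, channels, 1)
        self.k = nn.Conv3d(channels, channels, 1)
        self.v = nn.Conv3d(channels, channels, 1)
        self.steps = steps

    def forward(self, x):  # x: (batch, channels, T, H, W)
        for _ in range(self.steps):
            q, k, v = self.q(x), self.k(x), self.v(x)
            x = x + sum(axis_attention(q, k, v, ax) for ax in (2, 3, 4))
        return x
```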
no code implementations • 28 Nov 2016 • Meshia Cédric Oveneke, Mitchel Aliosha-Perez, Yong Zhao, Dongmei Jiang, Hichem Sahli
The omnipresence of deep learning architectures such as deep convolutional neural networks (CNNs) is fueled by the synergistic combination of ever-increasing labeled datasets and specialized hardware.