no code implementations • 21 Apr 2025 • Junchen Fu, Xuri Ge, Xin Xin, HaiTao Yu, Yue Feng, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose
Multimodal representation learning has garnered significant attention in the AI community, largely due to the success of large pre-trained multimodal foundation models like LLaMA, GPT, Mistral, and CLIP.
1 code implementation • 14 Apr 2025 • Junchen Fu, Yongxin Ni, Joemon M. Jose, Ioannis Arapakis, Kaiwen Zheng, Youhua Li, Xuri Ge
Leveraging the fully decoupled side adapter-based paradigm, CROSSAN achieves high efficiency while enabling cross-modal learning across diverse modalities.
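As a rough illustration of the fully decoupled side-adapter idea named above (not the CROSSAN implementation itself), the sketch below trains a small side network over a frozen backbone's per-layer hidden states; the class names, dimensions, and layer count are all hypothetical.

```python
import torch
import torch.nn as nn

class SideAdapter(nn.Module):
    """Small bottleneck block adapting one frozen backbone layer's output."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(h)))

class DecoupledSideNetwork(nn.Module):
    """Runs alongside a frozen backbone: consumes its per-layer hidden states
    and accumulates an adapted representation without back-propagating into it."""
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.adapters = nn.ModuleList([SideAdapter(hidden_dim) for _ in range(num_layers)])

    def forward(self, layer_states: list[torch.Tensor]) -> torch.Tensor:
        out = torch.zeros_like(layer_states[0])
        for adapter, h in zip(self.adapters, layer_states):
            out = out + adapter(h.detach())  # detach: gradients never reach the backbone
        return out

# Usage with hypothetical backbone outputs (batch=2, seq=8, dim=768, 12 layers).
states = [torch.randn(2, 8, 768) for _ in range(12)]
side = DecoupledSideNetwork(num_layers=12, hidden_dim=768)
fused = side(states)  # only the side network's parameters receive gradients
```

Because the side network never back-propagates through the backbone, its training cost scales with the adapters alone, which is what makes this family of methods efficient.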
no code implementations • 14 Apr 2025 • Kaiwen Zheng, Xuri Ge, Junchen Fu, Jun Peng, Joemon M. Jose
First, we compile a new Multimodal Face Dataset (MFA) by leveraging GPT-4o to generate detailed multilevel language descriptions of faces, incorporating Action Unit (AU) and emotion descriptions.
no code implementations • 18 Feb 2025 • Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
(iii) How well do various LLMs and video generators perform in the popular micro-video generation task?
no code implementations • 10 Dec 2024 • Fuhai Chen, Pengpeng Huang, Xuri Ge, Jie Huang, Zishuo Bao
However, multimodal sentiment analysis is affected by unimodal data bias; e.g., text sentiment can be misleading due to explicit sentiment semantics, leading to low accuracy in the final sentiment classification.
no code implementations • 5 Nov 2024 • Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Kaiwen Zheng, Yongxin Ni, Joemon M. Jose
To overcome this, we developed IISAN-Versa, a versatile plug-and-play architecture compatible with both symmetrical and asymmetrical MFMs.
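The sketch below only illustrates one issue any architecture supporting asymmetrical MFMs has to handle, namely fusing towers of different widths; the projection-based head shown here, along with its names and dimensions, is an assumption and not the IISAN-Versa design.

```python
import torch
import torch.nn as nn

class AsymmetricFusionHead(nn.Module):
    """Projects hidden states from two towers of different widths (e.g. a large
    text LLM and a smaller vision encoder) into a shared space before fusion."""
    def __init__(self, text_dim: int, vision_dim: int, shared_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.fuse = nn.Linear(2 * shared_dim, shared_dim)

    def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_h.mean(dim=1))      # pool over tokens
        v = self.vision_proj(vision_h.mean(dim=1))  # pool over patches
        return self.fuse(torch.cat([t, v], dim=-1))

# Hypothetical asymmetric setup: 4096-dim LLM states vs. 768-dim vision states.
head = AsymmetricFusionHead(text_dim=4096, vision_dim=768)
item_emb = head(torch.randn(2, 16, 4096), torch.randn(2, 49, 768))
```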
no code implementations • 27 Oct 2024 • Zihan Wang, Xuri Ge, Joemon M. Jose, HaiTao Yu, Weizhi Ma, Zhaochun Ren, Xin Xin
At the end of the workshop, we aim to have a clearer understanding of how to improve the reliability and applicability of RAG with more robust information retrieval and language generation.
no code implementations • 11 Oct 2024 • Songpei Xu, Xuri Ge, Chaitanya Kaul, Roderick Murray-Smith
We present a novel Hand-pose Embedding Interactive System (HpEIS) as a virtual sensor, which maps users' flexible hand poses to a two-dimensional visual space using a Variational Autoencoder (VAE) trained on a variety of hand poses.
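A minimal sketch of the kind of VAE such a virtual sensor could rely on, assuming a hand pose is given as 21 keypoints with 3 coordinates each (63 values); the architecture, dimensions, and names below are illustrative, not those of HpEIS.

```python
import torch
import torch.nn as nn

class HandPoseVAE(nn.Module):
    """VAE with a 2-D latent space, so each hand pose maps to a point
    on a plane that can drive an on-screen cursor."""
    def __init__(self, input_dim: int = 63, hidden_dim: int = 128, latent_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    recon_term = nn.functional.mse_loss(recon, x, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_term

# A pose is assumed here to be 21 hand keypoints x 3 coordinates = 63 values.
poses = torch.randn(32, 63)
model = HandPoseVAE()
recon, mu, logvar = model(poses)
loss = vae_loss(poses, recon, mu, logvar)
cursor_xy = mu  # the 2-D latent mean serves as the on-screen position for each pose
```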
no code implementations • 1 Aug 2024 • Xuri Ge, Junchen Fu, Fuhai Chen, Shan An, Nicu Sebe, Joemon M. Jose
Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis.
no code implementations • 5 Jun 2024 • Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Jie Wang, Joemon M. Jose
In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling.
no code implementations • 26 May 2024 • Tong Shi, Xuri Ge, Joemon M. Jose, Nicolas Pugeault, Paul Henderson
Capturing complex temporal relationships between video and audio modalities is vital for Audio-Visual Emotion Recognition (AVER).
1 code implementation • 26 Apr 2024 • Xuri Ge, Songpei Xu, Fuhai Chen, Jie Wang, Guoxin Wang, Shan An, Joemon M. Jose
In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval.
Ranked #1 on Cross-Modal Retrieval on MSCOCO
2 code implementations • 2 Apr 2024 • Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Jie Wang, Joemon M. Jose
This is also a notable improvement over Adapter and LoRA, which require 37-39 GB of GPU memory and 350-380 seconds per epoch for training.
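For context, the LoRA baseline referred to here injects trainable low-rank matrices into frozen linear layers; a minimal sketch (with hypothetical rank and dimensions) looks roughly like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Replace one projection of a hypothetical transformer block with its LoRA-wrapped version.
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj)
y = lora_proj(torch.randn(4, 768))
trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # only A and B are updated
```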
no code implementations • 23 Feb 2024 • Zijun Long, Xuri Ge, Richard McCreadie, Joemon Jose
Text-to-image retrieval aims to find the relevant images based on a text query, which is important in various use-cases, such as digital libraries, e-commerce, and multimedia databases.
no code implementations • 6 Jul 2023 • Fuxiang Tao, Wei Ma, Xuri Ge, Anna Esposito, Alessandro Vinciarelli
The results show that the models used in the experiments improve in terms of training speed and performance when fed with feature correlation matrices rather than with feature vectors.
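A hedged sketch of the input transformation being compared, assuming frame-level acoustic feature vectors: the stack of per-frame features is replaced by a single feature-by-feature Pearson correlation matrix. Shapes and names are illustrative.

```python
import numpy as np

def correlation_matrix(features: np.ndarray) -> np.ndarray:
    """Turn a (num_frames, num_features) matrix of per-frame feature vectors
    into a (num_features, num_features) Pearson correlation matrix."""
    return np.corrcoef(features, rowvar=False)

# Hypothetical example: 500 speech frames, 40 acoustic features per frame.
frames = np.random.randn(500, 40)
corr = correlation_matrix(frames)    # shape (40, 40), values in [-1, 1]
model_input = corr[np.newaxis, ...]  # e.g. fed to the model as a single-channel "image"
```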
no code implementations • 17 Oct 2022 • Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Joemon M. Jose
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
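The snippet below sketches generic word-to-region cross-attention of the kind such interactive attention builds on; it is not the paper's cross-level formulation, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class WordRegionAttention(nn.Module):
    """Generic cross-modal attention: each word attends over image regions,
    producing a visually grounded representation of the word."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words: (batch, num_words, dim), regions: (batch, num_regions, dim)
        attn = torch.softmax(self.q(words) @ self.k(regions).transpose(1, 2) * self.scale, dim=-1)
        return attn @ self.v(regions)  # (batch, num_words, dim)

# Hypothetical sizes: 12 words attending over 36 detected regions.
attended = WordRegionAttention()(torch.randn(4, 12, 256), torch.randn(4, 36, 256))
```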
no code implementations • 4 Apr 2022 • Xuri Ge, Joemon M. Jose, Songpei Xu, Xiao Liu, Hu Han
While region-level feature learning from local face patch features via a graph neural network can encode the correlations across different AUs, pixel-wise and channel-wise feature learning via a graph attention network can enhance the discrimination ability of AU features derived from global face features.
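As a rough sketch of graph-based AU feature learning (not the paper's actual layers), the example below performs one round of message passing over per-AU region features using a learned, row-normalised relation matrix; the AU count and feature width are hypothetical.

```python
import torch
import torch.nn as nn

class AUGraphLayer(nn.Module):
    """One round of message passing over AU region features: each AU node
    aggregates its neighbours through a learned adjacency-weighted sum."""
    def __init__(self, num_aus: int, dim: int):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_aus) + 0.01 * torch.randn(num_aus, num_aus))
        self.transform = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (batch, num_aus, dim)
        adj = torch.softmax(self.adj, dim=-1)  # row-normalised relation weights
        messages = torch.einsum("ij,bjd->bid", adj, node_feats)
        return torch.relu(self.transform(messages))

# Hypothetical setup: 12 AUs, each described by a 256-d local patch feature.
feats = torch.randn(8, 12, 256)
layer = AUGraphLayer(num_aus=12, dim=256)
refined = layer(feats)  # AU features now carry information from correlated AUs
```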
no code implementations • 12 Mar 2022 • Fuhai Chen, Rongrong Ji, Chengpeng Dai, Xuri Ge, Shengchuang Zhang, Xiaojing Ma, Yue Gao
Echocardiography is widely used in clinical practice for diagnosis and treatment, e.g., of common congenital heart defects.
no code implementations • 12 Mar 2022 • Fuhai Chen, Xuri Ge, Xiaoshuai Sun, Yue Gao, Jianzhuang Liu, Fufeng Chen, Wenjie Li
The key to referring expression comprehension lies in capturing the cross-modal visual-linguistic relevance.
no code implementations • 3 Mar 2022 • Xuri Ge, Joemon M. Jose, Pengcheng Wang, Arunachalam Iyer, Xiao Liu, Hu Han
In this paper, we propose a novel Adaptive Local-Global Relational Network (ALGRNet) for facial AU detection and use it to classify facial paralysis severity.
no code implementations • 5 Aug 2021 • Xuri Ge, Fuhai Chen, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, Xiao Liu
In this work, we propose to address the above issue from two aspects: (i) constructing intrinsic structure (along with relations) among the fragments of respective modalities, e.g., "dog → play → ball" in the semantic structure of an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities.
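As an illustration of point (i) only, the toy snippet below represents the "dog → play → ball" structure as a tiny directed graph over fragment embeddings and lets one propagation step mix neighbouring fragments; it is a hedged sketch, not the paper's model, and the embedding size is arbitrary.

```python
import torch
import torch.nn as nn

# A toy semantic structure for one image, written as (subject, predicate, object) triplets.
triplets = [("dog", "play", "ball")]
nodes = sorted({t for s, p, o in triplets for t in (s, p, o)})
index = {name: i for i, name in enumerate(nodes)}

# Build a directed adjacency following the chain subject -> predicate -> object.
adj = torch.zeros(len(nodes), len(nodes))
for s, p, o in triplets:
    adj[index[s], index[p]] = 1.0
    adj[index[p], index[o]] = 1.0

# One propagation step lets each fragment embedding absorb its structural neighbours.
embed = nn.Embedding(len(nodes), 128)
feats = embed(torch.arange(len(nodes)))
propagated = torch.relu(adj @ feats + feats)  # (num_nodes, 128)
```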
no code implementations • NeurIPS 2019 • Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang
To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema.