no code implementations • 7 Mar 2025 • Jiaxing Zhao, Xihan Wei, Liefeng Bo
In this work, we present the first application of Reinforcement Learning with Verifiable Reward (RLVR) to an Omni-multimodal large language model in the context of emotion recognition, a task where both visual and audio modalities play crucial roles.
no code implementations • 25 Jan 2025 • Jiaxing Zhao, Qize Yang, Yixing Peng, Detao Bai, Shimin Yao, Boyuan Sun, Xiang Chen, Shenghao Fu, Weixuan Chen, Xihan Wei, Liefeng Bo
In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial.
1 code implementation • 14 Jan 2025 • Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei
Facial expression captioning has found widespread application across various domains.
no code implementations • 9 Jan 2025 • Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei, Qibin Hou
In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model.
no code implementations • 17 Apr 2024 • Jiaxing Zhao, Peng Zheng, Rui Ma
To address this issue, we propose D-Aug, a LiDAR data augmentation method tailored for augmenting dynamic scenes.
no code implementations • CVPR 2020 • Jiaxing Zhao
In open-set object detection, the alignment of visual and text features is one of the most important factors affecting final detection performance.