no code implementations • 12 Dec 2024 • Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See
Additionally, most AL methods seldom address multimodal data, highlighting a research gap in this field.
1 code implementation • 15 Oct 2024 • Yake Wei, Di Hu, Henghui Du, Ji-Rong Wen
On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies are then proposed to modulate the optimization of each modality by monitoring the discriminative discrepancy between modalities during training.
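As a rough illustration of the modulation idea, the sketch below computes per-modality confidence on the ground-truth class and down-weights the dominant modality's gradients; the coefficient formula and names are simplifications, not the paper's exact OPM/OGM rules.

```python
import math
import torch

def modulation_coeffs(logits_a, logits_v, labels, alpha=0.5):
    """Toy on-the-fly modulation: estimate each modality's confidence on
    the ground-truth class, then slow down whichever modality currently
    dominates (illustrative simplification, not the exact OGM rule)."""
    conf_a = torch.softmax(logits_a, dim=1).gather(1, labels[:, None]).mean()
    conf_v = torch.softmax(logits_v, dim=1).gather(1, labels[:, None]).mean()
    ratio = (conf_a / (conf_v + 1e-8)).item()   # discriminative discrepancy
    k_a = 1.0 - math.tanh(alpha * (ratio - 1.0)) if ratio > 1.0 else 1.0
    k_v = 1.0 - math.tanh(alpha * (1.0 / ratio - 1.0)) if ratio < 1.0 else 1.0
    return k_a, k_v

# After loss.backward(), scale each encoder's gradients:
#   for p in audio_encoder.parameters():  p.grad.mul_(k_a)
#   for p in visual_encoder.parameters(): p.grad.mul_(k_v)
```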
no code implementations • 6 Aug 2024 • Jingxian Lu, Wenke Xia, Dong Wang, Zhigang Wang, Bin Zhao, Di Hu, Xuelong Li
Within the intervals between semantic key states, optical flow is employed to capture motion key states to understand the mechanisms of "how to do".
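For intuition, motion key states can be surfaced by ranking frame transitions by optical-flow magnitude; the selection heuristic below (OpenCV Farneback flow) is an assumption, not the paper's exact criterion.

```python
import cv2
import numpy as np

def motion_key_frames(frames, top_k=3):
    """Rank frame transitions by mean optical-flow magnitude and keep
    the most dynamic ones as candidate motion key states."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        g0 = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(nxt, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).mean())
    return np.argsort(mags)[-top_k:]   # indices of high-motion transitions
```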
no code implementations • 2 Aug 2024 • Ruoxuan Feng, Di Hu, Wenke Ma, Xuelong Li
Humans possess a remarkable talent for flexibly alternating between different senses when interacting with the environment.
1 code implementation • 30 Jul 2024 • Guangyao Li, Henghui Du, Di Hu
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
no code implementations • 26 Jul 2024 • Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen
To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
no code implementations • 23 Jul 2024 • Peiwen Sun, Honggang Zhang, Di Hu
To counter audio priming bias and enhance sensitivity to different audio intensities and semantics, a perception module dedicated to audio extracts the latent semantic information and incorporates it into a limited set of queries, namely active queries.
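A hedged sketch of the active-query idea, assuming a cross-attention design in which a small set of learnable queries absorbs audio semantics (module name and shapes are hypothetical):

```python
import torch
import torch.nn as nn

class ActiveQueryPrimer(nn.Module):
    """A small set of learnable queries absorbs latent audio semantics
    via cross-attention, so downstream decoding stays audio-sensitive.
    Hypothetical sketch, not the paper's exact module."""
    def __init__(self, num_queries=16, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens):                 # (B, T, dim)
        b = audio_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        primed, _ = self.cross_attn(q, audio_tokens, audio_tokens)
        return primed                                # (B, num_queries, dim)
```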
1 code implementation • 16 Jul 2024 • Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes.
1 code implementation • 15 Jul 2024 • Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu
The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues.
no code implementations • 15 Jul 2024 • Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues.
1 code implementation • 12 Jul 2024 • Yake Wei, Siwei Li, Ruoxuan Feng, Di Hu
In this way, over-emphasis on scarcely informative modalities is avoided.
1 code implementation • 28 Jun 2024 • Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ze-Feng Gao, Yueguo Chen, Weizheng Lu, Ji-Rong Wen
This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters.
1 code implementation • 1 Jun 2024 • Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li
To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object.
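A compact sketch of such a two-headed pre-training objective, in the spirit of MPI but not its exact formulation (loss choices are assumptions):

```python
import torch.nn.functional as F

def mpi_style_losses(pred_frame, gt_frame, pred_box, gt_box, w_det=1.0):
    """Jointly reconstruct the transition frame and localize the
    interaction object (hypothetical simplification of MPI)."""
    loss_pred = F.l1_loss(pred_frame, gt_frame)     # 'predict the transition'
    loss_det = F.smooth_l1_loss(pred_box, gt_box)   # 'detect the object'
    return loss_pred + w_det * loss_det
```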
1 code implementation • 28 May 2024 • Yake Wei, Di Hu
However, in this paper, we identify the previously ignored gradient conflict between multimodal and unimodal learning objectives, potentially misleading the unimodal encoder optimization.
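One standard way to detect and resolve such a conflict is gradient projection in the style of PCGrad, shown below as an illustration rather than the paper's own remedy:

```python
import torch

def deconflict(g_multi, g_uni):
    """If the multimodal and unimodal gradients point apart (negative
    inner product), remove the conflicting component from one of them."""
    dot = torch.dot(g_multi, g_uni)
    if dot < 0:                                        # conflict detected
        g_multi = g_multi - (dot / g_uni.norm().pow(2).clamp(min=1e-12)) * g_uni
    return g_multi
```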
no code implementations • 27 Apr 2024 • Qingyang Zhang, Yake Wei, Zongbo Han, Huazhu Fu, Xi Peng, Cheng Deng, Qinghua Hu, Cai Xu, Jie Wen, Di Hu, Changqing Zhang
Multimodal fusion focuses on integrating information from multiple modalities with the goal of more accurate prediction, which has achieved remarkable progress in a wide range of scenarios, including autonomous driving and medical diagnosis.
no code implementations • 15 Mar 2024 • Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li
Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to better generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship via text-object correspondence to better exploit the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristics to ensure the boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, reducing FID by around 35% on average (a relative reduction).
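For intuition on the rotation-invariance point: rotating a sphere about its vertical axis is just a horizontal roll in equirectangular coordinates, so rotation-based augmentation is nearly free (an illustrative trick, not the paper's full training scheme):

```python
import numpy as np

def random_yaw_rotation(equirect):
    """Horizontally roll an equirectangular panorama (H, W, C): a lossless
    spherical rotation about the vertical axis, usable as augmentation."""
    shift = np.random.randint(equirect.shape[1])
    return np.roll(equirect, shift, axis=1)
```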
1 code implementation • 9 Feb 2024 • Zequn Yang, Yake Wei, Ce Liang, Di Hu
Moreover, our analysis reveals how a widespread issue, namely that the model prefers some modalities over others, limits multi-modal robustness by influencing these essential components, and can make attacks on the preferred modality highly effective.
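A minimal probe of this effect is a single-modality FGSM attack, sketched below; the two-stream model interface and epsilon are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_one_modality(model, x_a, x_v, y, eps=8 / 255, target="audio"):
    """Perturb only one input stream to test whether the fusion model's
    preference for that modality makes the attack disproportionately
    effective (illustrative sketch)."""
    x_a = x_a.clone().requires_grad_(target == "audio")
    x_v = x_v.clone().requires_grad_(target == "visual")
    loss = F.cross_entropy(model(x_a, x_v), y)
    loss.backward()
    if target == "audio":
        return (x_a + eps * x_a.grad.sign()).detach(), x_v.detach()
    return x_a.detach(), (x_v + eps * x_v.grad.sign()).detach()
```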
2 code implementations • 6 Nov 2023 • Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu, Xuelong Li
Generalizable articulated object manipulation is essential for home-assistant robots.
1 code implementation • 13 Sep 2023 • Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li
Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio?
1 code implementation • CVPR 2024 • Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu
One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities.
1 code implementation • 10 Aug 2023 • Guangyao Li, Wenxuan Hou, Di Hu
Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions, or even act as interference in answering the content of interest.
Ranked #2 on Audio-Visual Question Answering (AVQA) on AVQA
1 code implementation • 6 Jun 2023 • Ziyun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang
Next, using the proposed transfer flow, we conduct various empirical experiments across different levels of semantic similarity, finding that supervised knowledge may hurt NCD performance.
1 code implementation • 29 May 2023 • Guangyao Li, Yixin Xu, Di Hu
Audio question answering (AQA), acting as a widely used proxy task for exploring scene understanding, has received increasing attention.
1 code implementation • 16 Apr 2023 • Wenke Xia, Xingjian Li, Andong Deng, Haoyi Xiong, Dejing Dou, Di Hu
However, such semantic consistency from synchronization is hard to guarantee in unconstrained videos, due to irrelevant modality noise and differentiated semantic correlations.
1 code implementation • 14 Feb 2023 • Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu
Surprisingly, we find that multimodal models with existing imbalance algorithms consistently perform worse than unimodal ones on specific subsets, in accordance with the modality bias.
1 code implementation • 7 Feb 2023 • Ruoxuan Feng, Wenke Xia, Di Hu
Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning.
1 code implementation • 14 Jan 2023 • Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu
Experimental results indicate that the models incorporating large language models (LLM) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall.
no code implementations • 19 Sep 2022 • Ziyun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang
Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset leveraging prior knowledge of a labeled set comprising disjoint but related classes.
no code implementations • 20 Aug 2022 • Yake Wei, Di Hu, Yapeng Tian, Xuelong Li
A comprehensive survey that can systematically organize and analyze studies in the audio-visual field is therefore needed.
no code implementations • 10 Aug 2022 • Yingzi Fan, Longfei Han, Yue Zhang, Lechao Cheng, Chen Xia, Di Hu
This domain discrepancy induces performance degradation of CNN models on target testing data.
2 code implementations • CVPR 2022 • Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, Di Hu
Multimodal learning helps to comprehensively understand the world by integrating different senses.
1 code implementation • CVPR 2022 • Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Ranked #5 on Audio-visual Question Answering on MUSIC-AVQA
no code implementations • 25 Mar 2022 • Xinchi Zhou, Dongzhan Zhou, Wanli Ouyang, Hang Zhou, Ziwei Liu, Di Hu
Recent years have witnessed the success of deep learning on the visual sound separation task.
no code implementations • ICCV 2023 • Andong Deng, Xingjian Li, Di Hu, Tianyang Wang, Haoyi Xiong, Chengzhong Xu
Based on the contradictory phenomenon between FE and FT, where a better feature extractor does not necessarily fine-tune to better accuracy, we conduct comprehensive analyses of the features before the softmax layer to provide insightful explanations.
1 code implementation • 13 Feb 2022 • Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou
Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals.
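The additivity is easy to see numerically: a recorded mixture is literally the sum of its source waveforms, so a single embedding of the mixture entangles all sounding objects (toy example):

```python
import numpy as np

sr = 16000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
source_a = np.sin(2 * np.pi * 220.0 * t)        # toy 'instrument' tone
source_b = 0.5 * np.sin(2 * np.pi * 60.0 * t)   # toy 'engine' hum
mixture = source_a + source_b                   # what the microphone records
```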
1 code implementation • 22 Dec 2021 • Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision.
1 code implementation • 2 Aug 2021 • Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu
By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery.
Ranked #2 on Cross-Modal Retrieval on SoundingEarth
no code implementations • 2 Jun 2021 • Ziyun Li, Xinshao Wang, Di Hu, Neil M. Robertson, David A. Clifton, Christoph Meinel, Haojin Yang
Additionally, CMD covers two special cases: zero-knowledge and all knowledge, leading to a unified MKD framework.
1 code implementation • CVPR 2021 • Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, Errui Ding
Although achieving great success, most of them use only limited data from a single source domain for model pre-training, leaving the rich labeled data insufficiently exploited.
1 code implementation • CVPR 2021 • Yapeng Tian, Di Hu, Chenliang Xu
There are rich synchronized audio and visual events in our daily life.
no code implementations • 1 Jan 2021 • Xiao Zhang, Di Hu, Xingjian Li, Dejing Dou, Ji Wu
We demonstrate using model information as a general analysis tool to gain insight into problems that arise in deep learning.
1 code implementation • 14 Dec 2020 • Dong Wang, Di Hu, Xingjian Li, Dejing Dou
The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos.
Ranked #24 on Action Segmentation on 50 Salads
no code implementations • 16 Oct 2020 • Xingjian Li, Di Hu, Xuhong LI, Haoyi Xiong, Zhi Ye, Zhipeng Wang, Chengzhong Xu, Dejing Dou
Fine-tuning deep neural networks pre-trained on large-scale datasets is one of the most practical transfer learning paradigms when only a limited quantity of training samples is available.
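The paradigm itself is simple to state; a vanilla sketch with torchvision follows (the paper studies what to regularize on top of this, which is not shown):

```python
import torch
import torchvision

# Reuse pre-trained weights, replace the task head, and fine-tune the
# backbone with a smaller learning rate than the new head.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.SGD(
    [
        {"params": model.fc.parameters(), "lr": 1e-2},
        {"params": (p for n, p in model.named_parameters()
                    if not n.startswith("fc")), "lr": 1e-3},
    ],
    momentum=0.9,
)
```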
1 code implementation • NeurIPS 2020 • Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou
First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes.
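A toy version of that aggregation step, with shapes and pooling chosen for illustration:

```python
import torch

def object_prototypes(feats, heatmaps, labels, num_classes):
    """Pool visual features under each sound-localization heatmap, then
    average per category to build a dictionary of object representations.
    feats: (N, C, H, W); heatmaps: (N, 1, H, W) in [0, 1]; labels: (N,)"""
    w = heatmaps.flatten(2)                                  # (N, 1, H*W)
    pooled = (feats.flatten(2) * w).sum(-1) / w.sum(-1).clamp(min=1e-6)
    protos = torch.zeros(num_classes, feats.size(1))
    for c in range(num_classes):
        hit = labels == c
        if hit.any():
            protos[c] = pooled[hit].mean(0)
    return protos                                            # (K, C)
```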
1 code implementation • ECCV 2020 • Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are lacking.
1 code implementation • ECCV 2020 • Di Hu, Xuhong LI, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, Dejing Dou
With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition.
1 code implementation • 14 May 2020 • Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiao Xiang Zhu
Visual crowd counting has been recently studied as a way to enable people counting in crowd scenes from images.
no code implementations • 26 Jan 2020 • Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, Dejing Dou
Associating a sound with its producer in a complex audiovisual scene is a challenging task, especially when we lack annotated training data.
no code implementations • CVPR 2019 • Di Hu, Dong Wang, Xuelong Li, Feiping Nie, Qi Wang
Results on different encoding schemes indicate that using a machine model to accelerate optimization evaluation and reduce experimental cost is feasible to some extent, which could dramatically promote the upgrading of encoding schemes and thus help the blind improve their visual perception ability.
no code implementations • 8 Oct 2018 • Di Hu, Feiping Nie, Xuelong Li
The conventional supervised hashing methods based on classification do not entirely meet the requirements of the hashing technique, but Linear Discriminant Analysis (LDA) does.
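As a rough illustration of the connection, one can project data onto LDA's discriminative directions and binarize by sign; this is a simplification, and the paper's formulation differs in detail:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_hash(x_train, y_train, x_query, n_bits):
    """LDA-flavoured hashing sketch: n_bits must not exceed
    min(n_classes - 1, n_features) for scikit-learn's LDA."""
    lda = LinearDiscriminantAnalysis(n_components=n_bits)
    lda.fit(x_train, y_train)
    return (lda.transform(x_query) > 0).astype(np.uint8)  # binary codes
```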
no code implementations • 8 Oct 2018 • Di Hu, Feiping Nie, Xuelong Li
Hence, learning an effective joint representation by fusing the features of different modalities is highly desirable.
1 code implementation • CVPR 2019 • Di Hu, Feiping Nie, Xuelong Li
Such an integrated multimodal clustering network can be effectively trained with a max-margin loss in an end-to-end fashion.
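The max-margin objective here is the standard hinge ranking loss; a minimal form, with score conventions assumed:

```python
import torch

def max_margin_loss(pos_score, neg_score, margin=1.0):
    """Matched multimodal pairs should outscore mismatched ones by at
    least `margin`; violations are penalized linearly (hinge)."""
    return torch.clamp(margin - pos_score + neg_score, min=0.0).mean()
```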
no code implementations • ICCV 2017 • Xuelong Li, Di Hu, Xiaoqiang Lu
An image is usually taken to express certain emotions or purposes, such as love or celebrating Christmas.
1 code implementation • 17 Aug 2017 • Xuelong Li, Di Hu, Feiping Nie
Based on the analysis, we provide a so-called Deep Binary Reconstruction (DBRC) network that can directly learn the binary hashing codes in an unsupervised fashion.
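A DBRC-flavoured sketch: binarize the bottleneck with a straight-through sign and train on reconstruction alone (simplified relative to the actual architecture):

```python
import torch
import torch.nn as nn

class BinaryReconstructionAE(nn.Module):
    """Unsupervised binary codes via reconstruction: the sign() forward
    pass is made differentiable with a straight-through estimator."""
    def __init__(self, dim_in, n_bits):
        super().__init__()
        self.enc = nn.Linear(dim_in, n_bits)
        self.dec = nn.Linear(n_bits, dim_in)

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        codes = torch.sign(h) + h - h.detach()   # straight-through sign
        return self.dec(codes), (codes > 0)      # reconstruction, bits
```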
no code implementations • CVPR 2016 • Di Hu, Xuelong Li, Xiaoqiang Lu
Recently, audiovisual speech recognition based on the MRBM has attracted much attention, and the MRBM shows its effectiveness in learning the joint representation across audiovisual modalities.