Search Results for author: Di Hu

Found 40 papers, 24 papers with code

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

no code implementations15 Mar 2024 Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and reduces FID by around 35% on average.

Denoising Image Generation

Quantifying and Enhancing Multi-modal Robustness with Modality Preference

no code implementations9 Feb 2024 Zequn Yang, Yake Wei, Ce Liang, Di Hu

Moreover, our analysis reveals how a widespread issue, namely that models have different preferences for modalities, limits multi-modal robustness by influencing the essential components and can make attacks on a specific modality highly effective.

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

1 code implementation13 Sep 2023 Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li

Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio?

CoLA Visual Localization

Enhancing Multimodal Cooperation via Fine-grained Modality Valuation

1 code implementation12 Sep 2023 Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu

One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities.

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

1 code implementation10 Aug 2023 Guangyao Li, Wenxuan Hou, Di Hu

Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions or even act as interference when answering the content of interest.

Audio-visual Question Answering Audio-Visual Question Answering (AVQA) +2

Supervised Knowledge May Hurt Novel Class Discovery Performance

1 code implementation6 Jun 2023 Ziyun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang

Next, using the proposed transfer flow, we conduct various empirical experiments with different levels of semantic similarity, finding that supervised knowledge may hurt NCD performance.

Novel Class Discovery Semantic Similarity +1

Multi-Scale Attention for Audio Question Answering

1 code implementation29 May 2023 Guangyao Li, Yixin Xu, Di Hu

Audio question answering (AQA), acting as a widely used proxy task to explore scene understanding, has gained increasing attention.

Audio Question Answering Question Answering +2

Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

1 code implementation16 Apr 2023 Wenke Xia, Xingjian Li, Andong Deng, Haoyi Xiong, Dejing Dou, Di Hu

However, such semantic consistency from the synchronization is hard to guarantee in unconstrained videos, due to the irrelevant modality noise and differentiated semantic correlation.

Action Recognition Audio Tagging +3

Balanced Audiovisual Dataset for Imbalance Analysis

1 code implementation14 Feb 2023 Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu

We surprisingly find that multimodal models with existing imbalance algorithms consistently perform worse than the unimodal ones on specific subsets, in accordance with the modality bias.

Revisiting Pre-training in Audio-Visual Learning

1 code implementation7 Feb 2023 Ruoxuan Feng, Wenke Xia, Di Hu

Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning.

audio-visual learning

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

1 code implementation14 Jan 2023 Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu

Experimental results indicate that the models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall.

Knowledge Graphs

A Closer Look at Novel Class Discovery from the Labeled Set

no code implementations19 Sep 2022 Ziyun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang

Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset leveraging prior knowledge of a labeled set comprising disjoint but related classes.

Novel Class Discovery Semantic Similarity +1

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

no code implementations20 Aug 2022 Yake Wei, Di Hu, Yapeng Tian, Xuelong Li

A comprehensive survey that systematically organizes and analyzes studies of the audio-visual field is therefore expected.

audio-visual learning Scene Understanding

Balanced Multimodal Learning via On-the-fly Gradient Modulation

1 code implementation CVPR 2022 Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, Di Hu

Multimodal learning helps to comprehensively understand the world, by integrating different senses.

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

1 code implementation CVPR 2022 Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.

audio-visual learning Audio-visual Question Answering +4

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance

no code implementations25 Mar 2022 Xinchi Zhou, Dongzhan Zhou, Wanli Ouyang, Hang Zhou, Ziwei Liu, Di Hu

Recent years have witnessed the success of deep learning on the visual sound separation task.

Towards Inadequately Pre-trained Models in Transfer Learning

no code implementations ICCV 2023 Andong Deng, Xingjian Li, Di Hu, Tianyang Wang, Haoyi Xiong, Chengzhong Xu

Based on the contradictory phenomenon between feature extraction (FE) and fine-tuning (FT), in which a better feature extractor does not necessarily fine-tune better, we conduct comprehensive analyses of features before the softmax layer to provide insightful explanations.

Transfer Learning

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

1 code implementation13 Feb 2022 Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou

Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals.

Class-aware Sounding Objects Localization via Audiovisual Correspondence

1 code implementation22 Dec 2021 Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision.

Object object-detection +3

Self-supervised Audiovisual Representation Learning for Remote Sensing Data

1 code implementation2 Aug 2021 Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery.

Cross-Modal Retrieval Representation Learning +1

Unsupervised Multi-Source Domain Adaptation for Person Re-Identification

1 code implementation CVPR 2021 Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, Errui Ding

Although these methods achieve great success, most of them use only limited data from a single-source domain for model pre-training, leaving the rich labeled data insufficiently exploited.

Person Re-Identification Unsupervised Domain Adaptation

Model information as an analysis tool in deep learning

no code implementations1 Jan 2021 Xiao Zhang, Di Hu, Xingjian Li, Dejing Dou, Ji Wu

We demonstrate using model information as a general analysis tool to gain insight into problems that arise in deep learning.

Temporal Relational Modeling with Self-Supervision for Action Segmentation

1 code implementation14 Dec 2020 Dong Wang, Di Hu, Xingjian Li, Dejing Dou

The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos.

Action Recognition Action Segmentation +1

Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement

no code implementations16 Oct 2020 Xingjian Li, Di Hu, Xuhong LI, Haoyi Xiong, Zhi Ye, Zhipeng Wang, Chengzhong Xu, Dejing Dou

Fine-tuning deep neural networks pre-trained on large-scale datasets is one of the most practical transfer learning paradigms given a limited quantity of training samples.

Disentanglement Transfer Learning

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

1 code implementation NeurIPS 2020 Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes.

Object Object Localization

Multiple Sound Sources Localization from Coarse to Fine

1 code implementation ECCV 2020 Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin

How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are lacking.

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

1 code implementation ECCV 2020 Di Hu, Xuhong LI, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, Dejing Dou

With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for aerial scene recognition.

Scene Recognition

Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions

1 code implementation14 May 2020 Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiao Xiang Zhu

Visual crowd counting has recently been studied as a way to enable people counting in crowd scenes from images.

Crowd Counting

Curriculum Audiovisual Learning

no code implementations26 Jan 2020 Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, Dejing Dou

Associating sound and its producer in a complex audiovisual scene is a challenging task, especially when we lack annotated training data.

Clustering

Listen to the Image

no code implementations CVPR 2019 Di Hu, Dong Wang, Xuelong Li, Feiping Nie, Qi Wang

Experiments with different encoding schemes indicate that using a machine model to accelerate optimization evaluation and reduce experimental cost is feasible to some extent, which could dramatically promote the upgrading of encoding schemes and thus help the blind improve their visual perception ability.

Translation

Deep LDA Hashing

no code implementations8 Oct 2018 Di Hu, Feiping Nie, Xuelong Li

The conventional supervised hashing methods based on classification do not entirely meet the requirements of the hashing technique, but Linear Discriminant Analysis (LDA) does.

Dense Multimodal Fusion for Hierarchically Joint Representation

no code implementations8 Oct 2018 Di Hu, Feiping Nie, Xuelong Li

Hence, learning an effective joint representation by fusing the features of different modalities is highly desirable.

Cross-Modal Retrieval Retrieval +2

Deep Multimodal Clustering for Unsupervised Audiovisual Learning

1 code implementation CVPR 2019 Di Hu, Feiping Nie, Xuelong Li

Such an integrated multimodal clustering network can be effectively trained with a max-margin loss in an end-to-end fashion.

Clustering

Image2song: Song Retrieval via Bridging Image Content and Lyric Words

no code implementations ICCV 2017 Xuelong Li, Di Hu, Xiaoqiang Lu

Images are usually taken to express certain emotions or purposes, such as love or celebrating Christmas.

Retrieval TAG

Deep Binary Reconstruction for Cross-modal Hashing

1 code implementation17 Aug 2017 Xuelong Li, Di Hu, Feiping Nie

Based on the analysis, we provide a so-called Deep Binary Reconstruction (DBRC) network that can directly learn the binary hashing codes in an unsupervised fashion.

Cross-Modal Retrieval Retrieval

Temporal Multimodal Learning in Audiovisual Speech Recognition

no code implementations CVPR 2016 Di Hu, Xuelong Li, Xiaoqiang Lu

Recently, audiovisual speech recognition based on the MRBM has attracted much attention, and the MRBM shows its effectiveness in learning the joint representation across audiovisual modalities.

Multimodal Deep Learning speech-recognition +1
