Search Results for author: Di Hu

Found 40 papers, 24 papers with code

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

no code implementations15 Mar 2024 Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and reduces FID by around 35% on average.

Denoising Image Generation

Quantifying and Enhancing Multi-modal Robustness with Modality Preference

no code implementations9 Feb 2024 Zequn Yang, Yake Wei, Ce Liang, Di Hu

Moreover, our analysis reveals how a widespread issue, namely that models have different preferences for modalities, limits multi-modal robustness by influencing the essential components and can make attacks on a specific modality highly effective.

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

1 code implementation13 Sep 2023 Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li

Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio?

CoLA Visual Localization

Enhancing Multimodal Cooperation via Fine-grained Modality Valuation

1 code implementation12 Sep 2023 Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu

One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities.

Progressive Spatio-temporal Perception for Audio-Visual Question Answering

1 code implementation10 Aug 2023 Guangyao Li, Wenxuan Hou, Di Hu

Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions or even act as interference when answering the content of interest.

Audio-visual Question Answering Audio-Visual Question Answering (AVQA) +2

Supervised Knowledge May Hurt Novel Class Discovery Performance

1 code implementation6 Jun 2023 Ziyun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang

Next, using the proposed transfer flow, we conduct various empirical experiments with different levels of semantic similarity, finding that supervised knowledge may hurt NCD performance.

Novel Class Discovery Semantic Similarity +1

Multi-Scale Attention for Audio Question Answering

1 code implementation29 May 2023 Guangyao Li, Yixin Xu, Di Hu

Audio question answering (AQA), acting as a widely used proxy task to explore scene understanding, has gained increasing attention.

Audio Question Answering Question Answering +2

Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

1 code implementation16 Apr 2023 Wenke Xia, Xingjian Li, Andong Deng, Haoyi Xiong, Dejing Dou, Di Hu

However, such semantic consistency from the synchronization is hard to guarantee in unconstrained videos, due to the irrelevant modality noise and differentiated semantic correlation.

Action Recognition Audio Tagging +3

Balanced Audiovisual Dataset for Imbalance Analysis

1 code implementation14 Feb 2023 Wenke Xia, Xu Zhao, Xincheng Pang, Changqing Zhang, Di Hu

We surprisingly find that multimodal models with existing imbalance algorithms consistently perform worse than the unimodal ones on specific subsets, in accordance with the modality bias.

Revisiting Pre-training in Audio-Visual Learning

1 code implementation7 Feb 2023 Ruoxuan Feng, Wenke Xia, Di Hu

Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning.

audio-visual learning

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

1 code implementation14 Jan 2023 Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu

Experimental results indicate that the models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall.

Knowledge Graphs

A Closer Look at Novel Class Discovery from the Labeled Set

no code implementations19 Sep 2022 Ziyun Li, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang

Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset leveraging prior knowledge of a labeled set comprising disjoint but related classes.

Novel Class Discovery Semantic Similarity +1

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

no code implementations20 Aug 2022 Yake Wei, Di Hu, Yapeng Tian, Xuelong Li

A comprehensive survey that systematically organizes and analyzes studies of the audio-visual field is therefore expected.

audio-visual learning Scene Understanding

Balanced Multimodal Learning via On-the-fly Gradient Modulation

1 code implementation CVPR 2022 Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, Di Hu

Multimodal learning helps to comprehensively understand the world, by integrating different senses.

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

1 code implementation CVPR 2022 Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.

audio-visual learning Audio-visual Question Answering +4

SeCo: Separating Unknown Musical Visual Sounds with Consistency Guidance

no code implementations25 Mar 2022 Xinchi Zhou, Dongzhan Zhou, Wanli Ouyang, Hang Zhou, Ziwei Liu, Di Hu

Recent years have witnessed the success of deep learning on the visual sound separation task.

Towards Inadequately Pre-trained Models in Transfer Learning

no code implementations ICCV 2023 Andong Deng, Xingjian Li, Di Hu, Tianyang Wang, Haoyi Xiong, Chengzhong Xu

Based on the contradictory phenomenon between feature extraction (FE) and fine-tuning (FT), in which a better feature extractor does not necessarily fine-tune better, we conduct comprehensive analyses of features before the softmax layer to provide insightful explanations.

Transfer Learning

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

1 code implementation13 Feb 2022 Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou

Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals.

Class-aware Sounding Objects Localization via Audiovisual Correspondence

1 code implementation22 Dec 2021 Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision.

Object object-detection +3

Self-supervised Audiovisual Representation Learning for Remote Sensing Data

1 code implementation2 Aug 2021 Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery.

Cross-Modal Retrieval Representation Learning +1

Unsupervised Multi-Source Domain Adaptation for Person Re-Identification

1 code implementation CVPR 2021 Zechen Bai, Zhigang Wang, Jian Wang, Di Hu, Errui Ding

Although these methods achieve great success, most of them use only limited data from a single-source domain for model pre-training, leaving the rich labeled data insufficiently exploited.

Person Re-Identification Unsupervised Domain Adaptation

Model information as an analysis tool in deep learning

no code implementations1 Jan 2021 Xiao Zhang, Di Hu, Xingjian Li, Dejing Dou, Ji Wu

We demonstrate using model information as a general analysis tool to gain insight into problems that arise in deep learning.

Temporal Relational Modeling with Self-Supervision for Action Segmentation

1 code implementation14 Dec 2020 Dong Wang, Di Hu, Xingjian Li, Dejing Dou

The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos.

Action Recognition Action Segmentation +1

Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement

no code implementations16 Oct 2020 Xingjian Li, Di Hu, Xuhong LI, Haoyi Xiong, Zhi Ye, Zhipeng Wang, Chengzhong Xu, Dejing Dou

Fine-tuning deep neural networks pre-trained on large-scale datasets is one of the most practical transfer learning paradigms given a limited quantity of training samples.

Disentanglement Transfer Learning

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

1 code implementation NeurIPS 2020 Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes.

Object Object Localization

Multiple Sound Sources Localization from Coarse to Fine

1 code implementation ECCV 2020 Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin

How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are lacking.

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

1 code implementation ECCV 2020 Di Hu, Xuhong LI, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, Dejing Dou

With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for aerial scene recognition.

Scene Recognition

Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions

1 code implementation14 May 2020 Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiao Xiang Zhu

Visual crowd counting has recently been studied as a way to enable people counting in crowd scenes from images.

Crowd Counting

Curriculum Audiovisual Learning

no code implementations26 Jan 2020 Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, Dejing Dou

Associating sound and its producer in a complex audiovisual scene is a challenging task, especially when we lack annotated training data.

Clustering

Listen to the Image

no code implementations CVPR 2019 Di Hu, Dong Wang, Xuelong Li, Feiping Nie, Qi Wang

Experiments with different encoding schemes indicate that using a machine model to accelerate optimization evaluation and reduce experimental cost is feasible to some extent, which could dramatically promote the upgrading of encoding schemes and thus help the blind improve their visual perception ability.

Translation

Deep LDA Hashing

no code implementations8 Oct 2018 Di Hu, Feiping Nie, Xuelong Li

The conventional supervised hashing methods based on classification do not entirely meet the requirements of the hashing technique, but Linear Discriminant Analysis (LDA) does.

Dense Multimodal Fusion for Hierarchically Joint Representation

no code implementations8 Oct 2018 Di Hu, Feiping Nie, Xuelong Li

Hence, learning an effective joint representation by fusing the features of different modalities is highly desirable.

Cross-Modal Retrieval Retrieval +2

Deep Multimodal Clustering for Unsupervised Audiovisual Learning

1 code implementation CVPR 2019 Di Hu, Feiping Nie, Xuelong Li

Such an integrated multimodal clustering network can be effectively trained with a max-margin loss in an end-to-end fashion.

Clustering

Image2song: Song Retrieval via Bridging Image Content and Lyric Words

no code implementations ICCV 2017 Xuelong Li, Di Hu, Xiaoqiang Lu

Images are usually taken to express certain emotions or purposes, such as love or celebrating Christmas.

Retrieval TAG

Deep Binary Reconstruction for Cross-modal Hashing

1 code implementation17 Aug 2017 Xuelong Li, Di Hu, Feiping Nie

Based on the analysis, we provide a so-called Deep Binary Reconstruction (DBRC) network that can directly learn the binary hashing codes in an unsupervised fashion.

Cross-Modal Retrieval Retrieval

Temporal Multimodal Learning in Audiovisual Speech Recognition

no code implementations CVPR 2016 Di Hu, Xuelong Li, Xiaoqiang Lu

Recently, audiovisual speech recognition based on the MRBM has attracted much attention, and the MRBM shows its effectiveness in learning the joint representation across audiovisual modalities.

Multimodal Deep Learning speech-recognition +1
