Search Results for author: JianHua Tao

Found 76 papers, 23 papers with code

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

no code implementations18 Mar 2025 Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, JianHua Tao, Tao Yu

To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios.

Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking

no code implementations4 Feb 2025 Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, JianHua Tao

Multimodal large language models (MLLMs) exhibit impressive capabilities but still face challenges in complex visual reasoning.

Computational Efficiency Multimodal Reasoning +1

DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

no code implementations29 Jan 2025 Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, JianHua Tao, Feihu Che

Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs.

Language Modeling Language Modelling

MTPareto: A MultiModal Targeted Pareto Framework for Fake News Detection

no code implementations12 Jan 2025 Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, JianHua Tao, Xuefei Liu, Guanjun Li

Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information.

Fake News Detection

Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio

no code implementations2 Dec 2024 Xinrui Yan, Jiangyan Yi, JianHua Tao, Yujie Chen, Hao Gu, Guanjun Li, Junzuo Zhou, Yong Ren, Tao Xu

To address the issues, we propose a novel framework for open set model attribution of deepfake audio with rejection threshold adaptation (ReTA).

Face Swapping

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

no code implementations27 Nov 2024 Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, JianHua Tao

In-context Learning (ICL) enables large language models (LLMs) to tackle downstream tasks through sophisticated prompting and high-quality demonstrations.

In-Context Learning Math +1

DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection

1 code implementation15 Oct 2024 Sheng Yan, Cunhang Fan, Hongyu Zhang, Xiaoke Yang, JianHua Tao, Zhao Lv

To address these issues, this paper proposes a dual attention refinement network with spatiotemporal construction for AAD, named DARNet, which consists of the spatiotemporal construction module, dual attention refinement module, and feature fusion & classifier module.

EEG
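
As a rough structural sketch of the three DARNet stages named above (not the released implementation linked in this entry), the modules could be chained as follows; layer sizes, attention heads, and the pooling choice are illustrative assumptions.

import torch
import torch.nn as nn

# Structural sketch only: each stage named in the abstract is reduced to a
# placeholder layer; see the linked implementation for the real DARNet.
class DARNetSketch(nn.Module):
    def __init__(self, channels=64, num_classes=2):
        super().__init__()
        self.spatiotemporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.dual_attention = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, eeg):                            # eeg: (batch, channels, time)
        x = self.spatiotemporal(eeg).transpose(1, 2)   # -> (batch, time, channels)
        x, _ = self.dual_attention(x, x, x)            # attention-based refinement
        return self.classifier(x.mean(dim=1))          # temporal pooling -> class logits

logits = DARNetSketch()(torch.randn(4, 64, 128))       # (4, 2)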

Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

no code implementations14 Sep 2024 Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, JianHua Tao, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang

Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references.

Audio Generation Style Transfer

Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models

no code implementations24 Aug 2024 Jinyang Wu, Feihu Che, Chuyuan Zhang, JianHua Tao, Shuai Zhang, Pengpeng Shao

Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs).

RAG Retrieval

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

no code implementations11 Aug 2024 Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, JianHua Tao

For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5
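
The "VQ" in VQ-CTAP points to a vector-quantization step over frame-level features; a minimal, generic nearest-codebook lookup is sketched below, with codebook size and feature dimension assumed for illustration rather than taken from the paper.

import torch

# Minimal nearest-codebook lookup; codebook size and feature dimension are
# illustrative assumptions, not values from the paper.
def vector_quantize(frames, codebook):
    # frames: (T, d) frame-level features, codebook: (K, d)
    idx = torch.cdist(frames, codebook).argmin(dim=-1)   # nearest code per frame
    return codebook[idx], idx

codebook = torch.randn(256, 128)
quantized, ids = vector_quantize(torch.randn(50, 128), codebook)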

ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild

no code implementations9 Aug 2024 Jiangyan Yi, Chu Yuan Zhang, JianHua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, Junzuo Zhou

The growing prominence of the field of audio deepfake detection is driven by its wide range of applications, notably in protecting the public from potential fraud and other malicious activities, prompting the need for greater attention and research in this area.

Audio Deepfake Detection Face Swapping

ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation

no code implementations7 Jul 2024 Ruibo Fu, Xin Qi, Zhengqi Wen, JianHua Tao, Tao Wang, Chunyu Qiang, Zhiyong Wang, Yi Lu, Xiaopeng Wang, Shuchen Shi, Yukun Liu, Xuefei Liu, Shuai Zhang

The results indicate that the ASRRL method significantly outperforms traditional fine-tuning approaches, achieving higher speaker similarity and better overall speech quality with limited reference speeches.

Sentence text-to-speech +1

Fake News Detection and Manipulation Reasoning via Large Vision-Language Models

no code implementations2 Jul 2024 Ruihan Jin, Ruibo Fu, Zhengqi Wen, Shuai Zhang, Yukun Liu, JianHua Tao

To support the research, we introduce a benchmark for fake news detection and manipulation reasoning, referred to as Human-centric and Fact-related Fake News (HFFN).

Binary Classification Fake News Detection +2

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

1 code implementation15 Jun 2024 Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, JianHua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

Despite advancements in AIGC technologies for text and image generation, Foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation.

AudioCaps Image Generation

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

no code implementations12 Jun 2024 Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, JianHua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform.

Audio Deepfake Detection Audio Generation +4

Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy

no code implementations5 Jun 2024 Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, JianHua Tao

For effective OOD detection, we first explore current post-hoc OOD methods and propose NSD, a novel OOD approach that identifies novel deepfake algorithms by considering the similarity of both feature and logit scores.

Audio Deepfake Detection Face Swapping
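
As an illustrative sketch of combining feature-space and logit evidence into a single post-hoc OOD score, in the spirit of the NSD description above (the paper's exact scoring rule may differ), one could compute:

import torch
import torch.nn.functional as F

# Illustrative only: one way to combine feature-space similarity and logit
# confidence into a single score; the paper's actual NSD rule may differ.
def ood_score(feature, logits, class_prototypes):
    # feature: (d,), logits: (num_classes,), class_prototypes: (num_classes, d)
    feat_sim = F.cosine_similarity(feature.unsqueeze(0), class_prototypes, dim=-1).max()
    logit_conf = F.softmax(logits, dim=-1).max()
    return feat_sim * logit_conf   # low score -> likely a novel (OOD) algorithm

score = ood_score(torch.randn(128), torch.randn(5), torch.randn(5, 128))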

Can large language models understand uncommon meanings of common words?

no code implementations9 May 2024 Jinyang Wu, Feihu Che, Xinxin Zheng, Shuai Zhang, Ruihan Jin, Shuai Nie, Pengpeng Shao, JianHua Tao

Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents.

Natural Language Understanding

KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering

no code implementations24 Apr 2024 Xinxin Zheng, Feihu Che, Jinyang Wu, Shuai Zhang, Shuai Nie, Kang Liu, JianHua Tao

Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks.

Hallucination Question Answering +2

Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion

1 code implementation19 Jan 2024 Cunhang Fan, Yujie Chen, Jun Xue, Yonghui Kong, JianHua Tao, Zhao Lv

This paper proposes a progressive distillation method based on masked generation features for the KGC task, aiming to significantly reduce the complexity of pre-trained models.

Knowledge Graph Completion Language Modelling +1

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

1 code implementation11 Jan 2024 Licai Sun, Zheng Lian, Bin Liu, JianHua Tao

Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines.

Contrastive Learning Dynamic Facial Expression Recognition +3

SVFAP: Self-supervised Video Facial Affect Perceiver

1 code implementation31 Dec 2023 Licai Sun, Zheng Lian, Kexin Wang, Yu He, Mingyu Xu, Haiyang Sun, Bin Liu, JianHua Tao

Video-based facial affect analysis has recently attracted increasing attention owing to its critical role in human-computer interaction.

Dynamic Facial Expression Recognition Emotion Recognition +2

What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

1 code implementation15 Dec 2023 Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, JianHua Tao

The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms.

Audio Deepfake Detection Continual Learning +3

GPT-4V with Emotion: A Zero-shot Benchmark for Generalized Emotion Recognition

1 code implementation7 Dec 2023 Zheng Lian, Licai Sun, Haiyang Sun, Kang Chen, Zhuofan Wen, Hao Gu, Bin Liu, JianHua Tao

To bridge this gap, we present the quantitative evaluation results of GPT-4V on 21 benchmark datasets covering 6 tasks: visual sentiment analysis, tweet sentiment analysis, micro-expression recognition, facial emotion recognition, dynamic facial emotion recognition, and multimodal emotion recognition.

Facial Emotion Recognition Micro Expression Recognition +3

DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection

no code implementations7 Sep 2023 Cunhang Fan, Hongyu Zhang, Wei Huang, Jun Xue, JianHua Tao, Jiangyan Yi, Zhao Lv, Xiaopei Wu

Specifically, to effectively represent the non-Euclidean properties of EEG signals, dynamical graph convolutional networks are applied to represent the graph structure of EEG signals, which can also extract crucial features related to auditory spatial attention in EEG signals.

EEG
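
For background on the graph-convolution step mentioned above, a single generic graph convolution over EEG channels (normalized adjacency times features times weights) is sketched below; DGSD's dynamical adjacency learning and self-distillation are not reproduced.

import torch

# Generic graph-convolution step over EEG channels (normalized adjacency x
# features x weights); DGSD's dynamical adjacency and self-distillation are
# not reproduced here.
def graph_conv(X, A, W):
    # X: (channels, features), A: (channels, channels), W: (features, out_dim)
    A_hat = A + torch.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W

out = graph_conv(torch.randn(64, 32), torch.rand(64, 64), torch.randn(32, 16))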

Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

1 code implementation7 Aug 2023 Xiaohui Zhang, Jiangyan Yi, JianHua Tao, Chenglong Wang, Chuyuan Zhang

The orthogonal weight modification to overcome catastrophic forgetting does not consider the similarity of genuine audio across different datasets.

Continual Learning Speech Emotion Recognition

Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

no code implementations9 Jun 2023 Haogeng Liu, Tao Wang, Jie Cao, Ran He, JianHua Tao

When decreasing the number of sampling steps (i.e., the number of line segments used to fit the path), the ease of fitting straight lines compared to curves allows us to generate higher-quality samples from random noise with fewer iterations.

Denoising Speech Synthesis
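
The "straight line" intuition above can be illustrated with a toy few-step Euler sampler along a linear path; the velocity model below is a dummy placeholder and the paper's actual linear-diffusion formulation may differ in detail.

import torch

# Toy few-step sampler that Euler-integrates along a straight path from noise
# toward data; the velocity model is a dummy placeholder.
def sample(velocity_model, x_noise, num_steps=4):
    x, dt = x_noise, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * velocity_model(x, t)      # one straight-line segment per step
    return x

x0 = torch.randn(8, 80)                        # e.g. mel-spectrogram frames
out = sample(lambda x, t: -x, x0)              # dummy velocity field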

Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

no code implementations9 Jun 2023 Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, JianHua Tao, Le Xu, Ruibo Fu

Self-supervised speech models are a rapidly developing research topic in fake audio detection.

Adaptive Fake Audio Detection with Low-Rank Model Squeezing

no code implementations8 Jun 2023 Xiaohui Zhang, Jiangyan Yi, JianHua Tao, Chenlong Wang, Le Xu, Ruibo Fu

During the inference stage, these adaptation matrices are combined with the existing model to generate the final prediction output.
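
A hedged sketch of the inference-time combination described above, written in the familiar LoRA style (rank, scaling, and shapes are assumptions rather than the paper's settings):

import torch

# Combine low-rank adaptation matrices with the base model weight at
# inference time, LoRA-style; rank, scale, and shapes are assumptions.
def merge_low_rank(base_weight, A, B, scale=1.0):
    # base_weight: (out_dim, in_dim), A: (r, in_dim), B: (out_dim, r)
    return base_weight + scale * (B @ A)

base = torch.randn(256, 512)
A, B = torch.randn(8, 512) * 0.01, torch.zeros(256, 8)
merged = merge_low_rank(base, A, B)            # equals base until B is trained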

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

no code implementations3 May 2023 Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, JianHua Tao, Jianqing Sun, Jiaen Liang

However, it is still a challenge to comprehensively model the conversation, and a majority of conversational TTS systems only focus on extracting global information and omit local prosody features, which contain important fine-grained information like keywords and emphasis.

Speech Synthesis text-to-speech +2

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

no code implementations10 Jan 2023 Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, JianHua Tao

Text-to-speech (TTS) and voice conversion (VC) are two different tasks, both aiming at generating high-quality speech from different input modalities.

Quantization text-to-speech +2

Emotion Selectable End-to-End Text-based Speech Editing

no code implementations20 Dec 2022 Tao Wang, Jiangyan Yi, Ruibo Fu, JianHua Tao, Zhengqi Wen, Chu Yuan Zhang

To achieve this task, we propose Emo-CampNet (emotion CampNet), which can provide the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech.

Data Augmentation

IRNet: Iterative Refinement Network for Noisy Partial Label Learning

1 code implementation9 Nov 2022 Zheng Lian, Mingyu Xu, Lan Chen, Licai Sun, Bin Liu, JianHua Tao

In this paper, we relax this assumption and focus on a more general problem, noisy PLL, where the ground-truth label may not exist in the candidate set.

Data Augmentation Partial Label Learning +1

Audio Deepfake Attribution: An Initial Dataset and Investigation

no code implementations21 Aug 2022 Xinrui Yan, Jiangyan Yi, JianHua Tao, Jie Chen

To address the challenges of attribution of continuously emerging unknown audio generation tools in the real world, we propose the Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA).

Audio Generation Binary Classification +2

Fully Automated End-to-End Fake Audio Detection

no code implementations20 Aug 2022 Chenglong Wang, Jiangyan Yi, JianHua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

The existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure.

Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

1 code implementation16 Aug 2022 Licai Sun, Zheng Lian, Bin Liu, JianHua Tao

With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.

Multimodal Sentiment Analysis Representation Learning

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

no code implementations2 Aug 2022 Jun Xue, Cunhang Fan, Zhao Lv, JianHua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao

Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary input features and model the disjoint subbands separately.

Audio Deepfake Detection Face Swapping
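
A minimal sketch of extracting the real and imaginary spectrogram features mentioned above and splitting them into disjoint subbands; FFT size, hop length, and the number of subbands are illustrative assumptions.

import torch

# Real and imaginary spectrogram features split into disjoint frequency
# subbands; parameter values are assumptions, not the paper's configuration.
wav = torch.randn(16000)                                  # 1 s of 16 kHz audio
spec = torch.stft(wav, n_fft=512, hop_length=160,
                  window=torch.hann_window(512), return_complex=True)
real_imag = torch.stack([spec.real, spec.imag])           # (2, freq_bins, frames)
subbands = torch.chunk(real_imag, 4, dim=1)               # disjoint subbands along frequency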

Two-Aspect Information Fusion Model For ABAW4 Multi-task Challenge

no code implementations23 Jul 2022 Haiyang Sun, Zheng Lian, Bin Liu, JianHua Tao, Licai Sun, Cong Cai

In this paper, we propose the solution to the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition.

Multi-Task Learning Vocal Bursts Valence Prediction

Adaptive Pseudo-Siamese Policy Network for Temporal Knowledge Prediction

no code implementations26 Apr 2022 Pengpeng Shao, Tong Liu, Feihu Che, Dawei Zhang, JianHua Tao

Specifically, we design the policy network in our model as a pseudo-siamese policy network that consists of two sub-policy networks.

Knowledge Graphs Link Prediction +2

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

no code implementations5 Mar 2022 Tao Wang, Ruibo Fu, Jiangyan Yi, JianHua Tao, Zhengqi Wen

We have also verified through experiments that this method can effectively control the noise components in the predicted speech and adjust the SNR of speech.

GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation

1 code implementation4 Mar 2022 Zheng Lian, Lan Chen, Licai Sun, Bin Liu, JianHua Tao

To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap of existing works.

Graph Neural Network

CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

3 code implementations21 Feb 2022 Tao Wang, Jiangyan Yi, Ruibo Fu, JianHua Tao, Zhengqi Wen

It can solve unnatural prosody in the edited region and synthesize the speech corresponding to the unseen words in the transcript.

Few-Shot Learning Sentence

MixKG: Mixing for harder negative samples in knowledge graph

no code implementations19 Feb 2022 Feihu Che, Guohua Yang, Pengpeng Shao, Dawei Zhang, JianHua Tao

The representations of entities and relations are learned via contrasting the positive and negative triplets.

Knowledge Graph Embedding Knowledge Graphs
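
For context on the contrastive objective mentioned above, the standard TransE-style margin loss over positive and negative triplets is sketched below; MixKG's specific mixing operation for building harder negatives is not reproduced here.

import torch
import torch.nn.functional as F

# Standard margin-based loss contrasting positive and negative triplets.
def margin_loss(h, r, t, h_neg, t_neg, margin=1.0):
    pos = torch.norm(h + r - t, p=2, dim=-1)
    neg = torch.norm(h_neg + r - t_neg, p=2, dim=-1)
    return F.relu(margin + pos - neg).mean()

loss = margin_loss(*[torch.randn(32, 64) for _ in range(5)])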

Reducing language context confusion for end-to-end code-switching automatic speech recognition

no code implementations28 Jan 2022 Shuai Zhang, Jiangyan Yi, Zhengkun Tian, JianHua Tao, Yu Ting Yeung, Liqun Deng

We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model based on the Equivalence Constraint (EC) Theory.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Knowledge graph enhanced recommender system

no code implementations17 Dec 2021 Zepeng Huai, JianHua Tao, Feihu Che, Guohua Yang, Dawei Zhang

This is attributed to the rich attribute information contained in the KG, which improves item and user representations as side information.

Attribute Graph Neural Network +1

Multi-Level Graph Contrastive Learning

no code implementations6 Jul 2021 Pengpeng Shao, Tong Liu, Dawei Zhang, JianHua Tao, Feihu Che, Guohua Yang

In this paper, we propose a Multi-Level Graph Contrastive Learning (MLGCL) framework for learning robust representation of graph data by contrasting space views of graphs.

Contrastive Learning Graph Representation Learning +3

FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization

no code implementations7 Apr 2021 Zhengkun Tian, Jiangyan Yi, Ye Bai, JianHua Tao, Shuai Zhang, Zhengqi Wen

It takes a lot of computation and time to predict the blank tokens, but only the non-blank tokens will appear in the final output sequence.

Decoder Position +2
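
A toy illustration of the blank-skipping idea described above: a cheap frame-level classifier drops frames that are almost surely blank before the expensive decoding step. The classifier, threshold, and shapes are assumptions, not the paper's exact FSR mechanism.

import torch

# Skip frames predicted as blank with high confidence before decoding.
def keep_non_blank_frames(encoder_out, blank_classifier, threshold=0.9):
    # encoder_out: (T, d); classifier outputs a blank probability per frame
    blank_prob = torch.sigmoid(blank_classifier(encoder_out)).squeeze(-1)
    keep = blank_prob < threshold
    return encoder_out[keep], keep

enc = torch.randn(100, 256)
kept, mask = keep_non_blank_frames(enc, torch.nn.Linear(256, 1))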

TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition

1 code implementation4 Apr 2021 Zhengkun Tian, Jiangyan Yi, JianHua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen, Xuefei Liu

To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model.

Decoder speech-recognition +2

Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning

no code implementations11 Nov 2020 Cunhang Fan, Bin Liu, JianHua Tao, Jiangyan Yi, Zhengqi Wen, Leichao Song

This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.

Speech Enhancement
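
A bare-bones TDNN can be written as a stack of dilated 1-D convolutions over spectral frames; the layer sizes and dilations below are illustrative assumptions rather than the paper's configuration.

import torch
import torch.nn as nn

# Minimal TDNN: dilated 1-D convolutions over (batch, freq_bins, frames).
tdnn = nn.Sequential(
    nn.Conv1d(257, 256, kernel_size=5, dilation=1, padding=2), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
    nn.Conv1d(256, 257, kernel_size=3, dilation=4, padding=4),   # enhanced features
)
out = tdnn(torch.randn(1, 257, 100))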

Self-supervised Graph Representation Learning via Bootstrapping

no code implementations10 Nov 2020 Feihu Che, Guohua Yang, Dawei Zhang, JianHua Tao, Pengpeng Shao, Tong Liu

In addition, we summarize three kinds of augmentation methods for graph-structured data and apply them to the DGB.

Graph Representation Learning

Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition

no code implementations9 Nov 2020 Cunhang Fan, Jiangyan Yi, JianHua Tao, Zhengkun Tian, Bin Liu, Zhengqi Wen

Joint training frameworks for speech enhancement and recognition have obtained quite good performance for robust end-to-end automatic speech recognition (ASR).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2
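
A generic gated-fusion sketch of the idea in this entry: a learned gate decides, per dimension, how much to trust the enhanced features versus the original noisy ones. This is not the paper's exact gated recurrent fusion block.

import torch
import torch.nn as nn

# Gated fusion of enhanced and noisy features; dimensions are illustrative.
class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, enhanced, noisy):
        g = torch.sigmoid(self.gate(torch.cat([enhanced, noisy], dim=-1)))
        return g * enhanced + (1 - g) * noisy

fused = GatedFusion(80)(torch.randn(4, 100, 80), torch.randn(4, 100, 80))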

Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition

no code implementations28 Oct 2020 Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, JianHua Tao, Zhengqi Wen

In this paper, we propose a decoupled transformer model to use monolingual paired data and unpaired text data to alleviate the problem of code-switching data shortage.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition

no code implementations28 Oct 2020 Zhengkun Tian, Jiangyan Yi, Ye Bai, JianHua Tao, Shuai Zhang, Zhengqi Wen

Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder and the two-stage inference method into the streaming CTC model.

Decoder Diversity +4
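
The two-stage inference described above can be sketched as rescoring the streaming CTC candidates with an attention decoder and keeping the best-scoring one; the interpolation weight and the toy decoder scorer below are placeholders.

import torch

# Rescore CTC candidate sequences with a decoder log-probability and select
# the best one; weight and scorer are placeholder assumptions.
def rescore(candidates, ctc_scores, decoder_logprob_fn, weight=0.5):
    final = [weight * c + (1 - weight) * decoder_logprob_fn(seq)
             for seq, c in zip(candidates, ctc_scores)]
    return candidates[int(torch.tensor(final).argmax())]

best = rescore([[5, 9, 2], [5, 9]], [-3.2, -3.5], lambda s: -0.1 * len(s))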

Deep imitator: Handwriting calligraphy imitation via deep attention networks

no code implementations Pattern Recognition 2020 Bocheng Zhao, JianHua Tao, Minghao Yang, Zhengkun Tian, Cunhang Fan, Ye Bai

Calligraphy imitation (CI) from a handful of target handwriting samples is such a challenging task that most of the existing writing style analysis or handwriting generation methods do not exhibit satisfactory performance.

Deep Attention Handwriting generation
