Search Results for author: Yuan Gong

Found 42 papers, 27 papers with code

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

no code implementations • 2 Mar 2025 • Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks.

Decoder, Representation Learning, +6

State-Space Large Audio Language Models

no code implementations • 24 Nov 2024 • Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

Large Audio Language Models (LALMs) combine audio perception models with Large Language Models (LLMs) and show a remarkable ability to reason about the input audio, infer its meaning, and understand the intent.

State Space Models

A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation

no code implementations • 29 Oct 2024 • Alexander H. Liu, Qirui Wang, Yuan Gong, James Glass

The discrete and low-frequency nature of neural codecs introduced a new way to generate speech with token-based models.

Resynthesis

AER-LLM: Ambiguity-aware Emotion Recognition Leveraging Large Language Models

no code implementations • 26 Sep 2024 • Xin Hong, Yuan Gong, Vidhyasaharan Sethu, Ting Dang

Recent advancements in Large Language Models (LLMs) have demonstrated great success in many Natural Language Processing (NLP) tasks.

Emotional Intelligence, Emotion Recognition, +1

DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners

1 code implementation • 4 Jul 2024 • Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

To the best of our knowledge, it is the first SSM that outperforms the Transformers on AudioSet and achieves an mAP of 48.9; and 2) We designed a new test called Audio Needle In A Haystack (Audio NIAH).

Audio Classification, Audio Tagging, +3

Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

1 code implementation • 26 Jun 2024 • Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass

Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches.

Generic Knowledge Boosted Pre-training For Remote Sensing Images

1 code implementation • 9 Jan 2024 • Ziyue Huang, Mingming Zhang, Yuan Gong, Qingjie Liu, Yunhong Wang

Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks.

Change Detection, Deep Learning, +5

Joint Audio and Speech Understanding

1 code implementation • 25 Sep 2023 • Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

Humans are surrounded by audio signals that include both speech and non-speech sounds.

ToonTalker: Cross-Domain Face Reenactment

no code implementations • ICCV 2023 • Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang

Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint.

Face Reenactment, Talking Face Generation

TaleCrafter: Interactive Story Visualization with Multiple Characters

1 code implementation • 29 May 2023 • Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang

Accurate story visualization requires several necessary elements, such as identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images.

Layout Generation, Story Visualization, +2

SAIL: Search-Augmented Instruction Learning

no code implementations • 24 May 2023 • Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass

Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information.

Denoising, Fact Checking, +3

Listen, Think, and Understand

1 code implementation • 18 May 2023 • Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities.

Ranked #3 on Music Question Answering on MusicQA (using extra training data)

Language Modelling, Large Language Model, +1

3D GAN Inversion with Facial Symmetry Prior

no code implementations • CVPR 2023 • Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, Yujiu Yang

It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion.

3D geometry, Image Reconstruction, +1

Contrastive Audio-Visual Masked Autoencoder

1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.

Ranked #1 on Audio Tagging on AudioSet (using extra training data)

Audio Classification, Audio Tagging, +6

Rethinking Knowledge Distillation via Cross-Entropy

1 code implementation • 22 Aug 2022 • Zhendong Yang, Zhe Li, Yuan Gong, Tianke Zhang, Shanshan Lao, Chun Yuan, Yu Li

Furthermore, we smooth students' target output to treat it as the soft target for training without teachers and propose a teacher-free new KD loss (tf-NKD).

Knowledge Distillation

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

1 code implementation • 6 May 2022 • Yuan Gong, Jin Yu, James Glass

Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring.

Audio Classification

MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment

2 code implementations • 19 Apr 2022 • Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, Yujiu Yang

No-Reference Image Quality Assessment (NR-IQA) aims to assess the perceptual quality of images in accordance with human subjective perception.

Focal and Global Knowledge Distillation for Detectors

1 code implementation • CVPR 2022 • Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, Chun Yuan

Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.

Image Classification, +3

SSAST: Self-Supervised Audio Spectrogram Transformer

3 code implementations • 19 Oct 2021 • Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST.

Audio Classification, Emotion Recognition, +4

AST: Audio Spectrogram Transformer

5 code implementations • 5 Apr 2021 • Yuan Gong, Yu-An Chung, James Glass

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels.

Audio Classification, Audio Tagging, +4

Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method

2 code implementations • 18 Mar 2020 • Yuan Gong, Jian Yang, Christian Poellabauer

With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks.

Second-order Non-local Attention Networks for Person Re-identification

no code implementations • 31 Aug 2019 • Bryan (Ning) Xia, Yuan Gong, Yizhe Zhang, Christian Poellabauer

Recent efforts have shown promising results for person re-identification by designing part-based architectures to allow a neural network to learn discriminative representations from semantically coherent parts.

Person Re-Identification

Real-Time Adversarial Attacks

1 code implementation • 31 May 2019 • Yuan Gong, Boyang Li, Christian Poellabauer, Yiyu Shi

In recent years, many efforts have demonstrated that modern machine learning algorithms are vulnerable to adversarial attacks, where small, but carefully crafted, perturbations on the input can make them fail.

Adversarial Attack, BIG-bench Machine Learning

ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems

2 code implementations • 6 Apr 2019 • Yuan Gong, Jian Yang, Jacob Huber, Mitchell MacKnight, Christian Poellabauer

This paper introduces a new database of voice recordings with the goal of supporting research on vulnerabilities and protection of voice-controlled systems (VCSs).

Voice Anti-spoofing

Topic Modeling Based Multi-modal Depression Detection

no code implementations • 28 Mar 2018 • Yuan Gong, Christian Poellabauer

Major depressive disorder is a common mental disorder that affects almost 7% of the adult U.S. population.

Depression Detection

An Overview of Vulnerabilities of Voice Controlled Systems

no code implementations • 24 Mar 2018 • Yuan Gong, Christian Poellabauer

These systems have been shown to be vulnerable to various types of voice spoofing attacks.

General Classification

How do deep convolutional neural networks learn from raw audio waveforms?

no code implementations • ICLR 2018 • Yuan Gong, Christian Poellabauer

Prior work on speech and audio processing has demonstrated the ability to obtain excellent performance when learning directly from raw audio waveforms using convolutional neural networks (CNNs).

Crafting Adversarial Examples For Speech Paralinguistics Applications

no code implementations • 9 Nov 2017 • Yuan Gong, Christian Poellabauer

Computational paralinguistic analysis is increasingly being used in a wide range of cyber applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnostics.

Medical Diagnosis, Speaker Verification
