no code implementations • 16 Jun 2025 • Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, Zhiwen Mo
We introduce xbench, a dynamic, profession-aligned evaluation suite designed to bridge the gap between AI agent capabilities and real-world productivity.
1 code implementation • CVPR 2025 • Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens.
no code implementations • 2 Mar 2025 • Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro
We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks.
no code implementations • 24 Nov 2024 • Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Large Audio Language Models (LALM) combine the audio perception models and the Large Language Models (LLM) and show a remarkable ability to reason about the input audio, infer the meaning, and understand the intent.
no code implementations • 29 Oct 2024 • Alexander H. Liu, Qirui Wang, Yuan Gong, James Glass
The discrete and low-frequency nature of neural codecs introduced a new way to generate speech with token-based models.
no code implementations • 26 Sep 2024 • Xin Hong, Yuan Gong, Vidhyasaharan Sethu, Ting Dang
Recent advancements in Large Language Models (LLMs) have demonstrated great success in many Natural Language Processing (NLP) tasks.
1 code implementation • 23 Sep 2024 • Yuanchao Li, Yuan Gong, Chao-Han Huck Yang, Peter Bell, Catherine Lai
Furthermore, we propose a Revise-Reason-Recognize prompting pipeline for robust LLM-based emotion recognition from spoken language with ASR errors.
Automatic Speech Recognition (ASR)
no code implementations • 15 Sep 2024 • Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model.
Automatic Speech Recognition (ASR)
1 code implementation • 4 Jul 2024 • Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
To the best of our knowledge, it is the first SSM that outperforms the Transformers on AudioSet and achieves an mAP of 48.9; and 2) We designed a new test called Audio Needle In A Haystack (Audio NIAH).
Ranked #29 on Audio Classification on AudioSet
1 code implementation • 26 Jun 2024 • Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass
Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches.
1 code implementation • 14 Jun 2024 • Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours.
Ranked #1 on Audio-Visual Speech Recognition on LRS2 (using extra training data)
Audio-Visual Speech Recognition
Automatic Speech Recognition (ASR)
1 code implementation • 9 Jan 2024 • Ziyue Huang, Mingming Zhang, Yuan Gong, Qingjie Liu, Yunhong Wang
Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks.
1 code implementation • 25 Sep 2023 • Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass
Humans are surrounded by audio signals that include both speech and non-speech sounds.
1 code implementation • 19 Sep 2023 • Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass
How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning?
no code implementations • ICCV 2023 • Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, Yujiu Yang
Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint.
1 code implementation • 13 Jul 2023 • Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
1 code implementation • 29 May 2023 • Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang
Accurate story visualization requires several necessary elements, such as identity consistency across frames, the alignment between plain text and visual content, and a reasonable layout of objects in images.
no code implementations • 24 May 2023 • Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass
Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information.
1 code implementation • 18 May 2023 • Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass
On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities.
Ranked #3 on Music Question Answering on MusicQA (using extra training data)
no code implementations • CVPR 2023 • Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, Yujiu Yang
It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion.
1 code implementation • CVPR 2023 • Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang
Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets.
1 code implementation • 2 Oct 2022 • Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities.
Ranked #1 on Audio Tagging on AudioSet (using extra training data)
1 code implementation • 22 Aug 2022 • Zhendong Yang, Zhe Li, Yuan Gong, Tianke Zhang, Shanshan Lao, Chun Yuan, Yu Li
Furthermore, we smooth students' target output to treat it as the soft target for training without teachers and propose a teacher-free new KD loss (tf-NKD).
1 code implementation • 29 Jul 2022 • Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass
Conventional audio-visual models have independent audio and video branches.
Ranked #2 on Multi-modal Classification on AudioSet (using extra training data)
1 code implementation • 6 May 2022 • Yuan Gong, Jin Yu, James Glass
Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring.
Ranked #1 on Audio Classification on VocalSound
1 code implementation • 6 May 2022 • Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass
Automatic pronunciation assessment is an important technology to help self-directed language learners.
Ranked #3 on Phone-level pronunciation scoring on speechocean762 (using extra training data)
Automatic Speech Recognition (ASR)
3 code implementations • 22 Apr 2022 • Shanshan Lao, Yuan Gong, Shuwei Shi, Sidi Yang, Tianhe Wu, Jiahao Wang, Weihao Xia, Yujiu Yang
Image quality assessment (IQA) algorithms aim to quantify the human perception of image quality.
Ranked #1 on Image Quality Assessment on MSU FR VQA Database
2 code implementations • 19 Apr 2022 • Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, Yujiu Yang
No-Reference Image Quality Assessment (NR-IQA) aims to assess the perceptual quality of images in accordance with human subjective perception.
Ranked #8 on Video Quality Assessment on MSU SR-QA Dataset
2 code implementations • 13 Mar 2022 • Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass
Audio classification is an active research area with a wide range of applications.
1 code implementation • CVPR 2022 • Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, Chun Yuan
Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.
3 code implementations • 19 Oct 2021 • Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass
However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST.
Ranked #1 on Spoken Command Recognition on Speech Command v2
5 code implementations • 5 Apr 2021 • Yuan Gong, Yu-An Chung, James Glass
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels.
Ranked #2 on Audio Classification on Speech Commands
1 code implementation • 2 Feb 2021 • Yuan Gong, Yu-An Chung, James Glass
Audio tagging is an active research area and has a wide range of applications.
Ranked #6 on Audio Classification on FSD50K (using extra training data)
2 code implementations • 18 Mar 2020 • Yuan Gong, Jian Yang, Christian Poellabauer
With the rapidly growing number of security-sensitive systems that use voice as the primary input, it becomes increasingly important to address these systems' potential vulnerability to replay attacks.
no code implementations • 31 Aug 2019 • Bryan (Ning) Xia, Yuan Gong, Yizhe Zhang, Christian Poellabauer
Recent efforts have shown promising results for person re-identification by designing part-based architectures to allow a neural network to learn discriminative representations from semantically coherent parts.
1 code implementation • 31 May 2019 • Yuan Gong, Boyang Li, Christian Poellabauer, Yiyu Shi
In recent years, many efforts have demonstrated that modern machine learning algorithms are vulnerable to adversarial attacks, where small, but carefully crafted, perturbations on the input can make them fail.
2 code implementations • 6 Apr 2019 • Yuan Gong, Jian Yang, Jacob Huber, Mitchell MacKnight, Christian Poellabauer
This paper introduces a new database of voice recordings with the goal of supporting research on vulnerabilities and protection of voice-controlled systems (VCSs).
no code implementations • 8 Aug 2018 • Yuan Gong, Christian Poellabauer
Learning disentangled representations of high-dimensional data is currently an active research area.
no code implementations • 28 Mar 2018 • Yuan Gong, Christian Poellabauer
Major depressive disorder is a common mental disorder that affects almost 7% of the adult U.S. population.
no code implementations • 24 Mar 2018 • Yuan Gong, Christian Poellabauer
These systems have been shown to be vulnerable to various types of voice spoofing attacks.
no code implementations • ICLR 2018 • Yuan Gong, Christian Poellabauer
Prior work on speech and audio processing has demonstrated the ability to obtain excellent performance when learning directly from raw audio waveforms using convolutional neural networks (CNNs).
no code implementations • 9 Nov 2017 • Yuan Gong, Christian Poellabauer
Computational paralinguistic analysis is increasingly being used in a wide range of cyber applications, including security-sensitive applications such as speaker verification, deceptive speech detection, and medical diagnostics.