Search Results for author: Mu Yang

Found 18 papers, 9 papers with code

Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition

no code implementations • 6 Jun 2025 • Mu Yang, Szu-Jui Chen, Jiamin Xie, John Hansen

One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +2
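The snippet above doesn't say how the soft discretization is implemented, but the general idea behind "softly discretizing" a continuous representation is usually realized as a softmax-weighted mixture of vectors from a discrete codebook (e.g. the LLM's token embedding table). A minimal NumPy sketch of that generic technique — the function name, shapes, and temperature parameter are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def soft_discretize(features, codebook, temperature=1.0):
    """Represent continuous frames as soft mixtures of codebook vectors.

    features: (T, D) continuous audio frames
    codebook: (V, D) discrete embedding table (e.g. LLM token embeddings)
    Returns a (T, D) array: each frame becomes a softmax-weighted
    combination of codebook rows, so it lives in the discrete embedding
    space while remaining differentiable.
    """
    logits = features @ codebook.T / temperature   # (T, V) similarity scores
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the vocab
    return weights @ codebook                      # (T, D) "soft tokens"

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))     # 5 frames, 8-dim features
codebook = rng.normal(size=(16, 8))  # 16 codebook entries
soft_tokens = soft_discretize(frames, codebook, temperature=0.5)
print(soft_tokens.shape)  # (5, 8)
```

As the temperature approaches zero the softmax approaches one-hot, recovering hard (fully discrete) token selection; larger temperatures keep the representation smooth.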

Autoregressive Meta-Actions for Unified Controllable Trajectory Generation

no code implementations • 29 May 2025 • Jianbo Zhao, Taiyu Ban, Xiyang Wang, Qibin Zhou, Hangning Zhou, Zhihao Liu, Mu Yang, Lei Liu, Bin Li

Controllable trajectory generation guided by high-level semantic decisions, termed meta-actions, is crucial for autonomous driving systems.

Autonomous Driving · Trajectory Prediction

DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling

no code implementations • 19 Mar 2025 • Jianbo Zhao, Taiyu Ban, Zhihao Liu, Hangning Zhou, Xiyang Wang, Qibin Zhou, Hailong Qin, Mu Yang, Lei Liu, Bin Li

We theoretically analyze DRoPE's correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity.

Autonomous Driving · Position

UniScene: Unified Occupancy-centric Driving Scene Generation

no code implementations • CVPR 2025 • Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, Xin Jin

UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, with two novel transfer strategies of Gaussian-based Joint Rendering and Prior-guided Sparse Modeling.

Autonomous Driving · Scene Generation

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

no code implementations • 7 Nov 2024 • Mu Yang, Bowen Shi, Matthew Le, Wei-Ning Hsu, Andros Tjandra

This work focuses on improving Text-To-Audio (TTA) generation in zero-shot and few-shot settings (i.e., generating unseen or uncommon audio events).

Audio Generation · RAG +2
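As background, the generic retrieval-augmented generation step is: embed the query, score it against a database of candidate items, and pass the top-k matches to the generator as extra conditioning. A toy cosine-similarity retriever sketching that first step — the names, shapes, and example entries are illustrative assumptions, not Audiobox's actual pipeline:

```python
import numpy as np

def retrieve_top_k(query_emb, db_embs, k=3):
    """Cosine-similarity retrieval: indices of the k closest database items."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = db @ q                 # cosine similarity per database row
    return np.argsort(-scores)[:k]  # best-scoring indices first

# Toy database of already-embedded items (e.g. audio/caption pairs).
db = np.array([[1.0, 0.0, 0.0],   # e.g. "dog barking"
               [0.0, 1.0, 0.0],   # e.g. "rain on a window"
               [0.0, 0.0, 1.0]])  # e.g. "glass shattering"
query = np.array([0.9, 0.1, 0.0])
print(retrieve_top_k(query, db, k=2))  # [0 1]
```

In a real RAG-based TTA system the retrieved items would then be encoded and fed to the audio generator as additional context rather than merely returned as indices.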

DiariST: Streaming Speech Translation with Speaker Diarization

1 code implementation • 14 Sep 2023 • Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka

End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word timestamps and handling of overlapping speech in a streaming fashion.

speaker-diarization · Speaker Diarization +3

What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model

no code implementations • 10 Jun 2023 • Mu Yang, Ram C. M. C. Shekar, Okim Kang, John H. L. Hansen

This study focuses on understanding and quantifying the change in phoneme and prosody information encoded in a Self-Supervised Learning (SSL) model that is brought about by fine-tuning on an accent identification (AID) task.

Automatic Speech Recognition · Prosody Prediction +3

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

2 code implementations • 29 Mar 2022 • Mu Yang, Kevin Hirschi, Stephen D. Looney, Okim Kang, John H. L. Hansen

We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and a 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline.

Phoneme Recognition · Pseudo Label +1
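Momentum pseudo-labeling, in the general sense the title refers to, pairs a student model trained on pseudo-labeled speech with a teacher whose weights are an exponential moving average (EMA) of the student's. A schematic NumPy sketch of those two ingredients — the parameter layout, the CTC-style greedy decoding, and the toy momentum value are illustrative assumptions, not this paper's exact recipe:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Momentum (EMA) update of teacher parameters from the student."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def greedy_pseudo_labels(frame_logits, blank_id=0):
    """Greedy CTC-style pseudo-labels: argmax per frame, collapse repeats,
    drop blanks. frame_logits: (T, V) array of teacher outputs."""
    labels, prev = [], None
    for p in frame_logits.argmax(axis=1):
        if p != prev and p != blank_id:
            labels.append(int(p))
        prev = p
    return labels

# Toy demo: under the EMA rule the teacher drifts toward the student.
teacher = {"w": np.zeros(4)}
student = {"w": np.ones(4)}
for _ in range(100):
    teacher = ema_update(teacher, student, momentum=0.99)
print(teacher["w"])  # ~0.634 in every coordinate: 1 - 0.99**100
```

The slowly moving teacher produces more stable pseudo-labels than the student itself would, which is the usual motivation for the momentum update.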

InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer

1 code implementation • 31 Dec 2021 • Chin-Tung Lin, Mu Yang

We conduct experiments with human evaluation on VMT, the Seq2Seq model (our baseline), and the original piano version soundtrack.

Music Generation

Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis

1 code implementation • 9 Oct 2021 • Mu Yang, Shaojin Ding, Tianlong Chen, Tong Wang, Zhangyang Wang

This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system, where each language is seen as an individual task that is learned sequentially and continually.

Lifelong learning · Speech Synthesis +3

EventPlus: A Temporal Event Understanding Pipeline

1 code implementation • NAACL 2021 • Mingyu Derek Ma, Jiao Sun, Mu Yang, Kung-Hsiang Huang, Nuan Wen, Shikhar Singh, Rujun Han, Nanyun Peng

We present EventPlus, a temporal event understanding pipeline that integrates various state-of-the-art event understanding components including event trigger and type detection, event argument detection, event duration and temporal relation extraction.

Common Sense Reasoning · Event Extraction +1

Biomedical Event Extraction with Hierarchical Knowledge Graphs

1 code implementation • Findings of the Association for Computational Linguistics 2020 • Kung-Hsiang Huang, Mu Yang, Nanyun Peng

To better recognize the trigger words, each sentence is first grounded to a sentence graph based on a jointly modeled hierarchical knowledge graph from UMLS.

Event Extraction · Language Modeling +1

Headword-Oriented Entity Linking: A Special Entity Linking Task with Dataset and Baseline

no code implementations • LREC 2020 • Mu Yang, Chi-Yen Chen, Yi-Hui Lee, Qian-hui Zeng, Wei-Yun Ma, Chen-Yang Shih, Wei-Jhih Chen

In this paper, we design headword-oriented entity linking (HEL), a specialized entity linking problem in which only the headwords of the entities are to be linked to knowledge bases; mention scopes of the entities do not need to be identified in the problem setting.

Articles · Entity Linking +1

Spoken Language Intent Detection using Confusion2Vec

1 code implementation • 7 Apr 2019 • Prashanth Gurunath Shivakumar, Mu Yang, Panayiotis Georgiou

In this paper, we address the spoken language intent detection under noisy conditions imposed by automatic speech recognition (ASR) systems.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +3

Deep Hybrid Scattering Image Learning

no code implementations • 19 Sep 2018 • Mu Yang, Zheng-Hao Liu, Ze-Di Cheng, Jin-Shi Xu, Chuan-Feng Li, Guang-Can Guo

A well-trained deep neural network is shown to be capable of simultaneously restoring two kinds of images, which are completely destroyed by two distinct scattering media, respectively.
