Search Results for author: Zhijie Yan

Found 27 papers, 15 papers with code

OmniAudio: Generating Spatial Audio from 360-Degree Video

1 code implementation • 21 Apr 2025 • Huadai Liu, Tianyi Luo, Kaicheng Luo, Qikai Jiang, Peiwen Sun, Jialei Wang, Rongjie Huang, Qian Chen, Wen Wang, Xiangtai Li, Shiliang Zhang, Zhijie Yan, Zhou Zhao, Wei Xue

To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data.

Audio Generation
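The FOA (first-order ambisonics) format mentioned in the abstract can be illustrated with a minimal encoder sketch. This is not the paper's method, just the standard B-format math for placing a mono source at a given direction; the traditional 1/√2 scaling of the W channel is an assumption of that convention:

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics (FOA) B-format.

    Traditional B-format convention: W is the omni component (scaled by
    1/sqrt(2)); X, Y, Z are figure-of-eight components along the front,
    left, and up axes.
    """
    w = mono / np.sqrt(2.0)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    return np.stack([w, x, y, z])

# A 1 kHz tone placed directly in front of the listener (azimuth 0, elevation 0).
t = np.linspace(0, 0.01, 160, endpoint=False)
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth=0.0, elevation=0.0)
```

For a front-facing source, the left (Y) and up (Z) channels are zero, so all spatial energy sits in W and X.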

SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer

no code implementations • 16 Feb 2025 • Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan, Yexin Yang, ZhenHua Ling

This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech, facilitating seamless interaction with large language models.

Text-to-Speech

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

1 code implementation • 13 Dec 2024 • Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou

By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode.

In-Context Learning • Quantization +1

Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study

no code implementations • 26 Sep 2024 • Keyu An, Shiliang Zhang, Zhijie Yan

In this study, we delve into the efficacy of transformers within pre-trained language models (PLMs) when repurposed as encoders for Automatic Speech Recognition (ASR).

Automatic Speech Recognition (ASR) +1

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

no code implementations • 7 Jul 2024 • Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.

Language Modelling • Large Language Model +7

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

1 code implementation • 28 Mar 2024 • Bu Jin, Yupeng Zheng, Pengfei Li, Weize Li, Yuhang Zheng, Sujie Hu, Xinyu Liu, Jinwei Zhu, Zhijie Yan, Haiyang Sun, Kun Zhan, Peng Jia, Xiaoxiao Long, Yilun Chen, Hao Zhao

However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.

3D Dense Captioning • Dense Captioning

Large Language Models Powered Context-aware Motion Prediction in Autonomous Driving

2 code implementations • 17 Mar 2024 • Xiaoji Zheng, Lixiu Wu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, Jiangtao Gong

Traditional methods of motion forecasting primarily encode vector information of maps and historical trajectory data of traffic participants, lacking a comprehensive understanding of overall traffic semantics, which in turn affects the performance of prediction tasks.

Motion Forecasting • Motion Prediction +2

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

2 code implementations • 7 Oct 2023 • Zhihao Du, JiaMing Wang, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement over models using continuous speech features.

Audio Captioning • Automatic Speech Recognition +15

Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

no code implementations • 18 May 2023 • Xian Shi, Haoneng Luo, Zhifu Gao, Shiliang Zhang, Zhijie Yan

Estimating confidence scores for recognition results is a classic task in the ASR field and is of vital importance for various downstream tasks and training strategies.

Speech Recognition
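A common baseline that the abstract's task builds on is deriving token-level confidence from the model's own posteriors, e.g. the maximum softmax probability per output position. This is a hedged sketch of that baseline only; the paper instead trains a dedicated confidence estimator on top of a non-autoregressive system:

```python
import numpy as np

def softmax(logits, axis=-1):
    # numerically stable softmax
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_confidences(logits):
    """Max-posterior confidence per output token: a standard baseline,
    not the paper's learned confidence estimation module."""
    return softmax(logits).max(axis=-1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))  # 5 output tokens, 100-way vocabulary
conf = token_confidences(logits)
```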

MUG: A General Meeting Understanding and Generation Benchmark

1 code implementation • 24 Mar 2023 • Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

To prompt SLP advancement, we establish a large-scale general Meeting Understanding and Generation Benchmark (MUG) to benchmark the performance of a wide range of SLP tasks, including topic segmentation, topic-level and session-level extractive summarization and topic title generation, keyphrase extraction, and action item detection.

Extractive Summarization • Keyphrase Extraction +1

Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

no code implementations • 24 Mar 2023 • Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

The ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings.

Extractive Summarization • Keyphrase Extraction

Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

1 code implementation • 29 Jan 2023 • Xian Shi, Yanni Chen, Shiliang Zhang, Zhijie Yan

Conventional ASR systems use frame-level phoneme posteriors to perform force alignment (FA) and provide timestamps, while end-to-end ASR systems, especially AED-based ones, lack this ability.
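Deriving timestamps from frame-level predictions can be sketched with a CTC-style collapse: merge repeated frame labels into one segment and drop blank frames. This is an illustrative simplification (the frame shift of 10 ms and the blank symbol are assumptions), not the paper's timestamp predictor:

```python
def frames_to_timestamps(frame_labels, frame_shift=0.01, blank="<blk>"):
    """Collapse frame-level labels into (token, start_sec, end_sec)
    segments, CTC-style: merge repeats, drop blank frames."""
    segments = []
    prev = None
    for i, lab in enumerate(frame_labels):
        if lab == blank:
            prev = None
            continue
        if lab == prev:
            # extend the current segment to cover this frame
            tok, start, _ = segments[-1]
            segments[-1] = (tok, start, (i + 1) * frame_shift)
        else:
            segments.append((lab, i * frame_shift, (i + 1) * frame_shift))
        prev = lab
    return segments

segs = frames_to_timestamps(["<blk>", "n", "n", "i", "<blk>", "h", "h", "h"])
```

Here "n" spans frames 1-2 (10-30 ms), "i" frame 3, and "h" frames 5-7.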

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

1 code implementation • 29 Nov 2022 • Xiaohuan Zhou, JiaMing Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, Chang Zhou

Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text.

Automatic Speech Recognition (ASR) +4

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

2 code implementations • 16 Jun 2022 • Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan

However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus.

Decoder • Language Modelling +2
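The independence assumption the abstract refers to can be made concrete: single-step NAR decoding predicts every output position in parallel with an independent argmax, so no position conditions on its neighbours. A minimal sketch (not Paraformer itself, which adds a glancing-style sampler and predictor on top):

```python
import numpy as np

def nar_decode(logits):
    """Single-step non-autoregressive decoding: every output position is
    chosen in parallel by an independent argmax over the vocabulary, so
    tokens cannot condition on each other (the independence assumption)."""
    return logits.argmax(axis=-1)

rng = np.random.default_rng(1)
logits = rng.normal(size=(6, 30))  # 6 output positions, 30-way vocabulary
tokens = nar_decode(logits)
```

An autoregressive model would instead decode left to right, feeding each chosen token back in before predicting the next one.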

Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

1 code implementation • 18 Mar 2022 • Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings.

Action Detection • Activity Detection +3
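The similarity-scoring idea in the abstract can be sketched with plain cosine similarity between frame features and speaker embeddings, thresholded into per-frame, per-speaker activity. This is a simplified stand-in for SEND's learned scorers and post-processing network; the 0.5 threshold is an assumption:

```python
import numpy as np

def speaker_activity(frames, spk_embs, threshold=0.5):
    """Score each frame against each speaker embedding with cosine
    similarity and threshold into activity labels. More than one speaker
    may be active in a frame, which models overlapping speech."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = spk_embs / np.linalg.norm(spk_embs, axis=1, keepdims=True)
    sim = f @ s.T  # (num_frames, num_speakers)
    return sim > threshold

rng = np.random.default_rng(2)
spk = rng.normal(size=(2, 16))                          # two speaker embeddings
frames = np.vstack([spk[0], spk[1], spk[0] + spk[1]])   # third frame overlaps
labels = speaker_activity(frames, spk)
```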

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

no code implementations • 16 Feb 2022 • Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV).

Text-to-Speech
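The quantization behind a latent prosody vector (LPV) boils down to replacing each continuous vector with its nearest codebook entry. A minimal vector-quantization sketch, with a toy 2-D codebook rather than anything from the paper:

```python
import numpy as np

def quantize(vectors, codebook):
    """Replace each vector with its nearest codebook entry (Euclidean
    distance): the basic operation behind a quantized prosody latent."""
    # squared distances between every vector and every code
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
vecs = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, idx = quantize(vecs, codebook)
```

In training, the codebook itself is learned jointly with the encoder; here it is fixed for illustration.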

BeamTransformer: Microphone Array-based Overlapping Speech Detection

no code implementations • 9 Sep 2021 • Siqi Zheng, Shiliang Zhang, Weilong Huang, Qian Chen, Hongbin Suo, Ming Lei, Jinwei Feng, Zhijie Yan

We propose BeamTransformer, an efficient architecture to leverage beamformer's edge in spatial filtering and transformer's capability in context sequence modeling.

A Real-time Speaker Diarization System Based on Spatial Spectrum

no code implementations • 20 Jul 2021 • Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan

In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting.

Speaker Diarization +1

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

1 code implementation • 21 May 2020 • Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie

Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained increasing attention.

Sound • Audio and Speech Processing
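The chunk-aware idea behind streaming attention can be sketched as an attention mask: full attention inside a chunk, causal attention across chunks, which bounds streaming latency to one chunk of frames. This is a generic chunked mask, assumed for illustration rather than taken from the paper:

```python
import numpy as np

def chunk_attention_mask(seq_len, chunk_size):
    """Boolean mask where position i may attend to position j only if
    j's chunk is not later than i's chunk: attention is unrestricted
    inside a chunk and causal across chunks."""
    chunks = np.arange(seq_len) // chunk_size
    return chunks[None, :] <= chunks[:, None]

mask = chunk_attention_mask(seq_len=6, chunk_size=2)
```

With `chunk_size=2`, position 0 can attend to position 1 (same chunk) but not to position 2 (a later chunk), while the last position sees the whole sequence.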

Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

no code implementations • 3 Oct 2019 • Kai Fan, Jiayi Wang, Bo Li, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhijie Yan

The performance of automatic speech recognition (ASR) systems is usually evaluated by the word error rate (WER) metric when manually transcribed data are provided; such transcripts are, however, expensive to obtain in real-world scenarios.

Automatic Speech Recognition (ASR) +5
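The WER metric the abstract mentions is the word-level edit distance between reference and hypothesis, normalized by the reference length. A self-contained implementation of that standard definition (the quality-estimation model in the paper predicts WER without a reference; this only shows how WER itself is computed):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[-1][-1] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3.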

Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition

no code implementations • 27 Mar 2019 • Shiliang Zhang, Ming Lei, Zhijie Yan

Results on a 20,000-hour Mandarin speech recognition task show that the proposed spelling correction model achieves a CER of 3.41%, a 22.9% and 53.2% relative improvement over the baseline CTC-based systems decoded with and without a language model, respectively.

Decoder • Language Modeling +6
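The relative-improvement figures above follow from simple arithmetic: a relative gain of r over a baseline CER b gives a new CER of b·(1 − r), so the baselines can be backed out from the reported numbers. A quick check (the derived baseline CERs are implied values, not figures stated in the abstract):

```python
def relative_improvement(baseline_cer, new_cer):
    """Relative error-rate reduction: (baseline - new) / baseline."""
    return (baseline_cer - new_cer) / baseline_cer

# Back out the implied baseline CERs from the reported numbers:
# a 22.9% relative gain down to 3.41% CER implies the with-LM baseline
# was about 3.41 / (1 - 0.229), and analogously for the 53.2% gain.
baseline_with_lm = 3.41 / (1 - 0.229)
baseline_without_lm = 3.41 / (1 - 0.532)
```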

Deep-FSMN for Large Vocabulary Continuous Speech Recognition

1 code implementation • 4 Mar 2018 • Shiliang Zhang, Ming Lei, Zhijie Yan, Li-Rong Dai

In a 20,000-hour Mandarin recognition task, the LFR-trained DFSMN achieves more than 20% relative improvement over the LFR-trained BLSTM.

Language Modelling +2
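The core of an FSMN is its memory block: each frame's output adds a weighted sum of a few past frames' hidden activations, giving RNN-like history without recurrence. A heavily simplified sketch with uniform tap weights (in a real FSMN the taps are learned, often per-dimension, and bidirectional variants also look ahead):

```python
import numpy as np

def fsmn_memory(hidden, order=2):
    """Simplified FSMN memory block: each frame's output is its own
    hidden activation plus a fixed-weight sum of the previous `order`
    frames, so context is captured without any recurrent connection."""
    out = hidden.copy()
    for tap in range(1, order + 1):
        out[tap:] += hidden[:-tap] / order  # uniform tap weights (learned in practice)
    return out

rng = np.random.default_rng(3)
h = rng.normal(size=(5, 4))  # 5 frames, 4 hidden dims
m = fsmn_memory(h)
```

Because the memory is a finite tapped delay line, the whole layer stays feed-forward and parallelizable, which is what makes DFSMN-style models fast to train on large corpora.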

Deep Feed-forward Sequential Memory Networks for Speech Synthesis

no code implementations • 26 Feb 2018 • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, Zhijie Yan

The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody.

Speech Recognition +3
